SPUG: Regex question
Chris Wilkes
cwilkes-spug at ladro.com
Wed Apr 21 20:31:44 CDT 2004
On Wed, Apr 21, 2004 at 02:16:08PM -0400, Marc M. Adkins wrote:
> Let's say I'm parsing HTML using regular expressions and I want to find:
>
> <tag> ... </tag>
>
> where the text within the tag body does _not_ contain the word 'Alpo'.
>
> I've seen the following as a solution:
>
> $text =~ m|<tag>(?:(?!Alpo).)*</tag>|;
>
> This seems really compute-intensive. Is there a better way?
I would check out HTML::Parser to do this job for you. In particular,
HTML::LinkExtor pulls out links from HTML documents; just modify the
code to pull out your <tag>.
Then you can loop through all the matches and pull out the non-Alpos
with a simple regexp.
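
As a rough sketch of that two-step approach (collect the tag bodies first, then filter), something like the following might work. It uses HTML::Parser's version 3 handler API directly rather than modifying HTML::LinkExtor. The helper name tag_bodies and the sample input are made up for illustration, and it assumes <tag> elements are never nested and always have a closing </tag>.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: return the text body of every <tag>...</tag> element.
# Assumes <tag> elements are not nested and always have a closing </tag>.
sub tag_bodies {
    my ($html) = @_;
    my @bodies;
    my $inside = 0;     # true while we are between <tag> and </tag>
    my $buffer = '';    # text collected for the current element

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tagname) = @_;
            if ($tagname eq 'tag') { $inside = 1; $buffer = ''; }
        }, 'tagname' ],
        # Text may arrive in several chunks, so append rather than assign.
        text_h => [ sub {
            my ($text) = @_;
            $buffer .= $text if $inside;
        }, 'dtext' ],
        end_h => [ sub {
            my ($tagname) = @_;
            if ($tagname eq 'tag' && $inside) {
                push @bodies, $buffer;
                $inside = 0;
            }
        }, 'tagname' ],
    );
    $parser->parse($html);
    $parser->eof;
    return @bodies;
}

# Made-up sample input; the first body spans two lines.
my $html = "<tag>Purina\nChow</tag> <tag>Alpo for dinner</tag> <tag>Kibble</tag>";

# Step two: the "simple regexp" that keeps only the non-Alpo bodies.
my @no_alpo = grep { !/Alpo/ } tag_bodies($html);
print "$_\n" for @no_alpo;
```

The filter itself stays trivial because the parser has already done the work of finding where each element starts and ends.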
You're going to run into a lot of unforeseen problems by trying to roll
your own regexp. For example, this is valid:
<tag attr="NotDogFood"/>
but you're not going to get it, as you're looking for "</tag>" to finish
it off. Or what if $text spans more than one line? Without the /s
modifier, "." won't match a newline. And surely there's no malformed
HTML out there either ... the list goes on and on.
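
To make two of those pitfalls concrete, here is a small sketch that runs the pattern from the question against three made-up inputs. The variable names and sample strings are only for illustration; the pattern itself is the one quoted above.

```perl
use strict;
use warnings;

# The pattern from the question, without and with the /s modifier.
my $re   = qr{<tag>(?:(?!Alpo).)*</tag>};
my $re_s = qr{<tag>(?:(?!Alpo).)*</tag>}s;

# Made-up inputs, one per case.
my $one_line  = "<tag>Purina Chow</tag>";       # body on a single line
my $two_lines = "<tag>Purina\nChow</tag>";      # body spans two lines
my $empty     = '<tag attr="NotDogFood"/>';     # no "</tag>" at all

for my $case ( [ one_line  => $one_line  ],
               [ two_lines => $two_lines ],
               [ empty     => $empty     ] ) {
    my ($name, $text) = @$case;
    printf "%-10s plain=%d with_s=%d\n",
        $name,
        ($text =~ $re)   ? 1 : 0,
        ($text =~ $re_s) ? 1 : 0;
}
```

The two-line body only matches once /s lets "." cross the newline, and the empty-element form matches neither version because the pattern insists on a literal "<tag>" and "</tag>".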
Chris