SPUG: Regex question

Chris Wilkes cwilkes-spug at ladro.com
Wed Apr 21 20:31:44 CDT 2004


On Wed, Apr 21, 2004 at 02:16:08PM -0400, Marc M. Adkins wrote:
> Let's say I'm parsing HTML using regular expressions and I want to find:
> 
>         <tag> ... </tag>
> 
> where the text within the tag body does _not_ contain the word 'Alpo'.
> 
> I've seen the following as a solution:
> 
>         $text =~ m|<tag>(?:(?!Alpo).)*</tag>|;
> 
> This seems really compute-intensive.  Is there a better way?

I would check out HTML::Parser to do this job for you.  In particular,
HTML::LinkExtor pulls links out of HTML documents; just modify its
code to pull out your <tag>.

Then you can loop through all the matches and pull out the non-Alpos
with a simple regexp.
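
A minimal sketch of that two-step approach, assuming HTML::Parser is
installed from CPAN.  The sample input, the element name 'tag', and the
variable names are made up for illustration; this is one way to wire up
the handlers, not the only one.

```perl
use strict;
use warnings;
use HTML::Parser;   # CPAN module, assumed to be installed

# Hypothetical sample input; 'tag' stands in for the element you want.
my $html = <<'HTML';
<tag>Purina</tag>
<tag>Some
Alpo here</tag>
<tag attr="NotDogFood">Iams</tag>
HTML

my @bodies;     # text of each completed <tag> element
my $current;    # undef while we are outside a <tag>

my $parser = HTML::Parser->new(
    api_version => 3,
    # Opening <tag> (with or without attributes): start collecting text.
    start_h => [ sub { $current = '' if $_[0] eq 'tag' }, 'tagname' ],
    # Text may arrive in several chunks, so append rather than assign.
    text_h  => [ sub { $current .= $_[0] if defined $current }, 'dtext' ],
    # Closing </tag>: save what we collected and stop collecting.
    end_h   => [ sub {
        return unless $_[0] eq 'tag' && defined $current;
        push @bodies, $current;
        undef $current;
    }, 'tagname' ],
);
$parser->parse($html);
$parser->eof;

# Step two: the "simple regexp" that keeps only the non-Alpo bodies.
my @non_alpo = grep { !/Alpo/ } @bodies;
print "$_\n" for @non_alpo;
```

The parser takes care of attributes and line breaks, so the filter
itself stays a plain /Alpo/ test.  Nested <tag> elements are not handled
in this sketch.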

You're going to run into a lot of unforeseen problems by trying to roll
your own regexp.  For example, this is valid:
  <tag attr="NotDogFood"/>
but you're not going to get it, as you're looking for a literal "<tag>"
to open it and "</tag>" to finish it off.  Or what if the $text is more
than one line?  Without the /s modifier, "." won't match a newline.  And
surely there's no malformed HTML out there either .... the list goes on
and on.
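
To make a few of these failure modes concrete, here is a small sketch
that runs the pattern from the question against some made-up inputs.
The sample strings and variable names are hypothetical; only the regexp
itself comes from the original post.

```perl
use strict;
use warnings;

my $re   = qr|<tag>(?:(?!Alpo).)*</tag>|;    # pattern from the question
my $re_s = qr|<tag>(?:(?!Alpo).)*</tag>|s;   # same, but "." also matches "\n"

# Hypothetical inputs, none of which contain 'Alpo'.
my $one_line   = '<tag>Purina</tag>';
my $multi_line = "<tag>Purina\nChow</tag>";
my $self_close = '<tag attr="NotDogFood"/>';
my $with_attr  = '<tag attr="NotDogFood">Purina</tag>';

my $m_one     = $one_line   =~ $re   ? 1 : 0;  # plain case
my $m_multi   = $multi_line =~ $re   ? 1 : 0;  # "." stops at the newline
my $m_multi_s = $multi_line =~ $re_s ? 1 : 0;  # /s lets "." cross lines
my $m_self    = $self_close =~ $re   ? 1 : 0;  # no literal "<tag>" or "</tag>"
my $m_attr    = $with_attr  =~ $re   ? 1 : 0;  # attribute breaks literal "<tag>"

printf "%-22s %s\n", @$_ for (
    [ 'one line',            $m_one     ],
    [ 'multi-line',          $m_multi   ],
    [ 'multi-line with /s',  $m_multi_s ],
    [ 'self-closing',        $m_self    ],
    [ 'tag with attribute',  $m_attr    ],
);
```

Only the plain one-line case and the /s variant match; the other three
inputs are missed even though none of them contains 'Alpo'.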

Chris


