APM: Re: Regular Expression Guru's anyone?

Mike Stok mike at stok.co.uk
Thu Oct 10 20:04:40 CDT 2002


On Thu, 10 Oct 2002, Wayne Walker wrote:

> This will work (at least for my test data)
> 
> 
> guru.pl:
> 
> #!/usr/bin/perl
> 
> $/ = undef; # unset record separator to read in entire file at once
> 
> use strict; # the only way to write perl :)
> 
> my ($data, $newdata, $text, $tag);
> 
> $data = <DATA>; # Read in all the lines following __DATA__
> 
> # Break the string into 3 pieces:
> # text before a tag, tag, everything following the tag
> # leading non < characters, < all non >chars up to next >, everything else.
> while ( $data =~ /^([^<]*)(<[^>]*>)(.*)$/s)
> {
>     # Lazy man's way to grab 3 vars at a time :)
>     ($text, $tag, $data) = ($1, $2, $3);
>     # Fix the text
>     $text =~ s/bird/Hawk/gs;  # Globally change, treat as a single line
>     # Add the text and the tag to the $newdata string
>     $newdata .= $text . $tag;
> }
> # take whatever is left when there are no more tags and fix it and
> # append it to $newdata
> 
> $data =~ s/bird/Hawk/gs;
> $newdata .= $data;
> 
> print $newdata;

Or, if you're being really lazy:

use HTML::TokeParser;

my $parser = HTML::TokeParser->new(*DATA);

while (my $token = $parser->get_token) {
    next unless $token->[0] eq 'T';     # Text?
    while ($token->[1] =~ /bird/g) {
        print "found $`>>>$&<<<$'\n";
    }
}

__DATA__
this is some text about a bird, a bird is cool, here is a picture of a
bird <img src='bird.jpg'>

[etc...]

This works as long as you don't want to find stuff like bi<!-- bird -->rd,
where a comment splits the word. In that case you need to concatenate the
text fragments and deal with them all at the end.
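
A rough sketch of that concatenation approach, in case it helps. The sample
string in $html is made up for illustration (it is not David's real data),
and I'm passing a reference to a scalar so the parser treats it as the
document itself rather than a file name. Only text tokens are collected, so
tags and comments drop out before the match is attempted.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample input: one "bird" is split by a comment,
# one is plain text, and one sits inside a tag attribute.
my $html = "a bi<!-- bird -->rd and a bird <img src='bird.jpg'>";

# A reference to a scalar is parsed as the literal document.
my $parser = HTML::TokeParser->new(\$html)
    or die "cannot create parser";

# Keep only text tokens ('T'); tags and comments are skipped.
my $text = '';
while (my $token = $parser->get_token) {
    next unless $token->[0] eq 'T';
    $text .= $token->[1];
}

# Count the matches in the concatenated text.
my $count = () = $text =~ /bird/g;
print "text: $text\n";
print "matches: $count\n";
```

The catch is that once the fragments are glued together you no longer know
where each match sat relative to the tags, so this suits finding or counting
better than replacing text in place.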

Mike

> On Thu, Oct 10, 2002 at 03:49:13PM -0500, David Lyons wrote:
> > Here is what I am trying to do, I need to match text that is in an html 
> > document but specifically not inside an HTML tag, ie:
> > 
> > matching the word bird:
> > 
> > this is some text about a bird, a bird is cool, here is a picture of a 
> > bird <img src='bird.jpg'>
> > 
> > would hit on the two instances of "bird" but not on the one in the img 
> > tag (or any other HTML tag for that matter).
> > 
> > Thanks,
> > D

-- 
mike at stok.co.uk                    |           The "`Stok' disclaimers" apply.
http://www.stok.co.uk/~mike/       | GPG PGP Key      1024D/059913DA 
mike at exegenix.com                  | Fingerprint      0570 71CD 6790 7C28 3D60
http://www.exegenix.com/           |                  75D2 9EC4 C1C0 0599 13DA



