APM: Re: Regular Expression Guru's anyone?
Mike Stok
mike at stok.co.uk
Thu Oct 10 20:04:40 CDT 2002
On Thu, 10 Oct 2002, Wayne Walker wrote:
> This will work (at least for my test data)
>
>
> guru.pl:
>
> #!/usr/bin/perl
>
> $/ = undef; # unset record separator to read in entire file at once
>
> use strict; # the only way to write perl :)
>
> my ($data, $newdata, $text, $tag);
>
> $data = <DATA>; # Read in all the lines following __DATA__
>
> # Break the string into 3 pieces:
> # text before a tag, tag, everything following the tag
> # leading non < characters, < all non >chars up to next >, everything else.
> while ( $data =~ /^([^<]*)(<[^>]*>)(.*)$/s)
> {
>     # Lazy man's way to grab 3 vars at a time :)
>     ($text, $tag, $data) = ($1, $2, $3);
>     # Fix the text
>     $text =~ s/bird/Hawk/gs; # Globally change, treat as a single line
>     # Add the text and the tag to the $newdata string
>     $newdata .= $text . $tag;
> }
> # take whatever is left when there are no more tags and fix it and
> # append it to $newdata
>
> $data =~ s/bird/Hawk/gs;
> $newdata .= $data;
>
> print $newdata;
Or, if you're being really lazy:
use HTML::TokeParser;

my $parser = HTML::TokeParser->new(*DATA);
while (my $token = $parser->get_token) {
    next unless $token->[0] eq 'T'; # Text?
    while ($token->[1] =~ /bird/g) {
        print "found $`>>>$&<<<$'\n";
    }
}
__DATA__
this is some text about a bird, a bird is cool, here is a picture of a
bird <img src='bird.jpg'>
[etc...]
This works as long as you don't want to find text that is split by markup,
like bi<!-- bird -->rd. In that case you need to concatenate the text
fragments and match against the whole thing at the end.
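
A minimal sketch of that concatenating variant, in case it helps. It
assumes the HTML is already in a string; the sample input and the names
$html, $text and $count are made up for illustration. It only counts the
matches, it does not map them back to positions in the original HTML.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample input: a comment splits the second "bird".
my $html = "a bird, then a bi<!-- bird -->rd <img src='bird.jpg'>";

# A reference to a scalar tells the parser to read from the string itself.
my $parser = HTML::TokeParser->new(\$html);

# Keep only the text tokens; tags and comments are skipped.
my $text = '';
while (my $token = $parser->get_token) {
    next unless $token->[0] eq 'T';
    $text .= $token->[1];
}

# Match against the joined text, so a word split by markup still counts.
my $count = () = $text =~ /bird/g;
print "found $count\n";
```

The "bird" inside the comment and the one in the img attribute never reach
$text, so they are not counted.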
Mike
> On Thu, Oct 10, 2002 at 03:49:13PM -0500, David Lyons wrote:
> > Here is what I am trying to do, I need to match text that is in an html
> > document but specifically not inside an HTML tag, ie:
> >
> > matching the word bird:
> >
> > this is some text about a bird, a bird is cool, here is a picture of a
> > bird <img src='bird.jpg'>
> >
> > would hit on the two instances of "bird" but not on the one in the img
> > tag (or any other HTML tag for that matter).
> >
> > Thanks,
> > D
--
mike at stok.co.uk | The "`Stok' disclaimers" apply.
http://www.stok.co.uk/~mike/ | GPG PGP Key 1024D/059913DA
mike at exegenix.com | Fingerprint 0570 71CD 6790 7C28 3D60
http://www.exegenix.com/ | 75D2 9EC4 C1C0 0599 13DA