SPUG: extracting text between <a> and </a>
Matt Tucker
tuck at whistlingfish.net
Thu Oct 5 13:04:35 CDT 2000
-- Todd Wells <toddw at wrq.com> spake thusly:
> I'm working on a little web automation routine and I've used
> HTML::LinkExtor to extract the links from a web page, then I'm
> processing each of those links.
>
> What I'd like to know is if there's some easy way that I could get the
> original text that accompanied that link. e.g., <a href =
> "http://thislink"> this text here I want </a>.
You could do this with a simple regex. It ignores the structure of the
document (for example, it stops at the first tag nested inside the
link), but if you don't mind that, this should work:
    # Collect the text that follows each opening <a ...> tag.
    while ($html =~ /<[aA]\b[^>]*>([^<]*)/g) {
        push @linklist, $1;
    }
You could also expand this a bit to pull the href out, if you so choose.
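As a rough sketch of that expansion, something like the following
should capture the href alongside the link text. The sample $html
string, the @links array and the hash keys are made up for
illustration, and the pattern assumes the href value is quoted. Like
the shorter regex, it stops at the first nested tag.

```perl
use strict;
use warnings;

# Hypothetical sample input; in practice $html would hold the fetched page.
my $html = q{<p><a href = "http://thislink"> this text here I want </a> and }
         . q{<A HREF='http://other'>second</A></p>};

my @links;
# Match an <a> tag that carries a quoted href attribute, then capture
# the text that follows it, up to the next '<'.
while ($html =~ /<a\b[^>]*?\bhref\s*=\s*(["'])(.*?)\1[^>]*>([^<]*)/gi) {
    my ($url, $text) = ($2, $3);
    $text =~ s/^\s+|\s+$//g;    # trim leading and trailing whitespace
    push @links, { href => $url, text => $text };
}

print "$_->{href} => $_->{text}\n" for @links;
```

Each element of @links is a small hash, so the URL and its text stay
paired when you process the links later.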
If you want to keep using a parser like HTML::LinkExtor, and that one
doesn't do what you want, you should probably look at a different
parser (HTML::Parser, for instance, which HTML::LinkExtor is derived
from). Alternatively, you could change HTML::LinkExtor and submit your
patch to the package maintainer.
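For the HTML::Parser route, a minimal sketch might look like the
following. It assumes HTML::Parser 3.x is installed from CPAN and uses
its handler-style interface (api_version => 3); the sample HTML and the
@links and $current variables are made up for illustration. Unlike the
regex, it keeps text that sits inside nested tags.

```perl
use strict;
use warnings;
use HTML::Parser ();

my @links;      # finished { href, text } records
my $current;    # record for the <a> that is currently open, if any

my $parser = HTML::Parser->new(
    api_version => 3,
    # On <a href=...>, start a new record.
    start_h => [ sub {
        my ($tag, $attr) = @_;
        $current = { href => $attr->{href}, text => '' }
            if $tag eq 'a' && defined $attr->{href};
    }, 'tagname, attr' ],
    # While a link is open, accumulate its (entity-decoded) text.
    text_h => [ sub {
        my ($text) = @_;
        $current->{text} .= $text if $current;
    }, 'dtext' ],
    # On </a>, store the finished record.
    end_h => [ sub {
        my ($tag) = @_;
        return unless $tag eq 'a' && $current;
        push @links, $current;
        undef $current;
    }, 'tagname' ],
);

# Hypothetical sample input.
$parser->parse(q{<p><a href="http://thislink"> this <b>text</b> here I want </a></p>});
$parser->eof;

print "$_->{href} => $_->{text}\n" for @links;
```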