SPUG: extracting text between <a> and </a>

Thu Oct 5 13:04:35 CDT 2000

-- Todd Wells <toddw at wrq.com> spake thusly:

> I'm working on a little web automation routine and I've used
> HTML::LinkExtor to extract the links from a web page, then I'm
> processing each of those links.
>
> What I'd like to know is if there's some easy way that I could get the
> original text that accompanied that link.  e.g., <a href =
> "http://thislink"> this text here I want </a>.

You could do this with a simple regex. It ignores the structure of the 
document, but if you don't mind that then this should work:

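# Capture the text that follows each opening <a ...> tag, up to the next '<'.
# Markup nested inside the link (e.g. <b>) cuts the captured text short.
my @linklist;
while ($html =~ /<a\b[^>]*>([^<]*)/gi) {
	push @linklist, $1;
}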

You could also expand this a bit to pull the href out, if you so chose.
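
As a rough sketch of that expansion: the sub below (links_with_text is just a name made up for this example) captures the href along with the text. It assumes the href value is quoted with single or double quotes, and it has the same blind spot as the regex above when other tags are nested inside the link.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a list of [href, text] pairs.
# Assumes quoted href values; ignores document structure entirely.
sub links_with_text {
    my ($html) = @_;
    my @pairs;
    while ($html =~ /<a\b[^>]*?\bhref\s*=\s*(["'])(.*?)\1[^>]*>([^<]*)/gis) {
        my ($href, $text) = ($2, $3);
        $text =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
        push @pairs, [ $href, $text ];
    }
    return @pairs;
}
```

Each element of the returned list is an array ref, so $pair->[0] is the URL and $pair->[1] is the link text.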

If you want to continue using a parser like HTML::LinkExtor, and that 
one doesn't do what you want, it seems like you should look at using a 
different parser (HTML::Parser, for instance, which it derives from).
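
For what it's worth, here is a minimal sketch of the HTML::Parser route, assuming the version 3 event API from CPAN is installed. The sub name extract_links and the shape of its return value are my own choices for illustration, not anything HTML::LinkExtor provides.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: returns [href, text] pairs for each <a href=...>...</a>.
# Text inside nested tags (e.g. <b>) is included in the link text.
sub extract_links {
    my ($html) = @_;
    my (@links, $current);

    my $parser = HTML::Parser->new(
        api_version => 3,
        # Opening tag: start collecting text if this is an <a> with an href.
        start_h => [ sub {
            my ($tag, $attr) = @_;
            return unless $tag eq 'a' && defined $attr->{href};
            $current = { href => $attr->{href}, text => '' };
        }, 'tagname, attr' ],
        # Text may arrive in several chunks, so append rather than assign.
        text_h => [ sub {
            my ($text) = @_;
            $current->{text} .= $text if $current;
        }, 'dtext' ],
        # Closing </a>: trim the collected text and record the pair.
        end_h => [ sub {
            my ($tag) = @_;
            return unless $tag eq 'a' && $current;
            (my $text = $current->{text}) =~ s/^\s+|\s+$//g;
            push @links, [ $current->{href}, $text ];
            undef $current;
        }, 'tagname' ],
    );

    $parser->parse($html);
    $parser->eof;
    return @links;
}
```

Something like `my @links = extract_links($html);` then gives you the URL and its accompanying text together, which the regex approach only manages for the simplest links.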

Alternatively, you could change HTML::LinkExtor yourself and submit your
patch to the package maintainer.