SPUG: extracting text between <a> and </a>
Todd Wells
toddw at wrq.com
Thu Oct 5 10:54:59 CDT 2000
I'm working on a little web automation routine. I've used HTML::LinkExtor
to extract the links from a web page, and then I process each of those
links.
What I'd like to know is whether there's some easy way to get the original
text that accompanied each link. For example, given
<a href = "http://thislink"> this text here I want </a>
I'd like to get back "this text here I want". Here is what I have so far:
use LWP::UserAgent;
use HTTP::Request;
use HTML::LinkExtor;
use URI::URL;

sub link_scan
{
    # input is $url, output is a list of links found at that URL
    my $url = shift;
    my @linklist;

    # retrieve HTML doc at URL
    my $ua = LWP::UserAgent->new;
    my $request = HTTP::Request->new('GET', $url);
    my $response = $ua->request($request);
    my $body = $response->content;
    my $base = $response->base;

    # scan HTML doc for other URLs
    my $link_parser = HTML::LinkExtor->new();
    $link_parser->parse($body);
    $link_parser->eof;
    my @parsed = $link_parser->links;
    foreach my $link (@parsed)
    {
        my $tag = $link->[0];
        if (($tag eq "a") or ($tag eq "A"))
        {
            # This is the piece I'm missing. HTML::LinkExtor has no
            # get_trimmed_text method, so this line does not work:
            # my $text = $link_parser->get_trimmed_text;
            my $new_url = URI::URL->new($link->[2]);
            # resolve relative links against the response's base URL
            my $full_url = $new_url->abs($base);
            chomp $full_url;
            # already_processed() is defined elsewhere in my script
            unless (already_processed($full_url)) {push @linklist, $full_url;}
        }
    }
    return @linklist;
}
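
One possible direction, offered as a rough sketch rather than a tested fix
for the routine above: HTML::TokeParser (from the same HTML-Parser
distribution as HTML::LinkExtor) does provide get_trimmed_text, and it can
walk the <a> tags and collect the text up to each closing </a>. The helper
name links_with_text and the sample string are made up for illustration.
The sketch works on an HTML string, so fetching the page and making the
URLs absolute would still be done as in link_scan.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: return a list of [href, link text] pairs
# found in an HTML string.
sub links_with_text {
    my ($html) = @_;
    my @pairs;
    my $parser = HTML::TokeParser->new(\$html)
        or die "Could not create parser";

    # get_tag("a") skips ahead to the next <a ...> start tag
    while (my $tag = $parser->get_tag("a")) {
        my $href = $tag->[1]{href};
        next unless defined $href;    # skip anchors that have no href

        # text up to the closing </a>, with whitespace trimmed and collapsed
        my $text = $parser->get_trimmed_text("/a");
        push @pairs, [$href, $text];
    }
    return @pairs;
}

# Sample input modeled on the example in the question
my $sample = '<a href = "http://thislink"> this text here I want </a>';
for my $pair (links_with_text($sample)) {
    print "$pair->[0]\t$pair->[1]\n";
}
```

Each pair keeps the URL and its text together, so the text could be carried
along when the URL is pushed onto @linklist.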