SPUG: extracting text between <a> and </a>
Joel Grow
largest at largest.org
Thu Oct 5 13:35:28 CDT 2000
Great responses to this question. I should have thought of
Text::Balanced, but I went right to regex like several other people have
already mentioned. I also wanted to add that you can simplify some of
this by using LWP::Simple, if you just want to grab the URL page text.
Here's my version:
#!/usr/local/bin/perl
use strict;
use LWP::Simple;
use HTML::LinkExtor;
use Data::Dumper;
my $url = shift || 'http://www.largest.org';
# %links contains URLs as keys and the text as values
#
# ie: %links = ( 'http://www.halcyon.com/spug/' => 'SPUG homepage',
# 'http://www.perl.com' => 'Perl homepage' );
my %links = link_scan($url);
print Dumper (\%links);
exit;
sub link_scan {
# input is $url, output is a list of links found at that URL
my ($url) = @_;
my %links;
# retrieve HTML doc at URL
my $page_text = get($url);
my $link_parser = HTML::LinkExtor->new();
# scan HTML doc for other URLS
$link_parser->parse($page_text);
my @links = $link_parser->links;
foreach my $link ( @links ) {
# $link = ['a',
# 'href',
# 'http://www.foobar.com/baz/blotz.html' ]
my ($tag, $attr, $url) = @$link;
if ( $tag =~ /^[aA]$/ ) {
# make sure to escape things like ? that can appear in URLs
my $url_to_match = quotemeta $url;
my ($tag_text) =
($page_text =~ /<\s* # start of the tag (<)
$tag # the 'a'
\s+ # some space
$attr # the 'href'
\s*=\s*"? # = and maybe a "
$url_to_match # the URL
\s*"?\s*>\s* # maybe a " and the >
(.*?) # the text we want
<\/a>/ixs); # the ending </a>
$links{$url} = $tag_text;
}
}
return %links;
}
__END__
On Thu, 5 Oct 2000, Todd Wells wrote:
> I'm working on a little web automation routine and I've used HTML::LinkExtor
> to extract the links from a web page, then I'm processing each of those
> links.
>
> What I'd like to know is if there's some easy way that I could get the
> original text that accompanied that link. e.g., <a href =
> "http://thislink"> this text here I want </a>.
>
>
> sub link_scan
> {
> # input is $url, output is a list of links found at that URL
>
> my $url = shift;
> my @linklist; my @ziplist;
>
> # retrieve HTML doc at URL
> my $ua = new LWP::UserAgent;
> my $request = new HTTP::Request('GET', $url);
> my $response = $ua->request($request);
> my $body = $response->content;
> my $base = $response->base;
>
> # scan HTML doc for other URLS
> my $link_parser = HTML::LinkExtor->new();
> $link_parser->parse($body);
> my @parsed = $link_parser->links;
>
> foreach my $link (@parsed)
> {
> my $tag = $link->[0];
>
> if (($tag eq "a") or ($tag eq "A"))
> {
> my $text = $link_parser->get_trimmed_text
> my $new_url = new URI::URL $link->[2];
> my $full_url = $new_url->abs($url);
> chomp $full_url;
> unless (already_processed($full_url)) {push @linklist,
> $full_url;}
> }
> }
> return @linklist;
> }
>
> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> POST TO: spug-list at pm.org PROBLEMS: owner-spug-list at pm.org
> Subscriptions; Email to majordomo at pm.org: ACTION LIST EMAIL
> Replace ACTION by subscribe or unsubscribe, EMAIL by your Email-address
> For daily traffic, use spug-list for LIST ; for weekly, spug-list-digest
> Seattle Perl Users Group (SPUG) Home Page: http://www.halcyon.com/spug/
>
>
>
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
POST TO: spug-list at pm.org PROBLEMS: owner-spug-list at pm.org
Subscriptions; Email to majordomo at pm.org: ACTION LIST EMAIL
Replace ACTION by subscribe or unsubscribe, EMAIL by your Email-address
For daily traffic, use spug-list for LIST ; for weekly, spug-list-digest
Seattle Perl Users Group (SPUG) Home Page: http://www.halcyon.com/spug/
More information about the spug-list
mailing list