SPUG: extracting text between <a> and </a>

Joel Grow largest at largest.org
Thu Oct 5 13:35:28 CDT 2000


Great responses to this question.  I should have thought of
Text::Balanced, but I went right to regex like several other people have
already mentioned.  I also wanted to add that you can simplify some of
this by using LWP::Simple, if you just want to grab the URL page text.
Here's my version:


#!/usr/local/bin/perl

use strict;
use LWP::Simple;
use HTML::LinkExtor;
use Data::Dumper;

my $url = shift || 'http://www.largest.org';

# %links contains URLs as keys and the text as values
#
# ie: %links = ( 'http://www.halcyon.com/spug/' => 'SPUG homepage',
#                'http://www.perl.com' => 'Perl homepage' );

my %links = link_scan($url);

print Dumper (\%links);
exit;

sub link_scan {
    # input is $url, output is a list of links found at that URL

    my ($url) = @_;

    my %links;

    # retrieve HTML doc at URL
    my $page_text = get($url);

    my $link_parser = HTML::LinkExtor->new();

    # scan HTML doc for other URLS
    $link_parser->parse($page_text);
    my @links = $link_parser->links;

    foreach my $link ( @links ) {

        # $link = ['a',
        #          'href',
        #          'http://www.foobar.com/baz/blotz.html' ]

        my ($tag, $attr, $url) = @$link;

        if ( $tag =~ /^[aA]$/ ) {
            # make sure to escape things like ? that can appear in URLs
            my $url_to_match = quotemeta $url;

            my ($tag_text) = 
                ($page_text =~ /<\s*          # start of the tag (<)
                                $tag          # the 'a'
                                \s+           # some space
                                $attr         # the 'href'
                                \s*=\s*"?     # = and maybe a "
                                $url_to_match # the URL
                                \s*"?\s*>\s*  # maybe a " and the >
                                (.*?)         # the text we want
                                <\/a>/ixs);   # the ending </a>

            $links{$url} = $tag_text;
	}
    }
    return %links;
}

__END__



On Thu, 5 Oct 2000, Todd Wells wrote:

> I'm working on a little web automation routine and I've used HTML::LinkExtor
> to extract the links from a web page, then I'm processing each of those
> links.
> 
> What I'd like to know is if there's some easy way that I could get the
> original text that accompanied that link.  e.g., <a href =
> "http://thislink"> this text here I want </a>. 
> 
> 
> sub link_scan
> {
>     # input is $url, output is a list of links found at that URL
> 
>     my $url = shift;
>     my @linklist; my @ziplist;
> 
>     # retrieve HTML doc at URL
>     my $ua = new LWP::UserAgent;
>     my $request = new HTTP::Request('GET', $url);
>     my $response = $ua->request($request);
>     my $body = $response->content;
>     my $base = $response->base;
> 
>     # scan HTML doc for other URLS
>     my $link_parser = HTML::LinkExtor->new();
>     $link_parser->parse($body);
>     my @parsed = $link_parser->links;
> 
>     foreach my $link (@parsed)
>     {
> 	my $tag = $link->[0];
> 
> 	if (($tag eq "a") or ($tag eq "A"))
> 	{
> 	    my $text = $link_parser->get_trimmed_text
> 	    my $new_url = new URI::URL $link->[2];
> 	    my $full_url = $new_url->abs($url);
> 	    chomp $full_url;
> 	    unless (already_processed($full_url)) {push @linklist,
> $full_url;}
> 	}
>     }
>     return @linklist;
> }
> 
>  - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>      POST TO: spug-list at pm.org       PROBLEMS: owner-spug-list at pm.org
>       Subscriptions; Email to majordomo at pm.org:  ACTION  LIST  EMAIL
>   Replace ACTION by subscribe or unsubscribe, EMAIL by your Email-address
>  For daily traffic, use spug-list for LIST ;  for weekly, spug-list-digest
>   Seattle Perl Users Group (SPUG) Home Page: http://www.halcyon.com/spug/
> 
> 
> 



 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
     POST TO: spug-list at pm.org       PROBLEMS: owner-spug-list at pm.org
      Subscriptions; Email to majordomo at pm.org:  ACTION  LIST  EMAIL
  Replace ACTION by subscribe or unsubscribe, EMAIL by your Email-address
 For daily traffic, use spug-list for LIST ;  for weekly, spug-list-digest
  Seattle Perl Users Group (SPUG) Home Page: http://www.halcyon.com/spug/





More information about the spug-list mailing list