SPUG: extracting text between <a> and </a>

Todd Wells toddw at wrq.com
Thu Oct 5 11:47:36 CDT 2000


I looked in the Cookbook. It has a recipe to extract the actual links (which
you'll see I'm doing in my code below), but I can't tell how to get the text
between the tags -- unless I'm reading it incorrectly.

<a href="http://text.the.recipe.gets"> this is the text I actually want</a>

-----Original Message-----
From: Rush Family [mailto:rush at citylinq.com]
Sent: Thursday, October 05, 2000 9:30 AM
To: Todd Wells; 'SPUG'
Subject: RE: SPUG: extracting text between <a> and </a>


Although I do not have it in front of me to check, I believe this exact
problem is solved in the Perl Cookbook from O'Reilly.

-----Original Message-----
From: owner-spug-list at pm.org [mailto:owner-spug-list at pm.org]On Behalf Of
Todd Wells
Sent: Thursday, October 05, 2000 8:55 AM
To: 'SPUG'
Subject: SPUG: extracting text between <a> and </a>


I'm working on a little web automation routine and I've used HTML::LinkExtor
to extract the links from a web page, then I'm processing each of those
links.

What I'd like to know is whether there's some easy way to get the
original text that accompanied each link.  For example, given <a href =
"http://thislink"> this text here I want </a>, I want "this text here I want".


sub link_scan
{
    # input is $url, output is a list of links found at that URL

    my $url = shift;
    my @linklist; my @ziplist;

    # retrieve HTML doc at URL
    my $ua = new LWP::UserAgent;
    my $request = new HTTP::Request('GET', $url);
    my $response = $ua->request($request);
    my $body = $response->content;
    my $base = $response->base;

    # scan HTML doc for other URLS
    my $link_parser = HTML::LinkExtor->new();
    $link_parser->parse($body);
    my @parsed = $link_parser->links;

    foreach my $link (@parsed)
    {
	my $tag = $link->[0];

	if (($tag eq "a") or ($tag eq "A"))
	{
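	    # Wanted: the link's text here. HTML::LinkExtor has no get_trimmed_text
	    # method (that one belongs to HTML::TokeParser), so this call would die:
	    # my $text = $link_parser->get_trimmed_text;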
	    my $new_url = new URI::URL $link->[2];
	    my $full_url = $new_url->abs($url);
	    chomp $full_url;
	    unless (already_processed($full_url)) {push @linklist, $full_url;}
	}
    }
    return @linklist;
}
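
One possible way to get at the text between the tags is HTML::TokeParser
(shipped in the same distribution as HTML::LinkExtor), which can walk the
document tag by tag instead of only collecting link attributes. The sketch
below is not part of the original code: link_texts is a hypothetical helper
name, and it works on an HTML string already in memory (such as $body in
link_scan above) rather than fetching anything itself.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: return a list of [href, text] pairs,
# one for each <a href="..."> found in the HTML string.
sub link_texts {
    my ($html) = @_;
    my @pairs;

    # Passing a scalar reference makes the parser read the string itself
    # instead of treating the argument as a file name.
    my $parser = HTML::TokeParser->new(\$html)
        or die "cannot create parser";

    # get_tag("a") skips ahead to the next <a ...> start tag;
    # tag and attribute names come back lowercased.
    while (my $tag = $parser->get_tag("a")) {
        my $href = $tag->[1]{href};     # element 1 is the attribute hash
        next unless defined $href;      # skip <a name="..."> anchors

        # Text up to the closing </a>, with surrounding whitespace trimmed.
        my $text = $parser->get_trimmed_text("/a");
        push @pairs, [$href, $text];
    }
    return @pairs;
}

my $sample = '<a href="http://text.the.recipe.gets"> this is the text I actually want</a>';
for my $pair (link_texts($sample)) {
    print "$pair->[0]\t$pair->[1]\n";
}
```

Inside link_scan, the same loop could probably replace the HTML::LinkExtor
pass entirely, since each iteration yields both the href and its text; the
href would still need to be made absolute the way $new_url->abs($url) does
above.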

 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
     POST TO: spug-list at pm.org       PROBLEMS: owner-spug-list at pm.org
      Subscriptions; Email to majordomo at pm.org:  ACTION  LIST  EMAIL
  Replace ACTION by subscribe or unsubscribe, EMAIL by your Email-address
 For daily traffic, use spug-list for LIST ;  for weekly, spug-list-digest
  Seattle Perl Users Group (SPUG) Home Page: http://www.halcyon.com/spug/








