SPUG: extracting text between <a> and </a>

Daniel Chetlin daniel at chetlin.com
Fri Oct 6 02:28:24 CDT 2000


Hi. This mail is in response to the various regex solutions posted for the
HTML parsing question. Please note that my aim is not to criticize,
merely to point out where they break. I've left attributions off, as I'm
responding to 5 different messages at once.

> $link =~ s/<a\s+href=.*?>//i;
> $link =~ s/<\/a>//i;

This doesn't work for the following valid HTML 3.2 data:

  <a href ="foo">Bar</a>
  <a href="foo">Bar</a >
  <a name="foo" href="bar">Baz</a>
  <a href="foo>bar">Baz</a>

>     m|<a[\w:"/= \.]*> ([\w ]*)</a>| and $text = $1;

For this one, problems include:

  <a href="foo">Bar</a>
  <a href="http://foo.com/bar?baz=blarch"> Bing</a>
  <a href="foo"> Bar &amp; Baz</a>
  <a href="foo>bar"> Baz</a>

> while ($html =~ /<[aA]\b[^>]*>([^<]*)/g) {

  <a href="foo">Bar <strong>Baz</strong></a>
  <a href="foo">Bar "<" Baz</a>
  <a href="foo>bar">Baz</a>
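To make the last case concrete (a hypothetical one-liner, not from any of
the posts): the `>` inside the attribute value terminates `[^>]*` early, so
the capture starts in the middle of the attribute:

```perl
use strict;
use warnings;

my $html = '<a href="foo>bar">Baz</a>';
if ($html =~ /<[aA]\b[^>]*>([^<]*)/) {
    # [^>]* stops at the > inside "foo>bar", so the capture
    # picks up the tail of the attribute, not just the text.
    print "captured: '$1'\n";   # captured: 'bar">Baz'
}
```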

>                 ($page_text =~ /<\s*          # start of the tag (<)
>                                 $tag          # the 'a'
>                                 \s+           # some space
>                                 $attr         # the 'href'
>                                 \s*=\s*"?     # = and maybe a "
>                                 $url_to_match # the URL
>                                 \s*"?\s*>\s*  # maybe a " and the >
>                                 (.*?)         # the text we want
>                                 <\/a>/ixs);   # the ending </a>

(For these, assume that $url_to_match is set to "foo")

  <a href='foo'>Bar</a>
  <a href="foo" name="bar">Baz</a>
  <a href="foo">Foo <!-- </a> --> Bar</a>
  <a href="foo>bar">Baz</a>
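The comment case is worth seeing concretely. In this hypothetical demo
(I've added \Q...\E to quote the interpolated URL, which the original
didn't do), the lazy `(.*?)` stops at the first `</a>` it finds, even when
that `</a>` sits inside an HTML comment:

```perl
use strict;
use warnings;

my $html         = '<a href="foo">Foo <!-- </a> --> Bar</a>';
my $url_to_match = 'foo';
if ($html =~ /<\s*a\s+href\s*=\s*"?\Q$url_to_match\E\s*"?\s*>\s*(.*?)<\/a>/is) {
    # The match ends at the </a> inside the comment, truncating the text.
    print "captured: '$1'\n";   # captured: 'Foo <!-- '
}
```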

> Thanks to all who have responded...  
> 
> I guess I'm a little surprised that there isn't some existing simple
> method of HTML::Parser or ::LinkExtor to give you this info.  You've
> all provided interesting ways of tackling the issue, I'll have to
> experiment to determine which one will work best for me.  

There is. LinkExtor isn't the way to go, as its stated purpose is to
grab _only_ the links. However, using TokeParser or Parser you shouldn't
have many problems:

  use HTML::TokeParser;
  my $parser = HTML::TokeParser::->new($ARGV[0] || \*STDIN);
  my %urls;
  while ($_ = $parser->get_tag("a")) {
      push @{$urls{$_->[1]{href}}}, $parser->get_trimmed_text("/a")
          if exists $_->[1]{href};
  }
  local $" = "\n\t";
  print "$_ => @{$urls{$_}}\n" for keys %urls;

Unless I'm misunderstanding what you want, this should do it. You end up
with a hash, %urls, that contains each link in the document as keys, and
as corresponding values a ref to an array of the text found between
those links.
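For completeness, here's a rough sketch of the same idea taking the
"or Parser" route, i.e. HTML::Parser's event handlers directly. This is my
own adaptation, assuming the version 3 handler API; the whitespace
regexes are a stand-in for what get_trimmed_text does for you:

```perl
use strict;
use warnings;
use HTML::Parser;

my %urls;
my ($href, $text);

my $p = HTML::Parser->new(
    api_version => 3,
    start_h => [ sub {
        my ($tag, $attr) = @_;
        # Begin collecting text when we enter an <a> with an href.
        ($href, $text) = ($attr->{href}, '')
            if $tag eq 'a' && exists $attr->{href};
    }, 'tagname,attr' ],
    text_h => [ sub { $text .= shift if defined $href }, 'dtext' ],
    end_h  => [ sub {
        if (shift eq 'a' && defined $href) {
            $text =~ s/^\s+|\s+$//g;   # trim ends...
            $text =~ s/\s+/ /g;        # ...and collapse runs of whitespace
            push @{ $urls{$href} }, $text;
            undef $href;
        }
    }, 'tagname' ],
);

$p->parse('<a href="foo">Bar <strong>Baz</strong></a>');
$p->eof;
print "$_ => @{$urls{$_}}\n" for keys %urls;   # foo => Bar Baz
```

Note that nested markup like the <strong> above falls through naturally:
the text handler keeps accumulating until the closing </a> fires.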

-dlc

P.S. Just as one final disclaimer, none of the code I've quoted above is
bad Perl. I just worry about people seeing those solutions and thinking
that they're robust enough to use in production. Please take my comments
in the friendly spirit in which they're intended. Thanks.

 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
     POST TO: spug-list at pm.org       PROBLEMS: owner-spug-list at pm.org
      Subscriptions; Email to majordomo at pm.org:  ACTION  LIST  EMAIL
  Replace ACTION by subscribe or unsubscribe, EMAIL by your Email-address
 For daily traffic, use spug-list for LIST ;  for weekly, spug-list-digest
  Seattle Perl Users Group (SPUG) Home Page: http://www.halcyon.com/spug/
