[Pdx-pm] NEWBIE question - Am i making this too complex parsing HTML

Ovid publiustemp-pdxpm at yahoo.com
Sat May 7 16:00:09 PDT 2005


Hi Pete,

That seems like some fairly nice code for a newbie.  (Are you sure
you're a newbie?)


> use LWP;
> use HTML::TokeParser::Simple;
> 
> # using LWP instead of Simple for future needs
> my $browser = LWP::UserAgent->new;
> my $url = "http://www.undeerc.org/wind/winddb";
> 
> my $response = $browser->get( $url );
>   die "Can't get $url -- ", $response->status_line
>    unless $response->is_success;
> 
> my $content = $response->content;
> 
> $content =~ s/\r//g;
> 
> my $p=HTML::TokeParser::Simple->new(\$content);

I noticed you mention that you need LWP.  If you can explain what
features you need beyond LWP::Simple, I can see what I can do about
expanding HTML::TokeParser::Simple to incorporate those needs, perhaps
by allowing you to pass a callback that will fetch the HTML for you.

In the meantime, if LWP::Simple were sufficient (though it sounds like
it might not be), the following will accomplish what you have:

  use HTML::TokeParser::Simple 3.13;
  my $p = HTML::TokeParser::Simple->new(url => $url) or die $!;
 
> my ($href, $token);
> 
> while ( $token = $p->get_token ) {
>   if ( $token->is_start_tag('map') && ( $token->get_attr('name') eq
> 'region' ) ) {
>     until ($token->is_end_tag('map') ) {
>       $token = $p->get_token;
>       if ($token->is_start_tag('area') ) {
>         $href = $token->get_attr('href');
>         print "HREF:$href\n";
>       }
>     }
>     last;
>   }
> }

Remember that HTML::TokeParser::Simple is a subclass of
HTML::TokeParser, so the methods in the latter still work.  In
particular, you can call "get_tag" with a tag name to jump straight to
it (though you need to be careful not to overshoot other tags that are
important).  Here's how I might write that, though I'm not sure it's
much better.

  while (my $token = $p->get_tag('map')) {
    until ($token->is_end_tag('map')) {
      $token = $p->get_tag or last; # out of HTML
      next unless $token->is_start_tag('area');
      my $href = $token->get_attr('href') or next;
      print "HREF: $href\n";
    }
  }
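
One caveat with the loop above: it no longer checks the "name"
attribute, so it walks every map on the page rather than just the
"region" one.  Below is a rough, self-contained sketch that keeps that
check.  The sample HTML, the MAP loop label and the @hrefs array are
made up for illustration; I'm only assuming the real page is shaped
roughly like this.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical markup standing in for the real page.
my $html = <<'END_HTML';
<map name="other"><area href="/skip.html"></map>
<map name="region">
  <area href="/nd.html">
  <area href="/sd.html">
  <area alt="no link here">
</map>
END_HTML

my $p = HTML::TokeParser::Simple->new(\$html);

my @hrefs;
MAP: while ( my $token = $p->get_tag('map') ) {
    # Ignore any map that is not the one we care about.
    next MAP unless ( $token->get_attr('name') || '' ) eq 'region';

    # Walk the tags that follow until the closing </map>.
    while ( $token = $p->get_tag ) {
        last MAP if $token->is_end_tag('map');
        next unless $token->is_start_tag('area');
        my $href = $token->get_attr('href') or next;
        push @hrefs, $href;
    }
}

print "HREF: $_\n" for @hrefs;
```

Collecting the hrefs in an array rather than printing inside the loop
is just a convenience, in case you want to fetch those pages later.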

Cheers,
Ovid

-- 
If this message is a response to a question on a mailing list, please send
follow up questions to the list.

Web Programming with Perl -- http://users.easystreet.com/ovid/cgi_course/

