[Pdx-pm] NEWBIE question - Am i making this too complex parsing HTML

Pete Lancashire nix at petelancashire.com
Sat May 7 17:33:05 PDT 2005


On Sat, 2005-05-07 at 16:00, Ovid wrote:
> Hi Pete,
> 
> That seems like some fairly nice code for a newbie.  (Are you sure
> you're a newbie?)

I consider myself new, I guess about 1K lines of Perl total. And this
is a snippet from my first 'real' Perl program vs. the many 1-20 line
scripts, and my first pass at using all the debug, warning, strict etc.

> I noticed you mention that you need LWP.  If you can explain what
> features you need beyond LWP::Simple, I can see what I can do about
> expanding HTML::TokeParser::Simple to incorporate those needs, perhaps
> by allowing you to pass a callback that will fetch the HTML for you.

The program this is from (I was doing my own parsing up to now) is a
program that queries the same server with URLs that have been
created from parsing forms. I got tired of writing comments "#
this needs to be done by an HTML parser".

It could very well be that ::Simple will work, fingers crossed.
The next URL may need to have a cookie or two.
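
A minimal sketch of how the cookie case might be handled: fetch the page with LWP::UserAgent and an in-memory cookie jar, then hand the HTML to HTML::TokeParser::Simple as a string instead of a URL. The helper names fetch_html and parser_for_html are made up for illustration, and the URL is assumed to come in on the command line.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::TokeParser::Simple;

# Hypothetical helper: GET a URL through a user agent and return the body.
# Dies with the HTTP status line if the request fails.
sub fetch_html {
    my ($ua, $url) = @_;
    my $response = $ua->get($url);
    die "GET $url failed: " . $response->status_line . "\n"
        unless $response->is_success;
    return $response->decoded_content;
}

# Hypothetical helper: build a parser from HTML that is already in memory.
sub parser_for_html {
    my ($html) = @_;
    return HTML::TokeParser::Simple->new(string => $html);
}

# Only touch the network when a URL is given on the command line.
if (@ARGV) {
    # cookie_jar => {} gives the agent an in-memory cookie jar, so cookies
    # set by one response are sent back on later requests by the same agent.
    my $ua = LWP::UserAgent->new(cookie_jar => {});
    my $p  = parser_for_html(fetch_html($ua, $ARGV[0]));

    while (my $token = $p->get_token) {
        next unless $token->is_start_tag('area');
        my $href = $token->get_attr('href');
        print "HREF: $href\n" if defined $href;
    }
}
```

Reusing the same $ua for every request is what keeps the cookies around between the form page and the URLs built from it.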

> 
> In the meantime, if LWP::Simple were sufficient (though it sounds like
> it might not be), the following will accomplish what you have:
> 
>   use HTML::TokeParser::Simple 3.13;
>   my $p = HTML::TokeParser::Simple->new(url => $url) or die $!;
>  
> > my ($href, $token);
> > 
> > while ( $token = $p->get_token ) {
> >   if ( $token->is_start_tag('map') && ( $token->get_attr('name') eq
> > 'region' ) ) {
> >     until ($token->is_end_tag('map') ) {
> >       $token = $p->get_token;
> >       if ($token->is_start_tag('area') ) {
> >         $href = $token->get_attr('href');
> >         print "HREF:$href\n";
> >       }
> >     }
> >     last;
> >   }
> > }
> 
> Remember that HTML::TokeParser::Simple is a subclass of
> HTML::TokeParser, so the methods in the latter still work.  In
> particular, you can call "get_tag" with a tag name to jump straight to
> it (though you need to be careful not to overshoot other tags that are
> important.)  Here's how I might write that, though I'm not sure it's
> much better.
> 
>   while (my $token = $p->get_tag('map')) {
>     until ($token->is_end_tag('map')) {
>       $token = $p->get_tag or last; # out of HTML
>       next unless $token->is_start_tag('area');
>       my $href = $token->get_attr('href') or next;
>       print "HREF: $href\n";
>     }
>   }
> 
> Cheers,
> Ovid
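
A sketch that combines the two versions quoted above, under some assumptions: it jumps to each map with get_tag('map'), keeps the name check from the original code, and walks the inside of the map with get_token so it cannot overshoot the closing tag or call a method on an undefined token when the HTML runs out. The function name area_hrefs and the sample HTML are made up for illustration.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical helper: return the href of every <area> inside the first
# <map> whose name attribute equals $map_name.
sub area_hrefs {
    my ($html, $map_name) = @_;
    my $p = HTML::TokeParser::Simple->new(string => $html);
    my @hrefs;

    # get_tag('map') skips ahead to the next <map> start tag.
    while (my $map = $p->get_tag('map')) {
        my $name = $map->get_attr('name');
        next unless defined $name && $name eq $map_name;

        # Inside the wanted map, go token by token. The loop ends at
        # </map>, or when get_token returns undef at the end of the HTML.
        while (my $token = $p->get_token) {
            last if $token->is_end_tag('map');
            next unless $token->is_start_tag('area');
            my $href = $token->get_attr('href');
            push @hrefs, $href if defined $href;
        }
        last;    # only the first matching map is wanted
    }
    return @hrefs;
}

# Made-up sample input, just to exercise the function.
my $sample = <<'HTML';
<map name="region">
  <area href="/north.html">
  <area href="/south.html">
</map>
HTML

print "HREF: $_\n" for area_hrefs($sample, 'region');
```

Returning a list instead of printing inside the loop makes it easier to feed the hrefs into the next round of requests.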

Thanks for the reply

-pete


