[Pdx-pm] NEWBIE question - Am i making this too complex
parsing HTML
Pete Lancashire
nix at petelancashire.com
Sat May 7 17:33:05 PDT 2005
On Sat, 2005-05-07 at 16:00, Ovid wrote:
> Hi Pete,
>
> That seems like some fairly nice code for a newbie. (Are you sure
> you're a newbie?)
I consider myself new, with about 1K lines of Perl total, I guess. This
is a snippet from one of my first 'real' Perl programs, as opposed to the
many 1-20 line scripts, and my first pass at using all the debug,
warnings, strict, etc.
> I noticed you mention that you need LWP. If you can explain what
> features you need beyond LWP::Simple, I can see what I can do about
> expanding HTML::TokeParser::Simple to incorporate those needs, perhaps
> by allowing you to pass a callback that will fetch the HTML for you.
The program this is from (I was doing my own parsing up to now) is one
that queries the same server with URLs that have been created from
parsing forms. I got tired of writing comments like "# this needs to be
done by an HTML parser".
It could very well be that ::Simple will work, fingers crossed.
The next URL may need to have a cookie or two.
>
> In the meantime, if LWP::Simple were sufficient (though it sounds like
> it might not be), the following will accomplish what you have:
>
> use HTML::TokeParser::Simple 3.13;
> my $p = HTML::TokeParser::Simple->new(url => $url) or die $!;
>
> > my ($href, $token);
> >
> > while ( $token = $p->get_token ) {
> >     if ( $token->is_start_tag('map')
> >         && ( $token->get_attr('name') eq 'region' ) ) {
> >         until ( $token->is_end_tag('map') ) {
> >             $token = $p->get_token;
> >             if ( $token->is_start_tag('area') ) {
> >                 $href = $token->get_attr('href');
> >                 print "HREF:$href\n";
> >             }
> >         }
> >         last;
> >     }
> > }
>
> Remember that HTML::TokeParser::Simple is a subclass of
> HTML::TokeParser, so the methods in the latter still work. In
> particular, you can call "get_tag" with a tag name to jump straight to
> it (though you need to be careful not to overshoot other tags that are
> important.) Here's how I might write that, though I'm not sure it's
> much better.
>
> while (my $token = $p->get_tag('map')) {
>     until ($token->is_end_tag('map')) {
>         $token = $p->get_tag or last; # out of HTML
>         next unless $token->is_start_tag('area');
>         my $href = $token->get_attr('href') or next;
>         print "HREF: $href\n";
>     }
> }
>
> Cheers,
> Ovid
Thanks for the reply
-pete
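
For the cookie case mentioned above, here is a minimal sketch, not code
from the thread. It assumes LWP::UserAgent with an in-memory cookie jar
does the fetching and that the HTML is then handed to
HTML::TokeParser::Simple as a string instead of a URL. The helper names
area_hrefs and fetch_html are hypothetical, and the map name is passed in
rather than hard-coded.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::TokeParser::Simple 3.13;

# Hypothetical helper: fetch a page with a user agent that keeps cookies
# in memory for as long as the agent object lives.
sub fetch_html {
    my ($url) = @_;
    my $ua  = LWP::UserAgent->new( cookie_jar => {} );
    my $res = $ua->get($url);
    die "GET $url failed: ", $res->status_line, "\n"
        unless $res->is_success;
    return $res->decoded_content;
}

# Hypothetical helper: return the href of every <area> inside the first
# <map> whose name attribute equals $map_name.
sub area_hrefs {
    my ( $html, $map_name ) = @_;
    my $p = HTML::TokeParser::Simple->new( string => $html )
        or die "could not create parser\n";

    my @hrefs;
    while ( my $token = $p->get_token ) {
        next unless $token->is_start_tag('map');
        my $name = $token->get_attr('name');
        next unless defined $name && $name eq $map_name;

        # Walk to the closing </map>, or stop when the HTML runs out.
        while ( $token = $p->get_token ) {
            last if $token->is_end_tag('map');
            next unless $token->is_start_tag('area');
            my $href = $token->get_attr('href');
            push @hrefs, $href if defined $href;
        }
        last;
    }
    return @hrefs;
}
```

Usage would look something like
print "HREF: $_\n" for area_hrefs( fetch_html($url), 'region' );
Checking that the name attribute is defined before comparing avoids an
"uninitialized value" warning on a <map> without a name, and the inner
loop ends cleanly if the document is truncated before </map>.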