SPUG: XPath on (less-than-perfect) HTML

Skylos skylos at gmail.com
Tue Nov 17 13:55:59 PST 2009


Hm.  That's an interesting problem, but one I've handled.

The approach I've taken pragmatically with this kind of problem is
HTML::Parser - it's just an event-based parser that can fire my custom
methods when hitting various data bits - start tags, end tags, comments,
text, entities, etc.  Throw a factory pattern in front of the parser
object creation/accessor so that you get the right parser for the data
source, and just parse it.
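The factory idea can be sketched like this - a minimal, self-contained
sketch where the class names, source keys, and the parser_for() method
are all made up for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in parser classes, one per data source.
package MyParser::AmazonPage;
sub new { bless {}, shift }

package MyParser::AskPage;
sub new { bless {}, shift }

# The factory: map each data-source type to its parser class and
# hand back the right instance.
package MyParser::Factory;

my %PARSER_FOR = (
    amazon => 'MyParser::AmazonPage',
    ask    => 'MyParser::AskPage',
);

sub parser_for {
    my ($class, $source) = @_;
    my $parser_class = $PARSER_FOR{$source}
        or die "no parser registered for source '$source'";
    return $parser_class->new;
}

package main;

my $parser = MyParser::Factory->parser_for('amazon');
print ref($parser), "\n";   # prints: MyParser::AmazonPage
```

The payoff is that the calling code never names a concrete parser class;
adding a new data source is one line in the dispatch table.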

I just create a finite state machine for the particular data I'm
tracking.  For example, "the text contents of the third TD on the second
row of the table with the ID 'explicit'" would result in something like
the code below, which I typed here from memory without looking at the
documentation, so treat it as a sketch.

But at least you don't have to write regexes, and you don't have to worry
about chunking and how that affects your expressions, right?

Pragmatically speaking, that's a win - as long as your data source types
are fairly static and limited in number.

David

package myHTMLParser::tabletr2td3;
use strict;
use warnings;
use base 'HTML::Parser';

# FSM state, shared across the callbacks.
our ($state, $DATA, $tdcount, $trcount, $level);

sub new {
  my $class = shift;
  # api_version => 2 makes HTML::Parser dispatch events to the
  # start/end/text methods defined below.
  my $self = $class->SUPER::new(api_version => 2);
  ($state, $DATA, $tdcount, $trcount, $level) = ('', '', 0, 0, 0);
  return $self;
}

sub text {
  my ($self, $text) = @_;
  $DATA .= $text if $state eq 'DATACOLLECT';
}

sub start {
  my ($self, $tag, $attr) = @_;
  if ($tag eq 'table') {
    if (($attr->{id} || '') eq 'explicit') {
      $state   = 'INTABLE';
      $tdcount = 0;
      $trcount = 0;
      $level   = 0;
    }
    $level++ if $state;   # track nesting only inside the target table
  }
  if ($tag eq 'tr') {
    $trcount++ if $level == 1 && $state eq 'INTABLE';
  }
  if ($tag eq 'td') {
    $tdcount++ if $level == 1 && $trcount == 2 && $state eq 'INTABLE';
  }
  if ($state eq 'INTABLE' && $tdcount == 3 && $trcount == 2 && $level == 1) {
    $state = 'DATACOLLECT';
  }
}

sub end {
  my ($self, $tag) = @_;
  if ($tag eq 'table' && $state) {
    $level--;
    $state = '' unless $level;   # left the target table entirely
  }
  if ($tag eq 'td' && $state eq 'DATACOLLECT') {
    $state = 'INTABLE';
  }
}

sub result { $DATA }
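The same FSM can also be driven without subclassing, using HTML::Parser's
version-3 closure-based handler API.  Here's a self-contained sketch of
that variant (the sample HTML and the variable names are mine, invented
for illustration), including a tiny driver so you can see it run:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# FSM state shared by the handler closures.
my ($state, $data, $level, $trcount, $tdcount) = ('', '', 0, 0, 0);

my $p = HTML::Parser->new(
    api_version => 3,
    start_h => [ sub {
        my ($tag, $attr) = @_;
        if ($tag eq 'table') {
            if (($attr->{id} || '') eq 'explicit') {
                ($state, $level, $trcount, $tdcount) = ('INTABLE', 0, 0, 0);
            }
            $level++ if $state;   # only count nesting inside the target
        }
        $trcount++ if $tag eq 'tr' && $level == 1 && $state eq 'INTABLE';
        $tdcount++ if $tag eq 'td' && $level == 1 && $trcount == 2
                       && $state eq 'INTABLE';
        $state = 'DATACOLLECT'
            if $state eq 'INTABLE' && $level == 1
               && $trcount == 2 && $tdcount == 3;
    }, 'tagname,attr' ],
    end_h => [ sub {
        my ($tag) = @_;
        if ($tag eq 'table' && $state) {
            $level--;
            $state = '' unless $level;
        }
        $state = 'INTABLE' if $tag eq 'td' && $state eq 'DATACOLLECT';
    }, 'tagname' ],
    text_h => [ sub {
        $data .= $_[0] if $state eq 'DATACOLLECT';
    }, 'dtext' ],
);

$p->parse(<<'HTML');
<table id="explicit">
  <tr><td>a</td><td>b</td><td>c</td></tr>
  <tr><td>d</td><td>e</td><td>WANTED</td></tr>
</table>
HTML
$p->eof;
print "$data\n";   # prints: WANTED
```

Same state machine, but the handlers are anonymous subs and the argspec
strings ('tagname,attr', 'dtext') tell the parser exactly what to pass in.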

"If only I could get rid of hunger by rubbing my belly" - Diogenes


On Tue, Nov 17, 2009 at 1:33 PM, Michael R. Wolf <MichaelRWolf at att.net> wrote:

> Yes, I know that XPath can only be applied to well-formed XML.
>
> That's the theoretical, pure, absolute truth.
>
> I'm working in the real world where I can't find a well-formed page.  (For
> instance, http://validator.w3c.org does not validate such biggies as
> amazon.com, ask.com, google.com, or msn.com).  For (my) practical
> purposes, there are no valid pages.
>
> What am I to (practically, not theoretically) do?
>
> What tricks do practical XPath users know that I might not?
>
> I'm trying to scrape pages across sites to aggregate data.
>
> I'm loath to use regular expressions for all the pure reasons, but if pure
> isn't workable outside the ivory towers, that purity is useless in the real
> world.
>
> I've already tried:
>    tidy -asxhtml
>    tidy -asxml
>    HTML::TokeParser
>    XML::XPath
>    XML::LibXML
>
> I can't take step #2 because step #1 (parsing the data) fails.
>
> Thanks for *practical* ideas, tricks, tips, and pointers....
>
> Michael
>
> --
> Michael R. Wolf
>    All mammals learn by playing!
>        MichaelRWolf at att.net
>
>
>
>
> _____________________________________________________________
> Seattle Perl Users Group Mailing List
>    POST TO: spug-list at pm.org
> SUBSCRIPTION: http://mail.pm.org/mailman/listinfo/spug-list
>   MEETINGS: 3rd Tuesdays
>   WEB PAGE: http://seattleperl.org/
>

