Hm. That's an interesting problem, but one I've handled.<br><br>My pragmatic approach to this kind of problem is HTML::Parser. It's just an event-based parser that can fire my custom methods when it hits various bits of data: start tags, end tags, comments, text, entities, etc. Put a factory in front of the parser object creation/accessor so that you get the right parser for each data source, and just parse it.<br>
<br>I just create a finite state machine for the particular data I'm tracking. For example, "the text contents of the third TD on the second row of the table with the ID 'explicit'" would result in something like the code below, which is untested and may well be wrong, because I didn't look at the documentation and just typed it here.<br>
<br>But at least you don't have to write regexps, and you don't have to worry about chunking and how that affects your expressions, right?<br><br>Pragmatically speaking, that's a win, as long as your data source types are fairly static and limited in number.<br>
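<br>David<br><br>package myHTMLParser::tabletr2td3;<br>
use strict;<br>
use warnings;<br>
use base 'HTML::Parser';<br>
<br>
# Parser state, kept in package-level variables for brevity.<br>
# Create the parser with -&gt;new and no arguments so HTML::Parser<br>
# dispatches to the start/end/text methods below.<br>
our $DATA = ''; # collected text of the target cell<br>
my $state = ''; # '', 'INTABLE' or 'DATACOLLECT'<br>
my ($level, $trcount, $tdcount) = (0, 0, 0);<br>
<br>
# These are called as methods, so the first argument is the parser object.<br>
sub text {<br>
 my ($self, $text) = @_;<br>
 $DATA .= $text if $state eq 'DATACOLLECT';<br>
}<br>
<br>
sub start {<br>
 my ($self, $tag, $attr) = @_;<br>
 if ($tag eq 'table') {<br>
 if (!$state && ($attr->{id} || '') eq 'explicit') {<br>
 $state = 'INTABLE';<br>
 ($level, $trcount, $tdcount) = (0, 0, 0);<br>
 }<br>
 $level++ if $state; # track nesting only inside the target table<br>
 }<br>
 # Count rows and cells only at the top level of the target table.<br>
 return unless $state eq 'INTABLE' && $level == 1;<br>
 $trcount++ if $tag eq 'tr';<br>
 if ($tag eq 'td' && $trcount == 2) {<br>
 $tdcount++;<br>
 $state = 'DATACOLLECT' if $tdcount == 3;<br>
 }<br>
}<br>
<br>
sub end {<br>
 my ($self, $tag) = @_;<br>
 return unless $state;<br>
 if ($tag eq 'table') {<br>
 $level--;<br>
 $state = '' unless $level; # left the target table<br>
 }<br>
 elsif ($tag eq 'td' && $state eq 'DATACOLLECT' && $level == 1) {<br>
 $state = 'INTABLE'; # the third cell is closed, stop collecting<br>
 }<br>
}<br>
<br>
1;<br><br clear="all">"If only I could get rid of hunger by rubbing my belly" - Diogenes<br>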
<br><br><div class="gmail_quote">On Tue, Nov 17, 2009 at 1:33 PM, Michael R. Wolf <span dir="ltr">&lt;<a href="mailto:MichaelRWolf@att.net">MichaelRWolf@att.net</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
Yes, I know that XPath can only be applied to well-formed XML.<br>
<br>
That's the theoretical, pure, absolute truth.<br>
<br>
I'm working in the real world where I can't find a well-formed page. (For instance, <a href="http://validator.w3c.org" target="_blank">http://validator.w3c.org</a> does not validate such biggies as <a href="http://amazon.com" target="_blank">amazon.com</a>, <a href="http://ask.com" target="_blank">ask.com</a>, <a href="http://google.com" target="_blank">google.com</a>, or <a href="http://msn.com" target="_blank">msn.com</a>). For (my) practical purposes, there are no valid pages.<br>
<br>
What am I to (practically, not theoretically) do?<br>
<br>
What tricks do practical XPath users know that I might not?<br>
<br>
I'm trying to scrape pages across sites to aggregate data.<br>
<br>
I'm loath to use regular expressions for all the pure reasons, but if pure isn't workable outside the ivory towers, that purity is useless in the real world.<br>
<br>
I've already tried:<br>
tidy -asxhtml<br>
tidy -asxml<br>
HTML::TokeParser<br>
XML::XPath<br>
XML::LibXML<br>
<br>
I can't take step #2 because step #1 (parsing the data) fails.<br>
<br>
Thanks for *practical* ideas, tricks, tips, and pointers....<br>
<br>
Michael<br>
<br>
-- <br>
Michael R. Wolf<br>
All mammals learn by playing!<br>
<a href="mailto:MichaelRWolf@att.net" target="_blank">MichaelRWolf@att.net</a><br>
<br>
<br>
<br>
<br>
_____________________________________________________________<br>
Seattle Perl Users Group Mailing List<br>
POST TO: <a href="mailto:spug-list@pm.org" target="_blank">spug-list@pm.org</a><br>
SUBSCRIPTION: <a href="http://mail.pm.org/mailman/listinfo/spug-list" target="_blank">http://mail.pm.org/mailman/listinfo/spug-list</a><br>
MEETINGS: 3rd Tuesdays<br>
WEB PAGE: <a href="http://seattleperl.org/" target="_blank">http://seattleperl.org/</a><br>
</blockquote></div><br>