Hm.  That&#39;s an interesting problem, but one I&#39;ve handled.<br><br>My pragmatic approach to this kind of problem is HTML::Parser - it&#39;s just an event-based parser that fires my custom methods when it hits various bits of data - start tags, end tags, comments, text, entities, etc.  Throw a factory pattern in front of the parser object creation/accessor so that you get the right parser for the data source and just parse it.<br>


<br>I just create a finite state machine for the particular data I&#39;m tracking. For example, &quot;the text contents of the third TD on the second row of the table with the ID &quot;explicit&quot;&quot; would result in something like the code below, which is untested: I didn&#39;t look at the documentation and just typed it here.<br>


<br>But at least you don&#39;t have to regexp and you don&#39;t have to worry about chunking and how that affects your expressions, right?<br><br>Pragmatically speaking, that&#39;s a win, as long as your data source types are fairly static and limited in number.<br>


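<br>David<br><br>package myHTMLParser::tabletr2td3;<br>
use strict;<br>
use warnings;<br>
use base &#39;HTML::Parser&#39;;<br>
<br>
# State shared by the handlers below.<br>
my ($state, $level, $trcount, $tdcount) = (&#39;&#39;, 0, 0, 0);<br>
our $DATA = &#39;&#39;;   # collected text of the target cell<br>
<br>
# Called for each run of text; keep it only while inside the target cell.<br>
sub text {<br>
  my ($self, $text) = @_;<br>
  $DATA .= $text if ($state eq &#39;DATACOLLECT&#39;);<br>
}<br>
<br>
# Called for each start tag; $attr is a hash ref of the tag&#39;s attributes.<br>
sub start {<br>
  my ($self, $tag, $attr) = @_;<br>
  if ($tag eq &#39;table&#39;) {<br>
    if (($attr-&gt;{id} || &#39;&#39;) eq &#39;explicit&#39;) {<br>
      $state = &#39;INTABLE&#39;;<br>
      $tdcount = 0;<br>
      $trcount = 0;<br>
      $level = 0;<br>
    }<br>
    $level++;   # nested tables push the level above 1<br>
  }<br>
  if ($tag eq &#39;tr&#39;) {<br>
    $trcount++ if ($level == 1 &amp;&amp; $state eq &#39;INTABLE&#39;);<br>
  }<br>
  # Count cells only on the second row of the target table itself.<br>
  if ($tag eq &#39;td&#39; &amp;&amp; $level == 1 &amp;&amp; $trcount == 2 &amp;&amp; $state eq &#39;INTABLE&#39;) {<br>
    $tdcount++;<br>
    $state = &#39;DATACOLLECT&#39; if ($tdcount == 3);<br>
  }<br>
}<br>
<br>
# Called for each end tag.<br>
sub end {<br>
  my ($self, $tag) = @_;<br>
  if ($tag eq &#39;table&#39;) {<br>
    $level-- if ($level &gt; 0);<br>
    $state = &#39;&#39; unless ($level);   # left the outermost table<br>
  }<br>
  # Stop collecting when the target cell closes, not a nested table&#39;s cell.<br>
  if ($tag eq &#39;td&#39; &amp;&amp; $level == 1 &amp;&amp; $state eq &#39;DATACOLLECT&#39;) {<br>
    $state = &#39;INTABLE&#39;;<br>
  }<br>
}<br>
<br>
1;<br>
<br clear="all">&quot;If only I could get rid of hunger by rubbing my belly&quot; - Diogenes<br>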

<br><br><div class="gmail_quote">On Tue, Nov 17, 2009 at 1:33 PM, Michael R. Wolf <span dir="ltr">&lt;<a href="mailto:MichaelRWolf@att.net">MichaelRWolf@att.net</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">


Yes, I know that XPath can only be applied to well-formed XML.<br>

<br>

That&#39;s the theoretical, pure, absolute truth.<br>

<br>

I&#39;m working in the real world where I can&#39;t find a well-formed page.  (For instance, <a href="http://validator.w3c.org" target="_blank">http://validator.w3c.org</a> does not validate such biggies as <a href="http://amazon.com" target="_blank">amazon.com</a>, <a href="http://ask.com" target="_blank">ask.com</a>, <a href="http://google.com" target="_blank">google.com</a>, or <a href="http://msn.com" target="_blank">msn.com</a>).  For (my) practical purposes, there are no valid pages.<br>


<br>

What am I to (practically, not theoretically) do?<br>

<br>

What tricks do practical XPath users know that I might not?<br>

<br>

I&#39;m trying to scrape pages across sites to aggregate data.<br>

<br>

I&#39;m loath to use regular expressions for all the pure reasons, but if pure isn&#39;t workable outside the ivory towers, that purity is useless in the real world.<br>

<br>

I&#39;ve already tried:<br>

    tidy -asxhtml<br>

    tidy -asxml<br>

    HTML::TokeParser<br>

    XML::XPath<br>

    XML::LibXML<br>

<br>

I can&#39;t take step #2 because step #1 (parsing the data) fails.<br>

<br>

Thanks for *practical* ideas, tricks, tips, and pointers....<br>

<br>

Michael<br>

<br>

-- <br>

Michael R. Wolf<br>

    All mammals learn by playing!<br>

        <a href="mailto:MichaelRWolf@att.net" target="_blank">MichaelRWolf@att.net</a><br>

<br>

<br>

<br>

<br>

_____________________________________________________________<br>

Seattle Perl Users Group Mailing List<br>

    POST TO: <a href="mailto:spug-list@pm.org" target="_blank">spug-list@pm.org</a><br>

SUBSCRIPTION: <a href="http://mail.pm.org/mailman/listinfo/spug-list" target="_blank">http://mail.pm.org/mailman/listinfo/spug-list</a><br>

   MEETINGS: 3rd Tuesdays<br>

   WEB PAGE: <a href="http://seattleperl.org/" target="_blank">http://seattleperl.org/</a><br>

</blockquote></div><br>