SPUG: XPath on (less-than-perfect) HTML

Michael R. Wolf MichaelRWolf at att.net
Thu Dec 31 13:41:51 PST 2009


On Dec 31, 2009, at 1:15 PM, Joshua ben Jore wrote:

> On Tue, Nov 17, 2009 at 1:33 PM, Michael R. Wolf  
> <MichaelRWolf at att.net> wrote:
>> Yes, I know that XPath can only be applied to well-formed XML.
>>
>> That's the theoretical, pure, absolute truth.
>

> I've happily used XML::LibXML per Randal Schwartz in Linux Magazine
> (Jun 2003) at http://www.stonehenge.com/merlyn/LinuxMag/col49.html


Thanks.  Randal's article(s) were one of my motivations for using
XPath.  I got my code working after fixing two version problems on my
Mac.  Both updates seem worthwhile, though in hindsight I don't
think change #1 was strictly necessary.  Without a deep analysis of
the changes, I went with my gut (and the expertise of the authors)
and updated the CPAN module.

   1.  Updated XML::LibXML to version 1.70 from CPAN
   2.  Updated libxml2 to version 2.7.6 from MacPorts

I've appended a fragment of the code I got working.  It's not yet  
perfect (for some[1] definition of perfect), but it works.  That is, I  
did the elegant "growth" phase but haven't completed the elegant  
"prune" phase.

Enjoy,
Michael

Notes:
1. For *this* definition of perfection...

Perfection is achieved not when you have nothing more to add,
but when you have nothing left to take away.

   -- Antoine de Saint-Exupery
     -- as quoted on http://perlgolf.sourceforge.net

================================================================

     use XML::LibXML;

     # Options for parsing less-than-perfect HTML.
     my %parse_options = (
         # suppress_warnings => 1,
         suppress_errors => 1,   # don't report parse errors
         recover         => 1,   # keep parsing past errors
         # validation => 0,
     );

     # Older versions of XML::LibXML (such as v1.65 on my PC) lack
     # load_html, so fall back to a parser object there.
     my $dom;
     if (XML::LibXML->can('load_html')) {
         # Works on Mac at v1.70, but not on PC at v1.65
         # my $dom = $parser->load_html(string=>$content, \%parse_options);
         $dom = XML::LibXML->load_html(string => $content, \%parse_options);
     }
     else {
         # Works on PC at v1.65
         my $parser = XML::LibXML->new(\%parse_options);
         $dom = $parser->parse_html_string($content, \%parse_options);
     }

#... snip, snip...

my @nodes = $dom->findnodes($xpath);
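
For anyone who wants to try the whole thing end to end, here is a
minimal, self-contained sketch.  The HTML sample and the XPath
expression are made up for illustration (they are not from my real
code), and it assumes XML::LibXML 1.70 or later so that load_html is
available.  On older versions, the parser-object fallback in the
fragment above would be needed instead.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical, deliberately sloppy HTML: the <p> and <li> tags are
# never closed.
my $content = <<'HTML';
<html><body>
<p>First paragraph
<ul><li><a href="/one">One</a>
<li><a href="/two">Two</a>
</ul>
</body></html>
HTML

my %parse_options = (
    recover         => 1,   # keep parsing past errors
    suppress_errors => 1,   # don't report parse errors
);

# Assumes XML::LibXML 1.70 or later (load_html is missing before that).
my $dom = XML::LibXML->load_html(string => $content, %parse_options);

# Hypothetical XPath expression: every link inside a list item.
my $xpath = '//li/a';
my @nodes = $dom->findnodes($xpath);

# Print the link text and target of each match.
for my $node (@nodes) {
    printf "%s -> %s\n", $node->textContent, $node->getAttribute('href');
}
```

The point of the sketch is that the XPath query runs against the tree
libxml2 builds after recovery, not against the raw text, so the
missing close tags do not matter to the query.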


-- 
Michael R. Wolf
     All mammals learn by playing!
         MichaelRWolf at att.net





