SPUG: XPath on (less-than-perfect) HTML

Michael R. Wolf MichaelRWolf at att.net
Tue Nov 17 21:44:39 PST 2009


On Nov 17, 2009, at 1:33 PM, Michael R. Wolf wrote:

> Yes, I know that XPath can only be applied to well-formed XML.

...that is, unless it's told to recover from (and be quiet about)  
errors...

XML::LibXML::Parser documents recover() and (obsolete)  
recover_silently() methods.

Here's code that I got to work.

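Line 3 turns on recovery mode, so the parser continues past errors.
Line 4 raises the recovery level to 2, which also suppresses the warnings
(so line 4 on its own would be enough).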

1. use XML::LibXML;
2. my $parser = XML::LibXML->new();
3. $parser->recover(1);
4. $parser->recover(2);
5. my $doc = $parser->parse_html_string($scraped_content);

6. my ($first_node, @nodes) = $doc->findnodes('/html/head/title');
7. ok($first_node, 'HTML Title:  Found one node...');
8. ok(@nodes == 0, '... and no more nodes.');
9. my $title = $first_node->textContent();
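
For anyone who wants to paste and run it, here is the same idea as a
self-contained sketch. The sample HTML and the title_from_html() helper
are made up for illustration (they are not from my real scraper), and it
uses plain print instead of the Test::More ok() calls above.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample input: unclosed <p> and <b> tags plus a stray </div>.
my $scraped_content = <<'HTML';
<html><head><title>Example page</title></head>
<body><p>First paragraph<p>Second <b>bold</div></body></html>
HTML

# Hypothetical helper: return the text of the page title, or undef
# unless exactly one /html/head/title node is found.
sub title_from_html {
    my ($html) = @_;

    my $parser = XML::LibXML->new();
    # Level 2 = recover from errors and do not warn about them.
    $parser->recover(2);

    my $doc = $parser->parse_html_string($html);
    my ($first_node, @rest) = $doc->findnodes('/html/head/title');

    return unless $first_node && !@rest;
    return $first_node->textContent();
}

my $title = title_from_html($scraped_content);
print defined $title ? "Title: $title\n" : "No unique title found\n";
```

Wrapping the lookup in a sub keeps the "exactly one node" check in one
place; callers only need to test for undef.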


-- 
Michael R. Wolf
     All mammals learn by playing!
         MichaelRWolf at att.net





