[Chicago-talk] parsing HTML

Jonathan Rockway jon at jrock.us
Fri Feb 23 14:09:53 PST 2007


Jay Strauss wrote:
> Hi,
> 
> I need to parse out the text from HTML like:
> 
> <SPAN class="main-body"><B>Street Address</B></SPAN>
> 
> to pluck out "Street Address"
> 
> or
> 
> <SPAN class="main-body">
>     <span id="UcGeoResult11_lbZipCode"><font color="Navy">60643</font></span></SPAN>
> 
> to pluck out "60643"
> 
> Would you suggest using a regex (that I can't get to work) or some
> module (like HTML::Parser)?

I'm thinking you want HTML::TreeBuilder::XPath.  (XPath is like SQL for
trees.)

http://search.cpan.org/~mirod/HTML-TreeBuilder-XPath-0.08/lib/HTML/TreeBuilder/XPath.pm

Then you can do something like

 use HTML::TreeBuilder::XPath;
 my $tree = HTML::TreeBuilder::XPath->new;
 $tree->parse_file("mypage.html");
 # HTML::TreeBuilder lowercases tag names, so query for "span", not "SPAN"
 my $zip = $tree->
   findnodes('//span[@id="UcGeoResult11_lbZipCode"]/font');
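
For a version you can run without a file on disk, here is a minimal sketch that parses the two fragments from the question out of a string. It assumes findvalue is good enough for your purposes (it gives you the text of the matched nodes rather than the nodes themselves); the variable names are just for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Sample markup copied from the question; normally this comes from a file.
my $html = <<'HTML';
<SPAN class="main-body"><B>Street Address</B></SPAN>
<SPAN class="main-body">
    <span id="UcGeoResult11_lbZipCode"><font color="Navy">60643</font></span></SPAN>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;    # tell the parser that the input is complete

# Tag names are lowercased by the parser, so the queries use "span" and "b".
my $label = $tree->findvalue('//span[@class="main-body"]/b');
my $zip   = $tree->findvalue('//span[@id="UcGeoResult11_lbZipCode"]/font');

print "$label\n";
print "$zip\n";

$tree->delete;    # free the tree when done
```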

If you're lucky, you can do:

 my @zips = ... /span[@class="main-body"]/span/font[@color="Navy"]

or something similar to get a list of all zips instead of just a single
one.  Or you can do:

 my @data;
 my @rows = ... /span[@class="main-body"]/span
 foreach my $r (@rows){
   my $row;
   $row->{zip}  = $r-> ... /span/font[@color="Navy"];
   $row->{state}= $r-> ... /span/font[@color="Red, maybe?"];
     # etc.
   push @data, $row;
 }

Now @data is a beautifully organized data set!  (Until they change their
colors, of course -- but that's what life is like for screen-scrapers.)
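
Here is one way the two sketches above might be filled in.  The markup is made up for illustration (one inner span per row, Navy for the zip, Red for the state), so treat the XPath expressions as assumptions to adapt to the real page rather than something that will match it as-is.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical markup: the structure and colors are assumed, not from the real page.
my $html = <<'HTML';
<div>
<span class="main-body"><span><font color="Navy">60643</font> <font color="Red">IL</font></span></span>
<span class="main-body"><span><font color="Navy">60614</font> <font color="Red">IL</font></span></span>
</div>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

# All zips in one query.
my @zips = $tree->findvalues('//span[@class="main-body"]/span/font[@color="Navy"]');

# Or one hash per row.
my @data;
my @rows = $tree->findnodes('//span[@class="main-body"]/span');
foreach my $r (@rows) {
    # A leading "./" makes the query relative to the current row node.
    my $row = {
        zip   => $r->findvalue('./font[@color="Navy"]'),
        state => $r->findvalue('./font[@color="Red"]'),
    };
    push @data, $row;
}

printf "%s %s\n", $_->{zip}, $_->{state} for @data;

$tree->delete;
```

Since each row node here is already the inner span, the per-row queries only need to go down to the font element.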

Details of XPath are here: http://www.w3.org/TR/xpath

If you want to try out XPath queries visually in Firefox, get this
extension:

https://addons.mozilla.org/firefox/1095/

Regards,
Jonathan Rockway

-- 
package JAPH;use Catalyst qw/-Debug/;($;=JAPH)->config(name => do {
$,.=reverse qw[Jonathan tsu rehton lre rekca Rockway][$_].[split //,
";$;"]->[$_].q; ;for 1..4;$,=~s;^.;;;$,});$;->setup;

