[Chicago-talk] parsing HTML
Jonathan Rockway
jon at jrock.us
Fri Feb 23 14:09:53 PST 2007
Jay Strauss wrote:
> Hi,
>
> I need to parse out the text from HTML like:
>
> <SPAN class="main-body"><B>Street Address</B></SPAN>
>
> to pluck out "Street Address"
>
> or
>
> <SPAN class="main-body">
> <span id="UcGeoResult11_lbZipCode"><font color="
> Navy">60643</font></span></SPAN>
>
> to pluck out "60643"
>
> Would you suggest using a regex (that I can't get to work) or some
> module (like HTML::Parser)?
I'm thinking you want HTML::TreeBuilder::XPath. (XPath is like SQL for
trees.)
http://search.cpan.org/~mirod/HTML-TreeBuilder-XPath-0.08/lib/HTML/TreeBuilder/XPath.pm
Then you can do something like
use HTML::TreeBuilder::XPath;
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_file("mypage.html");
# findvalue returns the text of the matching node(s); findnodes returns the nodes
my $zip = $tree->findvalue('//span[@id="UcGeoResult11_lbZipCode"]/font');
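To try this without a file on disk, here is a minimal sketch that parses the two fragments from the question out of a string. It assumes the font color attribute sits on a single line in the real page, and it uses findvalue to get the text directly; the variable names are only for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# The two fragments from the question, with the wrapped attribute rejoined.
my $html = <<'HTML';
<SPAN class="main-body"><B>Street Address</B></SPAN>
<SPAN class="main-body">
<span id="UcGeoResult11_lbZipCode"><font color="Navy">60643</font></span></SPAN>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);   # parse from a string instead of a file
$tree->eof;            # tell the parser the input is complete

# Tag names are lowercased by the parser, so the XPath uses "span" and "b".
my $label = $tree->findvalue('//span[@class="main-body"]/b');
my $zip   = $tree->findvalue('//span[@id="UcGeoResult11_lbZipCode"]/font');

print "$label: $zip\n";

$tree->delete;         # free the tree when done
```

Matching on the id is the sturdier of the two queries, since an id is less likely to change than the styling.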
If you're lucky, you can do:
my @zips = ... /span[@class="main-body"]/span/font[@color="Navy"]
or something similar to get a list of all zips instead of just a single
row. Or you can do:
my @data;
my @rows = ... /span[@class="main-body"]/span
foreach my $r (@rows) {
    my $row;
    $row->{zip}   = $r-> ... /span/font[@color="Navy"];
    $row->{state} = $r-> ... /span/font[@color="Red, maybe?"];
    # etc.
    push @data, $row;
}
Now @data is a beautifully organized data set! (Until they change their
colors, of course -- but that's what life is like for screen-scrapers.)
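The loop above can be sketched as runnable code, under some loud assumptions: the markup below is invented for illustration (one outer main-body span per row, zip in a Navy font, state in a Red font), and the real page may well be laid out differently. Relative paths starting with "./" are evaluated against each row node.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical markup, not taken from the real page.
my $html = <<'HTML';
<span class="main-body"><span><font color="Navy">60643</font></span><span><font color="Red">IL</font></span></span>
<span class="main-body"><span><font color="Navy">53202</font></span><span><font color="Red">WI</font></span></span>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

my @data;
# Each outer span is treated as one row (an assumption about the layout).
foreach my $r ($tree->findnodes('//span[@class="main-body"]')) {
    push @data, {
        zip   => $r->findvalue('./span/font[@color="Navy"]'),
        state => $r->findvalue('./span/font[@color="Red"]'),
    };
}

print "$_->{zip} $_->{state}\n" for @data;

$tree->delete;
```

Building each row as a hash reference keeps the fields together, so adding another column later is one more findvalue line.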
Details of XPath are here: http://www.w3.org/TR/xpath
If you want to try out XPath queries visually in Firefox, get this
extension:
https://addons.mozilla.org/firefox/1095/
Regards,
Jonathan Rockway
--
package JAPH;use Catalyst qw/-Debug/;($;=JAPH)->config(name => do {
$,.=reverse qw[Jonathan tsu rehton lre rekca Rockway][$_].[split //,
";$;"]->[$_].q; ;for 1..4;$,=~s;^.;;;$,});$;->setup;