[Chicago-talk] parsing HTML
Jay Strauss
me at heyjay.com
Fri Feb 23 13:50:24 PST 2007
On 2/23/07, Andy Lester <andy at petdance.com> wrote:
>
> On Feb 23, 2007, at 3:18 PM, Jay Strauss wrote:
>
> > Would you suggest using a regex (that I can't get to work) or some
> > module (like HTML::Parser)?
>
> If all you want is the text, look at WWW::Mechanize's ->content()
> method.
>
> http://search.cpan.org/dist/WWW-Mechanize/lib/WWW/Mechanize.pm#%24mech-%3Econtent(...)
>
> $mech->content( format => "text" )
>
> Returns a text-only version of the page, with all HTML markup
> stripped. This feature requires HTML::TreeBuilder to be installed, or
> a fatal error will be thrown.
I'm already using mech, but I have to parse a specific table. The code
I'm using is below. Is there a way to make mech return just the text of
a table?
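use WWW::Mechanize;
use HTML::TableContentParser;
use Data::Dumper;

# $someZip and $someAddress are assumed to be set earlier in the program.
my $mech = WWW::Mechanize->new();
$mech->get( 'http://www.ffiec.gov/Geocode/default.aspx' );

# Fill in and submit the search form.
$mech->form_name("Form1");
$mech->submit_form(
    form_name => "Form1",
    fields    => {
        txtZipCode => $someZip,
        txtAddress => $someAddress,
    },
    button    => "btnSearch",
);

# Parse every table on the result page and keep the one whose id matches "table2".
my $p = HTML::TableContentParser->new();
my $tables_ref = $p->parse($mech->content);
my ($geo_data) = grep { defined $_->{id} && $_->{id} =~ /table2/i } @$tables_ref;

if ($geo_data) {
    foreach my $row (@{$geo_data->{rows}}) {
        foreach my $cell (@{$row->{cells}}) {
            # Print only when the match succeeds; otherwise $1 would be
            # left over from an earlier cell.
            if (defined $cell->{data} && $cell->{data} =~ />(\s*[^<\s][^<]*)</) {
                print $1, "\n";
            }
        }
    }
}
else { # Address not found
    my $na = 'n/a';
    #return ($na, $na, $na);
}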
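
A minimal sketch of the cell-text match from the loop above, pulled out into a helper so it can be tried without hitting the site. The pattern is the one already used in the code; the helper name cell_text and the two sample cell strings are made up for illustration and are not taken from the real page.

```perl
use strict;
use warnings;

# Return the first run of non-markup text found in a cell's HTML,
# or undef if the cell contains no such text.
sub cell_text {
    my ($html) = @_;
    return undef unless defined $html;
    # A match in list context returns the capture on success and an
    # empty list on failure, so a stale $1 is never picked up.
    my ($text) = $html =~ />(\s*[^<\s][^<]*)</;
    return $text;
}

# Hypothetical cell contents, for illustration only.
my @cells = (
    '<span id="lblState">17</span>',
    '<img src="spacer.gif">',
);

for my $cell (@cells) {
    my $text = cell_text($cell);
    print defined $text ? $text : 'n/a', "\n";
}
```

Taking the capture from the list-context match, rather than reading $1 after the fact, means a cell with no text yields undef and can be reported as 'n/a' instead of repeating the previous cell's value.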