[Chicago-talk] parsing HTML

Fri Feb 23 13:50:24 PST 2007

On 2/23/07, Andy Lester <andy at petdance.com> wrote:
>
> On Feb 23, 2007, at 3:18 PM, Jay Strauss wrote:
>
> > Would you suggest using a regex (that I can't get to work) or some
> > module (like HTML::Parser)?
>
> If all you want is the text, look at WWW::Mechanize's  ->content()
> method.
>
> http://search.cpan.org/dist/WWW-Mechanize/lib/WWW/Mechanize.pm#%24mech-%3Econtent(...)
>
> $mech->content( format => "text" )
>
>      Returns a text-only version of the page, with all HTML markup
> stripped. This feature requires HTML::TreeBuilder to be installed, or
> a fatal error will be thrown.

I'm already using mech, but I have to parse one specific table; the
code I'm using is below.  Is there a way to make mech return just the
text of a table?

	use WWW::Mechanize;
	use HTML::TableContentParser;
	use Data::Dumper;

   my $mech = WWW::Mechanize->new();

   $mech->get( 'http://www.ffiec.gov/Geocode/default.aspx' );

   $mech->form_name("Form1");

   $mech->submit_form(
		form_name => "Form1",
		fields => {
		   txtZipCode	=> $someZip,
		   txtAddress	=> $someAddress,
		},
		button => "btnSearch",
	);

   my $p = HTML::TableContentParser->new();
   my $tables_ref = $p->parse($mech->content);

	# pick the table whose id matches; skip tables that have no id attribute
	my ($geo_data) = grep { defined $_->{id} && $_->{id} =~ /table2/i } @$tables_ref;

	if ($geo_data) {

		foreach my $row (@{$geo_data->{rows}}) {

			foreach my $cell (@{$row->{cells}}) {
				# print only when the match succeeds, so a stale $1 is never reused
				if ($cell->{data} =~ />(\s*[^<\s][^<]*)</) {
					print $1,"\n";
				}
			}
		}
	}
	else { # Address not found
		$na = 'n/a';
		#return ($na, $na, $na);
	}
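
One possible way to get just the text of a single table is to walk the
parse tree with HTML::TreeBuilder (the module that
content( format => "text" ) already requires).  This is only a sketch:
the helper name table_text and the sample HTML are made up for
illustration, and it assumes the table id matches /table2/i and that the
table contains no nested tables.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: return the text of every cell in the first table
# whose id matches $id_re, as an array ref of rows (each an array ref).
sub table_text {
    my ($html, $id_re) = @_;

    my $tree  = HTML::TreeBuilder->new_from_content($html);
    my $table = $tree->look_down( _tag => 'table', id => $id_re );

    my @rows;
    if ($table) {
        for my $tr ( $table->look_down( _tag => 'tr' ) ) {
            # as_trimmed_text drops the markup and surrounding whitespace
            push @rows, [ map { $_->as_trimmed_text }
                          $tr->look_down( _tag => qr/^t[dh]$/ ) ];
        }
    }

    $tree->delete;    # free the parse tree
    return \@rows;    # empty list if no table matched
}

# Made-up sample; in the script above this would be $mech->content
my $html = '<table id="Table2"><tr><td><b>MSA Code</b></td>'
         . '<td> 16974 </td></tr></table>';
print join( "\t", @$_ ), "\n" for @{ table_text( $html, qr/table2/i ) };
```

With something like this, the "address not found" case would simply be an
empty row list, so no per-cell regex should be needed.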