[Chicago-talk] Question about removing '’'
Jay Strauss
me at heyjay.com
Fri Sep 28 07:41:55 PDT 2012
Hi,
I'm scraping a web page (code below) using HTML::TreeBuilder. I'm trying
to get the text between the <td> and </td> tags, but some of the values
have a ’ character embedded in them, like:
<td align="left" nowrap>Today’s Volume</td>
What I want to do is remove the "’" or convert it to a single quote
within the HTML::TreeBuilder object itself, figuring that's probably a
more reliable approach.
What I'm currently doing is just converting each cell to text and applying a regex:
my $text = $cell->as_text;
$text =~ s/Today.s Volume/Today's Volume/;
Any suggestions on how to do this?
Thanks
Jay
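use strict;
use warnings;
use WWW::Mechanize;
use HTML::TreeBuilder 5 -weak;
use Data::Dumper;

# Cell text may contain non-ASCII characters such as ’
binmode STDOUT, ':encoding(UTF-8)';

my $mech = retrieve_graham_quote("DELL");
my $info = parse_page($mech);

# Fetch the quote page for a ticker and return the WWW::Mechanize object.
sub retrieve_graham_quote {
    my $ticker = shift;
    my $base_url = 'http://www.grahaminvestor.com/quotes/?ticker=';
    my $mech = WWW::Mechanize->new();
    $mech->get( $base_url.$ticker );
    return $mech;
}

# Print every cell of the first table as text and as HTML.
# Returns an array ref of the cell texts.
sub parse_page {
    my $mech = shift;
    my $tree = HTML::TreeBuilder->new;
    # parse_content() parses the whole document and calls eof() for us
    $tree->parse_content($mech->content());
    my $table = $tree->look_down('_tag', 'table')
        or die "No <table> found in page\n";
    my @cells;
    foreach my $row ($table->look_down('_tag', 'tr')) {
        foreach my $cell ($row->look_down('_tag', 'td')) {
            my $text = $cell->as_text;
            # Current workaround: match the odd character with "."
            $text =~ s/Today.s Volume/Today's Volume/;
            print "-", $text, "\n";
            print "-", $cell->as_HTML, "\n";
            push @cells, $text;
        }
    }
    return \@cells;
}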
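
One possible way to do the conversion inside the tree itself, offered as a rough sketch rather than a confirmed fix: walk every element and rewrite its text children in place through HTML::Element's content_refs_list. This assumes the stray character is U+2019 (RIGHT SINGLE QUOTATION MARK, what &rsquo; decodes to); the helper name normalize_quotes and the sample HTML are made up for illustration and are not part of the script above.

```perl
use strict;
use warnings;
use HTML::TreeBuilder 5 -weak;

# Hypothetical helper: replace U+2019 with a plain apostrophe in every
# text node of the tree. Assumes the odd character really is U+2019.
sub normalize_quotes {
    my $tree = shift;
    foreach my $element ($tree, $tree->descendants) {
        # content_refs_list returns references to each child, so text
        # children (plain strings, not objects) can be edited in place.
        foreach my $item_ref ($element->content_refs_list) {
            next if ref $$item_ref;          # skip child elements
            $$item_ref =~ tr/\x{2019}/'/;
        }
    }
    return $tree;
}

# Made-up sample input standing in for the real page.
my $html = '<table><tr><td align="left" nowrap>'
         . 'Today&rsquo;s Volume</td></tr></table>';
my $tree = HTML::TreeBuilder->new_from_content($html);
normalize_quotes($tree);

my $cell = $tree->look_down('_tag', 'td');
print $cell->as_text, "\n";
```

Calling normalize_quotes($tree) once, right after parsing, would mean every later as_text call sees the plain apostrophe, so the per-cell regex is no longer needed. Other typographic characters (curly double quotes, dashes) would need their own entries in the tr/// lists.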