[Chicago-talk] Question about removing '’'

Jay Strauss me at heyjay.com
Fri Sep 28 07:41:55 PDT 2012


Hi,

I'm scraping a web page (code below) using HTML::TreeBuilder.  I'm trying
to get the text between the <td> and </td> tags, but some of the values
have a ’ embedded in them, like:

<td align="left" nowrap>Today’s Volume</td>

What I want to do is remove the "’" or convert it to a single quote
within the HTML::TreeBuilder object itself, since I figure that's probably
the more reliable approach.

What I'm currently doing is just converting to text and running a regex:

my $text = $cell->as_text;
$text =~ s/Today.s Volume/Today's Volume/;
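
A slightly more general form of this text-level workaround, as a sketch
only: rather than matching one known label, translate the character
itself. This assumes the text has already been decoded, so that the ’ is
the single character U+2019 (and its opening counterpart is U+2018).
ascii_quotes is a hypothetical helper name, not part of HTML::TreeBuilder.

```perl
use strict;
use warnings;

# Hypothetical helper: map typographic single quotes to an ASCII apostrophe.
# Assumes $text holds decoded characters, not raw UTF-8 bytes.
sub ascii_quotes {
    my ($text) = @_;
    $text =~ tr/\x{2018}\x{2019}/''/;    # left and right single quotes -> '
    return $text;
}

print ascii_quotes("Today\x{2019}s Volume"), "\n";    # prints: Today's Volume
```

If the page was decoded with the wrong charset, the quote may show up as
several junk characters instead of U+2019, and this would not match.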

Any suggestions on how to do this?

Thanks
Jay




use strict;
use warnings;

use WWW::Mechanize;
use HTML::TreeBuilder 5 -weak;
use Data::Dumper;

my $mech = retrieve_graham_quote("DELL");
my $info = parse_page($mech);


# Fetch the quote page for a ticker and return the WWW::Mechanize object.
sub retrieve_graham_quote {

    my $ticker = shift;

    my $base_url = 'http://www.grahaminvestor.com/quotes/?ticker=';

    my $mech = WWW::Mechanize->new();
    $mech->get( $base_url.$ticker );

    return $mech;

}

# Print the text and HTML of every <td> in the first table on the page.
sub parse_page {

    my $mech = shift;

    my $tree = HTML::TreeBuilder->new;
    $tree->parse($mech->content());
    $tree->eof;    # signal end of input so the tree is finalized

    my $table = $tree->look_down('_tag','table');

    foreach my $row ($table->look_down('_tag', 'tr')) {

        foreach my $cell ($row->look_down('_tag', 'td')) {

            my $text = $cell->as_text;
            $text =~ s/Today.s Volume/Today's Volume/;

            print "-", $text,"\n";
            print "-", $cell->as_HTML,"\n";

        }
    }

}
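
For the in-tree conversion asked about above, one possible approach is to
walk every element and rewrite its plain-text children in place through
HTML::Element's content_refs_list. This is a sketch under assumptions, not
a tested fix for this particular page: it assumes the parsed text contains
the decoded character U+2019, and straighten_quotes is a hypothetical
helper name.

```perl
use strict;
use warnings;
use HTML::TreeBuilder 5 -weak;

# Hypothetical helper: replace U+2019 with "'" in every text node of a tree.
sub straighten_quotes {
    my ($tree) = @_;
    for my $element ($tree, $tree->descendants) {
        # content_refs_list returns references to this element's direct
        # children; text children are plain (unblessed) strings.
        for my $child_ref ($element->content_refs_list) {
            next if ref $$child_ref;          # skip child elements
            $$child_ref =~ tr/\x{2019}/'/;    # edit the text node in place
        }
    }
    return $tree;
}

# Small stand-in for the fetched page, using the cell from the question.
my $tree = HTML::TreeBuilder->new_from_content(
    "<table><tr><td align=\"left\" nowrap>Today\x{2019}s Volume</td></tr></table>"
);
straighten_quotes($tree);
print $tree->look_down('_tag', 'td')->as_text, "\n";
```

Called once right after parsing, this would let the later as_text and
as_HTML calls see the plain apostrophe without a per-label regex.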