[Chicago-talk] Question about removing '’'

Jay Strauss me at heyjay.com
Fri Sep 28 08:33:57 PDT 2012


Oops, I meant an HTML::TreeBuilder object, not HTML::TreeObject.

On Fri, Sep 28, 2012 at 10:33 AM, Jay Strauss <me at heyjay.com> wrote:

> Thanks Doug.  I'm not sure how that's different than what I'm doing?
>
> In that I want to actually change the contents within the
> HTML::TreeObject, and not just decode (or regex) the output of
> $cell->as_HTML.
>
> Maybe I missed something
>
> Thanks
> Jay
>
>
> On Fri, Sep 28, 2012 at 9:53 AM, Doug Bell <madcityzen at gmail.com> wrote:
>
>>
>> On Sep 28, 2012, at 9:41 AM, Jay Strauss <me at heyjay.com> wrote:
>>
>> > Hi,
>> >
>> > I'm scraping a web page (code below) using HTML::TreeBuilder.  I'm
>> trying to get the info between the <td> </td>, but embedded in some of the
>> values is a ’  like:
>> >
>> > <td align="left" nowrap>Today’s Volume</td>
>> >
>> > What I want to do is remove the "’" or convert to a single quote,
>> within the HTML::TreeBuilder object, figuring that's probably a more
>> reliable approach.
>>
>> That &foo; construct is an "HTML Entity", which the HTML::Entities module
>> can decode for you, like:
>>
>> use HTML::Entities qw( decode_entities );
>> print decode_entities( 'That&rsquo;s all folks!' );
>>
>> That entity is specifically a right single quotation mark (the curly
>> closing quote), so if that exact character is not what you want, then you
>> could use your regular expression to change it to a straight single quote
>> (the ' character).
>>
>> Doug Bell
>> madcityzen at gmail.com
>>
>>
>>
>> _______________________________________________
>> Chicago-talk mailing list
>> Chicago-talk at pm.org
>> http://mail.pm.org/mailman/listinfo/chicago-talk
>>
>
>
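
For changing the text inside the tree itself (rather than post-processing
the output of $cell->as_HTML), here is a minimal sketch. It assumes the
parser's default behaviour of decoding entities while parsing, so the text
nodes already hold the curly-quote character (U+2019) and not the entity
text. The helper name straighten_quotes and the sample HTML are made up
for illustration; content_refs_list is the HTML::Element method that
returns references to an element's children, which lets the text children
be edited in place.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper (not part of HTML::TreeBuilder): walk an element's
# children and replace U+2019 with a straight quote in every text node.
sub straighten_quotes {
    my ($element) = @_;
    for my $ref ( $element->content_refs_list ) {
        if ( ref $$ref ) {
            straighten_quotes($$ref);    # child element: recurse into it
        }
        else {
            $$ref =~ s/\x{2019}/'/g;     # text node: edit it in place
        }
    }
    return $element;
}

# Sample input modelled on the cell from the question (an assumption).
my $html = '<table><tr><td align="left" nowrap>Today&rsquo;s Volume</td></tr></table>';
my $tree = HTML::TreeBuilder->new_from_content($html);

# Fix every <td> inside the tree, then collect the cell text.
my @cells = map { straighten_quotes($_)->as_text }
    $tree->look_down( _tag => 'td' );

print "$_\n" for @cells;
$tree->delete;
```

Because the substitution happens on the tree's own text nodes, later calls
such as as_text or as_HTML on those cells see the straight quote.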