[Chicago-talk] parsing HTML

Jim Thomason jim at jimandkoka.com
Fri Feb 23 13:34:56 PST 2007


How much control do you have over the HTML that you're parsing? If you
can make reasonable assumptions about its structure, your life is much
easier. I'm a big proponent of using regexes to keep your life simple
instead of parsers or new modules or whatnot. But if you have no
control over your source HTML, a parser from the get-go would probably
make your life easier.

Yes, yes, yes, insert standard lines here about how if the job gets
much more complicated, you may need to go back and re-engineer to use
a parser. Yadda yadda yadda.
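
For comparison, here is roughly what the parser route could look like. It's
only a sketch, assuming HTML::Parser (version 3 API) is installed from CPAN;
the text_chunks helper name is made up for illustration. It collects every
text node that contains at least one non-whitespace character.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: return every text node in $html that contains
# at least one non-whitespace character, in document order.
sub text_chunks {
    my ($html) = @_;
    my @chunks;
    my $parser = HTML::Parser->new(
        api_version   => 3,
        unbroken_text => 1,    # deliver each text run as a single event
        text_h        => [
            sub {
                my ($text) = @_;
                push @chunks, $text if $text =~ /\S/;
            },
            'dtext',           # pass entity-decoded text to the handler
        ],
    );
    $parser->parse($html);
    $parser->eof;
    return @chunks;
}

print "$_\n" for text_chunks(
    '<SPAN class="main-body"><B>Street Address</B></SPAN>'
);
```

This doesn't care how deeply the text is nested or how the tags are
capitalized, which is the main thing you'd be buying with the extra module.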

So, what do you know about this data? For example, would />([^<]+)</
do what you want?

I suppose />(\s*[^<\s][^<]*)</ if you want to require at least one
non-whitespace character in what you extract.
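
A quick way to try both patterns against the two snippets from the original
message. This is a minimal sketch: the first_text helper is a made-up name,
and the whitespace in the second sample is only approximated.

```perl
use strict;
use warnings;

# The two snippets from Jay's message (indentation approximated).
my @samples = (
    '<SPAN class="main-body"><B>Street Address</B></SPAN>',
    qq{<SPAN class="main-body">\n        }
      . '<span id="UcGeoResult11_lbZipCode"><font color="Navy">60643</font></span></SPAN>',
);

# Hypothetical helper: return the first capture of $re in $html,
# or undef if the pattern does not match at all.
sub first_text {
    my ($re, $html) = @_;
    return $html =~ $re ? $1 : undef;
}

my $simple = qr/>([^<]+)</;             # any run of text between > and <
my $strict = qr/>(\s*[^<\s][^<]*)</;    # same, but needs a non-space character

for my $html (@samples) {
    for my $re ($simple, $strict) {
        my $text = first_text($re, $html);
        print defined $text ? "[$text]\n" : "(no match)\n";
    }
}
```

On the first snippet both patterns capture "Street Address". On the second,
the simple pattern stops at the run of whitespace between the outer SPAN and
the inner span, while the stricter one skips past it and captures "60643".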

-Jim.....

On 2/23/07, Jay Strauss <me at heyjay.com> wrote:
> Hi,
>
> I need to parse out the text from HTML like:
>
> <SPAN class="main-body"><B>Street Address</B></SPAN>
>
> to pluck out "Street Address"
>
> or
>
> <SPAN class="main-body">
>                                 <span id="UcGeoResult11_lbZipCode"><font color="Navy">60643</font></span></SPAN>
>
> to pluck out "60643"
>
> Would you suggest using a regex (that I can't get to work) or some
> module (like HTML::Parser)?
>
> Thanks
> Jay

