[Chicago-talk] parsing HTML
me at heyjay.com
Fri Feb 23 13:50:54 PST 2007
On 2/23/07, Jim Thomason <jim at jimandkoka.com> wrote:
> How much control do you have over the HTML that you're parsing? If you
> can make reasonable assumptions about its structure, your life is much
> easier. I'm a big proponent of using regexes to keep your life simple
> instead of parsers or new modules or whatnot. But, if you have no
> control over your source HTML, a parser from the get-go would probably
> make your life easier.
> Yes, yes, yes, insert standard lines here about how if the job gets
> much more complicated, you may need to go back and re-engineer to use
> a parser. Yadda yadda yadda.
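
For comparison, here is a minimal sketch of the parser route, written in Python with the standard library's html.parser module. The thread does not name a specific parser, so the choice of module is an assumption. The TextCollector class and the sample markup are hypothetical, made up for illustration.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: collects every non-blank text node it sees."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for each run of text between tags.
        text = data.strip()
        if text:  # skip whitespace-only runs
            self.chunks.append(text)


# Made-up sample input, not taken from the thread.
collector = TextCollector()
collector.feed("<p>Hello <b>world</b></p><p>   </p>")
collector.close()

chunks = collector.chunks
print(chunks)
```

The parser tracks where tags start and end, so the code does not have to assume anything about how the markup is laid out.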
> So, what do you know about this data? For example, would />([^<]+)</
> do what you want?
> I suppose />(\s*[^<\s][^<]*)</ if you want to extract something with a
> non-whitespace character.
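
To see how the two patterns quoted above differ, here is a small sketch using Python's re module, which accepts both patterns unchanged. The sample string and the variable names are made up for illustration and do not come from the thread.

```python
import re

# First quoted pattern: any run of non-'<' characters between a '>'
# and the next '<'. Whitespace-only runs also match.
ANY_TEXT = re.compile(r">([^<]+)<")

# Second quoted pattern: same idea, but the run must contain at least
# one non-whitespace character.
NON_BLANK_TEXT = re.compile(r">(\s*[^<\s][^<]*)<")

# Hypothetical sample input.
sample = "<tr>\n  <td>Chicago</td><td> </td><td>  Perl Mongers</td>\n</tr>"

any_hits = ANY_TEXT.findall(sample)
non_blank_hits = NON_BLANK_TEXT.findall(sample)

# The second pattern keeps leading whitespace inside the capture,
# so strip it afterwards if only the text itself is wanted.
cleaned = [hit.strip() for hit in non_blank_hits]

print(any_hits)
print(non_blank_hits)
print(cleaned)
```

On this sample the first pattern also returns the newlines and indentation between tags, while the second returns only the two cells that contain text. Both still rely on the text never containing a literal '<', which is the kind of structural assumption the quoted message asks about.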
I have zero control over the HTML returned. I'm parsing the results of