[Chicago-talk] parsing HTML

Jay Strauss me at heyjay.com
Fri Feb 23 13:50:54 PST 2007


On 2/23/07, Jim Thomason <jim at jimandkoka.com> wrote:
> How much control do you have over the HTML that you're parsing? If you
> can make reasonable assumptions about its structure, your life is much
> easier. I'm a big proponent of using regexes to keep your life simple
> instead of parsers or new modules or whatnot. But if you have no
> control over your source HTML, a parser from the get-go would probably
> make your life easier.
>
> Yes, yes, yes, insert standard lines here about how if the job gets
> much more complicated, you may need to go back and re-engineer to use
> a parser. Yadda yadda yadda.
>
> So, what do you know about this data? For example, would />([^<]+)</
> do what you want?
>
> I suppose />(\s*[^<\s][^<]*)</ if you want the extracted text to
> contain at least one non-whitespace character.
>
> -Jim.....

I have zero control over the HTML returned.  I'm parsing the results of

http://www.ffiec.gov/Geocode/default.aspx
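
For anyone who wants to try the two patterns Jim suggests, here is a
minimal sketch. It is written in Python only so it is easy to run
standalone; the patterns themselves are unchanged and behave the same way
in Perl. The sample markup, the names ANY_TEXT, NON_BLANK_TEXT and
extract, and the values in the sample are all made up for illustration;
none of it is the actual output of the FFIEC page.

```python
import re

# Jim's first pattern: any run of non-"<" characters between ">" and "<".
ANY_TEXT = re.compile(r">([^<]+)<")

# Jim's second pattern: same idea, but the run must contain at least one
# non-whitespace character, so whitespace between tags is skipped.
NON_BLANK_TEXT = re.compile(r">(\s*[^<\s][^<]*)<")

# Hypothetical sample markup (an assumption, not the real page).
SAMPLE = "<tr>\n  <td>Label</td><td> 12345 </td>\n</tr>"


def extract(pattern, html):
    """Return every captured text run, in document order."""
    return pattern.findall(html)


if __name__ == "__main__":
    print(extract(ANY_TEXT, SAMPLE))        # includes whitespace-only runs
    print(extract(NON_BLANK_TEXT, SAMPLE))  # only runs with visible text
```

On this sample the first pattern also returns the whitespace-only runs
between tags, while the second returns just "Label" and " 12345 " (with
the surrounding spaces still attached, so a trim step would be needed).
Neither pattern knows anything about nesting, comments, or attributes
containing ">", which is where a real parser earns its keep.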

