[Chicago-talk] parsing HTML

Jim Thomason jim at jimandkoka.com
Fri Feb 23 15:45:00 PST 2007


> how come />(\s*[^<]*)</ doesn't work?

Define "doesn't work". That regex will extract the text between a >
and the next < (i.e., stuff outside of HTML tags) (assuming valid
html, blah blah blah), but it also extracts blank space. Such as this:

<b>     </b>

\s* would match all the spaces, then [^<]* would match nothing. It's a
fine regex if you don't mind extracting empty snippets. But, in
that case, />([^<]*)</ will work just as well, since the spaces will
just be captured by the [^<]*. Note that by using * you allow for
matching nothing, a la <b></b>
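
If you want to see that for yourself, here's a rough sketch. It's in
Python's re module rather than whatever you're running, but the
patterns are the same minus the / delimiters. The variable names and
sample strings are just made up for illustration.

```python
import re

# The two patterns from above, without the / delimiters.
with_spaces = re.compile(r">(\s*[^<]*)<")
plain = re.compile(r">([^<]*)<")

samples = ["<b>     </b>", "<b></b>", "<b>   string   </b>"]

for html in samples:
    # Both patterns capture the same text, blank or empty runs included.
    a = with_spaces.search(html).group(1)
    b = plain.search(html).group(1)
    print(repr(html), "->", repr(a), repr(b))
```

Both columns should come out identical for every sample, which is the
point: the leading \s* isn't buying you anything there.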


>> I suppose />(\s*[^<\s][^<]*)</ if you want to extract something with a
>> non-whitespace character.
>I don't understand that regex.

Break it down. It matches 0 or more whitespace characters, then one
character that's not whitespace or a <, then any number of additional
characters that aren't <'s (but could be spaces). So that regex would
not match <b> </b> or <b></b>, since there are no non-space characters
in there. "<b>   string   </b>" would be matched by either regex.

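Same deal for the stricter regex, again as a quick sketch in Python's
re module (the extract helper and the sample strings are mine, purely
for illustration):

```python
import re

# Requires at least one character that is neither whitespace nor "<".
nonblank = re.compile(r">(\s*[^<\s][^<]*)<")

def extract(html):
    """Return each captured run that has a non-whitespace character."""
    return nonblank.findall(html)

print(extract("<b>     </b>"))          # []
print(extract("<b></b>"))               # []
print(extract("<b>   string   </b>"))   # ['   string   ']
print(extract("<p>one</p> <p>two</p>")) # ['one', 'two']
```

Note the surrounding spaces are still part of what gets captured, so
strip them afterwards if you only want the word.
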
-Jim.....

