[Chicago-talk] parsing HTML

Steven Lembark lembark at wrkhors.com
Sun Mar 4 12:49:56 PST 2007

-- Jim Thomason <jim at jimandkoka.com>

>> how come />(\s*[^<]*)</ doesn't work?
> Define "doesn't work". That regex will extract the text from between
> angle brackets (i.e., stuff outside of HTML tags) (assuming valid
> html, blah blah blah), but it also extracts blank space. Such as this:
> <b>     </b>
> \s* would match all the spaces, then [^<]* would match nothing. It's a
> fine regex if you don't mind extracting out empty snippets. But, in
> that case, />([^<]*)</ will work just as well, since the spaces will
> just be captured by the [^<]*. Note that by using * you allow for
> matching nothing, ala <b></b>
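
A minimal sketch of the point above, run against the examples from the quoted text. The helper name snippets() and the sample strings are my own for illustration, not from the original question.

```perl
use strict;
use warnings;

# Return every capture of $re found in $html (call in list context).
sub snippets {
    my ( $re, $html ) = @_;
    return $html =~ /$re/g;
}

my @patterns = (
    [ 'with \s*' => qr/>(\s*[^<]*)</ ],
    [ 'without'  => qr/>([^<]*)</ ],
);

for my $html ( '<b>     </b>', '<b></b>', '<b>text</b>' ) {
    for my $pair (@patterns) {
        my ( $label, $re ) = @$pair;
        my @found = snippets( $re, $html );
        # Quote each capture so blank and empty snippets stay visible.
        printf "%-14s %-9s => %s\n", $html, $label,
            join( ' ', map { "'$_'" } @found );
    }
}
```

Both patterns report a blank snippet for the first input and an empty one for the second, which is the "empty snippets" behaviour described above.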

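You might find that

    m{ > ( .+? ) < }xs

works about as well (i.e., shortest non-empty match between two
angle brackets) and may be a bit faster. Unlike [^<]*, though, .+?
needs at least one character, so with adjacent tags such as
<b></b><i> it can run past a '<'.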
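
As a rough sketch, the non-greedy form can be compared with the negated character class on a couple of inputs. The helper first_text() and the sample strings are assumptions for illustration, and nothing here measures speed.

```perl
use strict;
use warnings;

my $lazy    = qr{ > ( .+? ) < }xs;     # shortest non-empty run after a '>'
my $negated = qr{ > ( [^<]* ) < }x;    # run of non-'<' characters, may be empty

# First capture of $re in $html, or undef when nothing matches.
sub first_text {
    my ( $re, $html ) = @_;
    return $html =~ $re ? $1 : undef;
}

for my $html ( '<b>bold</b>', '<b></b><i>x</i>' ) {
    my ( $l, $n ) = map { first_text( $_, $html ) } $lazy, $negated;
    printf "%-18s lazy='%s' negated='%s'\n", $html,
        defined $l ? $l : 'undef',
        defined $n ? $n : 'undef';
}
```

The two agree on the first input. On the second, the non-greedy form has to consume at least one character, so it captures the closing tag, while the negated class captures an empty string.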

>>> I suppose />(\s*[^<\s][^<]*)</ if you want to extract something with a
>>> non-whitespace character.
>> I don't understand that regex.
> Break it down. It matches 0 or more spaces, then one character that's
> not a space or a <, then any number of additional characters that
> aren't <'s (but could be spaces). So that regex would not match <b>
> </b> or <b></b>, since there are no non-space characters in there.
> "<b>   string   </b>" would be matched by either regex.
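
The breakdown above can be checked with a short sketch. The helper captured() and the labels are hypothetical; the patterns and inputs are the ones quoted.

```perl
use strict;
use warnings;

# Capture of $re in $html, or undef when the pattern does not match.
sub captured {
    my ( $re, $html ) = @_;
    return $html =~ $re ? $1 : undef;
}

my $any_text   = qr/>(\s*[^<]*)</;         # allows empty or blank captures
my $needs_text = qr/>(\s*[^<\s][^<]*)</;   # needs one non-space character

for my $html ( '<b> </b>', '<b></b>', '<b>   string   </b>' ) {
    for my $pair ( [ any => $any_text ], [ needs => $needs_text ] ) {
        my ( $name, $re ) = @$pair;
        my $got = captured( $re, $html );
        printf "%-22s %-6s %s\n", $html, $name,
            defined $got ? "'$got'" : 'no match';
    }
}
```

Only the last input satisfies the stricter pattern, and there both patterns capture the same text, surrounding spaces included.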

Possible typo; to trim the whitespace around the captured text it
should have been

    m{ > \s* ( \S .+? ) \s* < }xs
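
A small sketch of the trimming idea, under some assumptions: the posted pattern is compiled with /x so the layout spaces are ignored (and /s so '.' spans newlines), and $variant is a hypothetical alternative of mine, not something from the thread, that uses negated classes to handle one-character text.

```perl
use strict;
use warnings;

# Pattern from the post; /x and /s are assumed here.
my $posted = qr{ > \s* ( \S .+? ) \s* < }xs;

# Hypothetical variant (not from the post): negated classes keep the
# capture from crossing a '<' and allow one-character text.
my $variant = qr{ > \s* ( [^<\s] [^<]*? ) \s* < }x;

# Trimmed capture of $re in $html, or undef when nothing matches.
sub trimmed {
    my ( $re, $html ) = @_;
    return $html =~ $re ? $1 : undef;
}

for my $html ( '<b>   string   </b>', '<b>x</b>' ) {
    for my $pair ( [ posted => $posted ], [ variant => $variant ] ) {
        my ( $name, $re ) = @$pair;
        my $got = trimmed( $re, $html );
        printf "%-22s %-8s %s\n", $html, $name,
            defined $got ? "'$got'" : 'no match';
    }
}
```

Both forms strip the padding around "string". The posted form needs at least two characters of text, so it finds nothing in the one-character case, which is the reason for sketching the variant.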

Steven Lembark                                       85-09 90th Street
Workhorse Computing                                Woodhaven, NY 11421
lembark at wrkhors.com                                     1 888 359 3508
