[Chicago-talk] parsing HTML

Steven Lembark lembark at wrkhors.com
Sun Mar 4 12:49:56 PST 2007



-- Jim Thomason <jim at jimandkoka.com>

>> how come />(\s*[^<]*)</ doesn't work?
>
> Define "doesn't work". That regex will extract the text from between
> angle brackets (i.e., stuff outside of HTML tags) (assuming valid
> html, blah blah blah), but it also extracts blank space. Such as this:
>
> <b>     </b>
>
> \s* would match all the spaces, then [^<]* would match nothing. It's a
> fine regex if you don't mind extracting out empty snippets. But, in
> that case, />([^<]*)</ will work just as well, since the spaces will
> just be captured by the [^<]*. Note that by using * you allow for
> matching nothing, ala <b></b>

You might find that

    m{ > ( .+? ) < }xs

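works about as well (i.e., shortest match between two angle
brackets) and may be a bit faster; benchmark it if that matters.
Since .+? needs at least one character it never returns an empty
capture, and it can run across an adjacent tag (in "<b></b><i>"
it captures "</b>"), so [^<]+ is the safer choice there.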
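
A quick way to see how the two patterns behave is to run both over the
same input. This is only a sketch: first_capture() is a hypothetical
helper, not something from the thread, and the sample strings are made
up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: first capture of $re in $html, undef on no match.
sub first_capture {
    my ( $re, $html ) = @_;
    return $html =~ $re ? $1 : undef;
}

my $class = qr{ > ( [^<]* ) < }x;    # negated class, may capture ''
my $lazy  = qr{ > ( .+?   ) < }xs;   # lazy dot, needs at least one character

my @patterns = ( [ 'class', $class ], [ 'lazy', $lazy ] );

for my $html ( '<b>text</b>', '<b></b><i>x</i>' ) {
    for my $pair (@patterns) {
        my ( $name, $re ) = @$pair;
        my $got = first_capture( $re, $html );
        printf "%-5s %-16s %s\n", $name, $html,
            defined $got ? "'$got'" : 'no match';
    }
}
```

Both agree on '<b>text</b>'. On the second string the class version
returns the empty string, while the lazy version returns '</b>'.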

>
>>> I suppose />(\s*[^<\s][^<]*)</ if you want to extract something with a
>>> non-whitespace character.
>> I don't understand that regex.
>
> Break it down. It matches 0 or more spaces, then one character that's
> not a space or a <, then any number of additional characters that
> aren't <'s (but could be spaces). So that regex would not match <b>
> </b> or <b></b>, since there are no non-space characters in there.
> "<b>   string   </b>" would be matched by either regex.
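
The breakdown above can be written out with /x so that each piece
carries its own comment. This is a sketch only; the variable name
$non_blank is an assumption for illustration, and the three inputs are
the ones mentioned in the explanation.

```perl
use strict;
use warnings;

# Jim's pattern, spelled out with /x so each piece can carry a comment.
my $non_blank = qr{
    >           # end of the opening tag
    (
        \s*     # zero or more leading whitespace characters
        [^<\s]  # one character that is neither '<' nor whitespace
        [^<]*   # anything up to the next '<', spaces included
    )
    <           # start of the following tag
}x;

for my $html ( '<b>     </b>', '<b></b>', '<b>   string   </b>' ) {
    if ( $html =~ $non_blank ) {
        print "'$html' matched, captured '$1'\n";
    }
    else {
        print "'$html' did not match\n";
    }
}
```

Only the last input matches, and the capture keeps the whitespace
around "string".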


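Not a typo, though it looks like one:

    />(\s*[^<\s][^<]*)</
              ?
is right as written: [^<\s] is one character that is neither "<"
nor whitespace. The upper-case version

    />(\s*[^<\S][^<]*)</
              ?
accepts only whitespace at that position, so it would match
"<b>   </b>" and reject "<b>string</b>".

If the point is to drop the surrounding whitespace from the capture:

    m{ > \s* ( [^<\s] .*? ) \s* < }xs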
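
A sketch of the whitespace-trimming idea, collecting every non-blank
text node from a string. Assumptions: text_nodes() is a hypothetical
helper; the pattern uses /x so the spaces inside it are ignored, and it
starts the capture with [^<\s] so the first captured character can
never be a "<".

```perl
use strict;
use warnings;

# Hypothetical helper: every non-blank text node, with the surrounding
# whitespace left out of the capture.
sub text_nodes {
    my ($html) = @_;

    # In list context, /g returns the $1 of each successive match.
    my @found = $html =~ m{ > \s* ( [^<\s] .*? ) \s* < }xsg;
    return @found;
}

my @nodes = text_nodes('<p> one </p><b>   </b><i>two words</i>');
print join( '|', @nodes ), "\n";    # prints: one|two words
```

The lazy .*? stops at the first point where optional whitespace is
followed by "<", which is what keeps trailing spaces out of $1.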


-- 
Steven Lembark                                       85-09 90th Street
Workhorse Computing                                Woodhaven, NY 11421
lembark at wrkhors.com                                     1 888 359 3508

