[Chicago-talk] parsing HTML
Steven Lembark
lembark at wrkhors.com
Sun Mar 4 12:49:56 PST 2007
-- Jim Thomason <jim at jimandkoka.com>
>> how come />(\s*[^<]*)</ doesn't work?
>
> Define "doesn't work". That regex will extract the text from between
> angle brackets (i.e., stuff outside of HTML tags) (assuming valid
> html, blah blah blah), but it also extracts blank space. Such as this:
>
> <b> </b>
>
> \s* would match all the spaces, then [^<]* would match nothing. It's a
> fine regex if you don't mind extracting out empty snippets. But, in
> that case, />([^<]*)</ will work just as well, since the spaces will
> just be captured by the [^<]*. Note that by using * you allow for
> matching nothing, as in <b></b>.
You might find that
m{ > ( .+? ) < }xs
works equally well (i.e., shortest match between two angle
brackets) but is a bit faster.
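
For anyone who wants to try the patterns discussed so far side by side,
here is a minimal sketch. The sample string and the variable names are
made up for illustration, and it only compares what gets captured, not
speed.

```perl
use strict;
use warnings;

# Hypothetical sample: one blank element, one empty element, one with text.
my $html = '<b> </b><i></i><u> string </u>';

# Original pattern: optional whitespace, then anything up to the next '<'.
my @with_ws = $html =~ />(\s*[^<]*)</g;

# Simplified pattern: [^<]* already covers the whitespace.
my @plain = $html =~ />([^<]*)</g;

# Non-greedy variant: .+? needs at least one character after the '>'.
my @lazy = $html =~ m{ > ( .+? ) < }xsg;

for my $result ( [ with_ws => \@with_ws ], [ plain => \@plain ], [ lazy => \@lazy ] ) {
    my ( $name, $found ) = @$result;
    # Bracket each capture so blank and empty strings are visible.
    print "$name: ", join( ', ', map { "[$_]" } @$found ), "\n";
}
```

On this input the first two patterns return the same list, empty strings
included. The non-greedy one can pick up tag text such as "<i>" when two
tags are adjacent, because .+? has to consume at least one character.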
>
>>> I suppose />(\s*[^<\s][^<]*)</ if you want to extract something with a
>>> non-whitespace character.
>> I don't understand that regex.
>
> Break it down. It matches 0 or more spaces, then one character that's
> not a space or a <, then any number of additional characters that
> aren't <'s (but could be spaces). So that regex would not match <b>
> </b> or <b></b>, since there are no non-space characters in there.
> "<b> string </b>" would be matched by either regex.
Possible typo:
/>(\s*[^<\s][^<]*)</
?
should have been
/>(\s*[^<\S][^<]*)</
?
or
m{ > \s* ( \S .*? ) \s* < }xs
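
A quick way to see what each character class does is to run both
spellings against the three strings from the quoted explanation. This is
only a sketch: the %seen hash and the loop are mine, and the third pattern
is written here with the /x and /s flags and .*? (my assumption) so the
spaces in the pattern are ignored and a one-character string still matches.

```perl
use strict;
use warnings;

# Test strings taken from the discussion above.
my @samples = ( '<b> </b>', '<b></b>', '<b> string </b>' );
my %seen;

for my $html (@samples) {
    # [^<\s]: one character that is neither '<' nor whitespace.
    my ($lower) = $html =~ />(\s*[^<\s][^<]*)</;

    # [^<\S]: one character that is neither '<' nor non-whitespace,
    # which leaves only whitespace characters.
    my ($upper) = $html =~ />(\s*[^<\S][^<]*)</;

    # Free-spacing form that keeps surrounding whitespace out of the capture.
    my ($trimmed) = $html =~ m{ > \s* ( \S .*? ) \s* < }xs;

    $seen{$html} = [ $lower, $upper, $trimmed ];

    # Show undef explicitly for patterns that did not match.
    printf "%-18s lower=%-10s upper=%-10s trimmed=%s\n",
        "'$html'", map { defined $_ ? "'$_'" : 'undef' } $lower, $upper, $trimmed;
}
```

With the lowercase \s the blank and empty elements are skipped, as the
quoted breakdown describes. With the uppercase \S the class can only match
whitespace, so "<b> </b>" is captured as a single space.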
--
Steven Lembark 85-09 90th Street
Workhorse Computing Woodhaven, NY 11421
lembark at wrkhors.com 1 888 359 3508