Regex stumper?

Daniel Chetlin danchetlin at
Thu Jul 27 20:30:37 CDT 2000

On Thu, Jul 27, 2000 at 04:24:27PM -0700, Ben Marcotte wrote:
> I learn something new every day.  Apparently, somewhere between the
> publishing of the 2nd Ed Camel book and a recent version of perl (5.005_03),
> a handy regex piece was added: the negative lookbehind assertion,
> (?<!pattern).  It can be used to test whether a given pattern (of fixed
> length) can _not_ be found before some other piece of regex.  Thus we can
> check for the lack of a mailto: before an address.  However, you might need
> a recent version of perl to do this.  Does anyone out there actually know
> which version of perl first supported this feature?

It appeared in 5.005 (perldoc perl5005delta). One of Ilya's ingenious

> s#(?<!mailto:)\b([\w\-]+\@[\w\-]+\.[\w\-]+)\b#<a href="mailto:$1">$1</a>#g;

This is very nice; I hadn't considered using the 'mailto'.
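
For anyone who wants to watch the lookbehind at work, here is a minimal sketch
that wraps Ben's substitution in a sub. The name link_unless_mailto and the
sample strings are made up for illustration, and it needs 5.005 or later.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): link bare addresses, but skip
# any address that is immediately preceded by "mailto:".
sub link_unless_mailto {
    my ($text) = @_;
    # (?<!mailto:) is the fixed-length negative lookbehind; the \b after it
    # stops the match from starting in the middle of an address.
    $text =~ s#(?<!mailto:)\b([\w\-]+\@[\w\-]+\.[\w\-]+)\b#<a href="mailto:$1">$1</a>#g;
    return $text;
}

# A bare address gets wrapped in an anchor.
print link_unless_mailto('Write to ben@example.com today'), "\n";

# An address already sitting behind "mailto:" is left alone.
print link_unless_mailto('<a href="mailto:ben@example.com">Ben</a>'), "\n";
```

The \b matters more than it looks: without it the engine could skip the first
letter of the address and match the rest, since "ailto:b" is not "mailto:".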

My solution takes a slightly different tack; it assumes that you want to make
the substitution on any address outside of an HTML tag, and on none inside of
one. This is not necessarily the best assumption to make, but since we're
already making some shaky assumptions about the way email addresses and HTML
are formatted (see Jeff Friedl's "Mastering Regular Expressions" if you really
want to match an email address), I think it's probably OK. And it doesn't use
lookbehind, so it's usable on older Perls.

s{
  \b([-\w]+@[-\w]+\.[-\w]+)\b #The address part, a little cleaner
  (?!                         #Negative lookahead: not followed by
    [^<>]*                    #Any number of non angle brackets
    >                         #and then a right angle bracket
  )                           #In other words, we're not inside an HTML tag
}{<a href="mailto:$1">$1</a>}gx;

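This one runs into problems if you have badly formatted HTML, of course. A
stray unescaped '>' in the text after an address (with no '<' in between), for
example, makes the address look like it's inside a tag, so it gets skipped.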
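
Here is a rough sketch of the lookahead version in action, including a
badly-formed case. The sub name link_outside_tags and the test strings are my
own inventions, and I've escaped the @ purely out of caution.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): link addresses that do not
# appear to be inside an HTML tag.
sub link_outside_tags {
    my ($text) = @_;
    $text =~ s{
        \b([-\w]+\@[-\w]+\.[-\w]+)\b   # the address part
        (?! [^<>]* > )                 # negative lookahead for a closing bracket
    }{<a href="mailto:$1">$1</a>}gx;
    return $text;
}

# The address in running text is linked; the one inside the img tag is not.
print link_outside_tags('Mail dan@example.com or see <img alt="dan@example.com">'), "\n";

# Badly formed input: the bare > makes the address look like tag content,
# so it is left unlinked.
print link_outside_tags('dan@example.com says 2 > 1'), "\n";
```

The lookahead only asks whether a '>' turns up before the next '<', so it
knows nothing about real tag structure; that is where the shaky assumption
lives.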

So why use it? If you're using a Perl below 5.005 (but at least 5.000), if you
want to skip addresses embedded in any tag, not just in mailto anchors, or if
you're playing Perl golf (the change from lookbehind to lookahead saves one
character, and the other minor changes I made save a few more -- obviously
after you cut out the /x stuff) ;-).

Dunno if this is helpful, but it was fun.


More information about the Pdx-pm-list mailing list