[Pdx-pm] regexp and semi-greedy match

Eric Wilhelm scratchcomputing at gmail.com
Sun Oct 14 23:48:55 PDT 2007


# from Keith Lofstrom
# on Sunday 14 October 2007 22:03:

>"Greedy" regexp is just a tiny bit too greedy.  If I use a pattern
> match like:
>
>   if( /(a-z0-9_.-)-(\d*)\.(raw)$/i ) {     # this does NOT work

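I don't think it is a greedy bug.  The first group is literal:  
parentheses only group and capture, so (a-z0-9_.-) wants the text 
"a-z0-9_", then any one character, then a hyphen.  Are you trying for a 
character class (which needs square brackets), and if so, why?  
Something simpler should do: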

   m/^(.*)-(\d+)\.raw$/
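
A quick sketch of the difference, assuming a made-up filename (the name 
below is invented for illustration and is not from the original 
message).  It tries the parenthesized version, a bracketed character 
class, and the simpler .* pattern against the same string:

```perl
use strict;
use warnings;

# Hypothetical filename, invented for illustration.
my $name = 'my_file.v2-0042.raw';

# Parentheses group; they do not make a character class.  This wants
# the literal text "a-z0-9_", then any one character, then "-".
my $literal = $name =~ /(a-z0-9_.-)-(\d*)\.(raw)$/i ? 1 : 0;

# Square brackets make a class; the + lets it repeat.
my @class = $name =~ /^([a-z0-9_.-]+)-(\d+)\.raw$/i;

# The simpler pattern suggested above.
my @simple = $name =~ /^(.*)-(\d+)\.raw$/;

print "literal group matched: $literal\n";
print "class:  @class\n";
print "simple: @simple\n";
```

With that name, the parenthesized version does not match at all, and 
the other two both capture "my_file.v2" and "0042".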

I'm also not sure about the capturing on "raw", which is a constant.  
(Perhaps it is going to change and you want to capture anything which 
is not-a-dot until the end:  qr/-(\d+)\.([^.]+)$/ .)
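
As a rough sketch of that not-a-dot variant, assuming some made-up 
filenames (none of them are from the original message), the qr// 
pattern can be compiled once and reused:

```perl
use strict;
use warnings;

# Captures the number and whatever follows the last dot.
my $tail = qr/-(\d+)\.([^.]+)$/;

# Hypothetical filenames, invented for illustration.
my @report;
for my $name ('scan-0042.raw', 'scan-0042.tif', 'scan-0042.tar.gz') {
  if(my ($num, $ext) = $name =~ $tail) {
    push(@report, "$name: num=$num ext=$ext");
  }
  else {
    push(@report, "$name: no match");
  }
}
print "$_\n" for @report;
```

The last name does not match, because [^.]+ cannot cross the second dot 
and still reach the end of the string.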

Another trick in situations like this is to not bother capturing if you 
happen to have a disposable copy of the scalar.  Just whack the 
interesting and/or messy bits off of the end.

  my $num;
  if($scalar =~ s/-(\d+)\.raw$//) {
    $num = $1;
  }
  else {
    die "didn't expect that input"; # or you could next
  }

  # $scalar is now just the base bit
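
The same trick can be wrapped up for use in a loop.  This is only a 
sketch:  split_name is a hypothetical helper and the filenames are made 
up.  Copying the argument into a fresh lexical gives the "disposable 
copy" for free.

```perl
use strict;
use warnings;

# Hypothetical helper: works on a copy of the name, so the caller's
# scalar is left alone.  Returns (base, number), or an empty list.
sub split_name {
  my ($scalar) = @_;
  $scalar =~ s/-(\d+)\.raw$// or return;
  return($scalar, $1);
}

my @parsed;
for my $name ('a-b-0007.raw', 'notes.txt') {
  # Skip anything that does not end in -<digits>.raw
  my ($base, $num) = split_name($name) or next;
  push(@parsed, "$base/$num");
}
print "$_\n" for @parsed;
```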

Another note:  "greedy" typically causes unexpected captures, not failed 
matches.  The greed comes into play when multiple .* (or similar) could 
divide the string in more than one way.  The regexp engine resolves the 
ambiguity by stuffing as much as possible into the first submatch (but 
curbs its gluttony short of invalidating the entire match).
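
A small sketch of that ambiguity, assuming a made-up name with two 
hyphens.  The non-greedy .*? form is shown only for contrast and is not 
part of the patterns discussed above:

```perl
use strict;
use warnings;

# Hypothetical name with two hyphens, so the split point is ambiguous.
my $name = 'foo-bar-12.raw';

# Two greedy groups: the first takes as much as it can.
my @greedy = $name =~ /^(.*)-(.*)\.raw$/;

# Non-greedy first group: it takes as little as it can.
my @lazy = $name =~ /^(.*?)-(.*)\.raw$/;

print "greedy: $greedy[0] | $greedy[1]\n";
print "lazy:   $lazy[0] | $lazy[1]\n";
```

Both patterns match.  Only the captures differ:  the greedy version 
splits at the last hyphen ("foo-bar" and "12"), the non-greedy one at 
the first ("foo" and "bar-12").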

In this case the \d* could cause the match to hit on a malformed 
string (and your $1 would get everything up to the last hyphen, with an 
empty $2).  The \d+ and the \.raw$ anchor things though (and cause the 
whole match to fail if something went awry).
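
To make the \d* versus \d+ point concrete, here is a sketch against a 
made-up malformed name (invented for illustration) where the number is 
missing:

```perl
use strict;
use warnings;

# Hypothetical malformed name: the number is missing.
my $bad = 'foo-bar-.raw';

# \d* is happy with zero digits, so this still matches.
my @star = $bad =~ /^(.*)-(\d*)\.raw$/;

# \d+ insists on at least one digit, so this fails outright.
my $plus = $bad =~ /^(.*)-(\d+)\.raw$/ ? 1 : 0;

print "star: base='$star[0]' num='$star[1]'\n";
print "plus matched: $plus\n";
```

The \d* version quietly succeeds with an empty number, while the \d+ 
version refuses the string, which is usually what you want.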

--Eric
-- 
"Time flies like an arrow, but fruit flies like a banana."
--Groucho Marx
---------------------------------------------------
    http://scratchcomputing.com
---------------------------------------------------

