[Pdx-pm] regexp and semi-greedy match

Keith Lofstrom keithl at kl-ic.com
Sun Oct 14 22:03:45 PDT 2007

As fly is to sledgehammer, my problem is to this group.  However.

I work with a proprietary program that starts with a basename foo,
and makes lots of files that look like:

foo-1.raw, foo-2.raw, ...  foo-9.raw, foo-10.raw, foo-11.raw
...  foo-99.raw, foo-100.raw, 

... et cetera.  It appends an integer sequence count and .raw to the
basename.  Another proprietary program takes these files and processes
them in alphabetical sort order:

foo-100.raw, foo-10.raw, foo-11.raw, ... , foo-1.raw, foo-20.raw, ... ,
foo-2.raw, foo-30.raw, ... etc.

Which is wrong, and very ugly.  I can't rewrite either program.  Uglier. 

However, I can rename the files:  foo-1.raw --> foo-001.raw,
foo-10.raw --> foo-010.raw  etc.  So the trick is pattern matching the
number part and inserting some leading zeros.  This is tricky because
some perverse user could use a base file name of, say, foo-121.raw,
resulting in output files named foo-121.raw-1.raw .  I need to match
the second number 1, not the 121.

"Greedy" regexp is just a tiny bit too greedy.  If I use a pattern match

   if( /(a-z0-9_.-)-(\d*)\.(raw)$/i ) {     # this does NOT work

Then the first group grabs the whole filename, leaving nothing for
the other two groups to grab, and the match fails.  I would like it
to produce $1=basename, $2=number to zero pad, $3=suffix.
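
For comparison, here is a minimal sketch of that split, assuming the
first group was meant to be a bracketed character class with a
quantifier.  Perl's regex engine backtracks, so a greedy first group
should hand characters back until the rest of the pattern can match at
the end of the string.  The helper name split_name and the sample
filenames are hypothetical, and \d+ is used instead of \d* so that a
name with no digits is rejected.

```perl
use strict;
use warnings;

# Hypothetical helper: split "base-N.raw" into (base, N, suffix).
# Returns an empty list when the name does not fit the pattern.
sub split_name {
    my ($name) = @_;
    # The greedy first group initially takes as much as it can, then
    # backtracks until "-digits.raw" matches at the end of the string,
    # so only the LAST "-number.raw" is split off.
    my @parts = $name =~ /^([a-z0-9_.-]+)-(\d+)\.(raw)$/i;
    return @parts;
}

for my $name (qw(foo-1.raw foo-121.raw-1.raw foo.raw)) {
    my @parts = split_name($name);
    if (@parts) {
        my ($base, $num, $suffix) = @parts;
        print "$name: base=$base num=$num suffix=$suffix\n";
    }
    else {
        print "$name: no match\n";
    }
}
```

With the $ anchor in place, foo-121.raw-1.raw should come apart as
basename foo-121.raw, number 1, suffix raw.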

What I ended up doing is cheesy:

   if( /(\w*)-(\d*)\.(raw)/i ) {    # this finds plausible target files
      my $r  = rindex( $_ , "-" );
      my $f  = substr( $_ , 0, $r );
      my $nt = substr( $_ , $r );
      $nt    =~ m/(\d*)\.(raw)$/i ;

... and my match output variables are $f, $1, $2 instead of $1, $2, $3 .
It works, but it is barfogenic, and it might match on files it shouldn't.
Hence the question:

*** QUESTION:  Is there a regexp that does a "successfully greedy" ***
*** match and splits out the match variables I want?               ***


Petty details:

My cheesy programs can be found at http://www.keithl.com/ndir (good) and
http://www.keithl.com/ndir1 (bad).  Please do not read them if you are
pregnant or have a weak heart.   I imagine Randal could do the same
job with a one-liner.  The examples in the email above are somewhat
simplified.  In real life there are multiple basenames, and two sets
of output files with -nnn.raw and -nnn.out appended, so (raw) becomes
(raw|out) .  Also, I prepend only enough zeros to each sequence of
files in a directory to make them all come out right, so no monsters
like -000023.raw, unless there are actually files with number parts
that long.  I do two passes, the first time to find the longest number
for each basename.
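
The two passes described above could be sketched roughly as follows.
This is an illustration under assumptions, not the ndir script itself:
the helper name padded_names, the pattern, and the sample filenames are
all hypothetical, and the sketch only computes the new names rather
than renaming anything.

```perl
use strict;
use warnings;

# Hypothetical helper: map each matching old name to its zero-padded
# new name.  Names that do not look like base-N.raw or base-N.out are
# left out of the result.
sub padded_names {
    my @files = @_;
    my $re = qr/^([a-z0-9_.-]+)-(\d+)\.(raw|out)$/i;

    # Pass 1: find the longest number part seen for each basename.
    my %width;
    for my $file (@files) {
        next unless $file =~ $re;
        my ($base, $num) = ($1, $2);
        my $len = length($num);
        $width{$base} = $len
            if !defined $width{$base} || $len > $width{$base};
    }

    # Pass 2: prepend just enough zeros to reach that width.
    my %new;
    for my $file (@files) {
        next unless $file =~ $re;
        my ($base, $num, $suffix) = ($1, $2, $3);
        my $padded = ('0' x ($width{$base} - length($num))) . $num;
        $new{$file} = "$base-$padded.$suffix";
    }
    return %new;
}

my %new = padded_names(
    qw(foo-1.raw foo-10.raw foo-100.raw bar-2.out bar-7.out notes.txt)
);
for my $old (sort keys %new) {
    # A real script would call rename($old, $new{$old}) here,
    # after checking that the two names differ.
    print "$old -> $new{$old}\n";
}
```

Keying the width on the basename alone means the .raw and .out files
of one basename get the same padding, which is one possible reading of
"the longest number for each basename".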

Keith Lofstrom          keithl at keithl.com         Voice (503)-520-1993
KLIC --- Keith Lofstrom Integrated Circuits --- "Your Ideas in Silicon"
Design Contracting in Bipolar and CMOS - Analog, Digital, and Scan ICs
