[Pdx-pm] regexp and semi-greedy match
Keith Lofstrom
keithl at kl-ic.com
Sun Oct 14 22:03:45 PDT 2007
As fly is to sledgehammer, my problem is to this group. However.
I work with a proprietary program that starts with a basename foo,
and makes lots of files that look like:
foo-1.raw, foo-2.raw, ... foo-9.raw, foo-10.raw, foo-11.raw
... foo-99.raw, foo-100.raw,
... et cetera. It appends an integer sequence count and .raw to the
basename. Another proprietary program takes these files and processes
them in alphabetical sort order:
foo-100.raw, foo-10.raw, foo-11.raw, ... , foo-1.raw, foo-20.raw, ...,
foo-2.raw, foo-30.raw, .... etc.
Which is wrong, and very ugly. I can't rewrite either program. Uglier.
However, I can rename the files: foo-1.raw --> foo-001.raw,
foo-10.raw --> foo-010.raw etc. So the trick is pattern matching the
number part and inserting some leading zeros. This is tricky because
some perverse user could use a base file name of, say, foo-121.raw,
resulting in output files named foo-121.raw-1.raw . I need to match
the second number 1, not the 121.
"Greedy" regexp is just a tiny bit too greedy. If I use a pattern match
like:
if( /(a-z0-9_.-)-(\d*)\.(raw)$/i ) { # this does NOT work
Then the first group grabs the whole filename, leaving nothing for
the other two groups to grab, and the match fails. I would like it
to produce $1=basename, $2=number to zero pad, $3=suffix.
What I ended up doing is cheesy:
if( /(\w*)-(\d*)\.(raw)/i ) { # this finds plausible target files
my $r = rindex( $_ , "-" );
my $f = substr( $_ , 0, $r );
my $nt = substr( $_ , $r );
$nt =~ m/(\d*)\.(raw)$/i ;
... and my match output variables are $f, $1, $2 instead of $1, $2, $3 .
It works, but it is barfogenic, and it might match on files it shouldn't.
Hence the question:
*** QUESTION: Is there a regexp that does a "successfully greedy" ***
*** match and splits out the match variables I want? ***
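
One possible shape for such a pattern, offered only as a sketch. It assumes
the parenthesized a-z0-9_.- in the failing pattern was meant to be a bracketed
character class, and it uses the (raw|out) suffixes described under "Petty
details". The helper name split_name is hypothetical. A greedy group does not
keep everything it first grabs: Perl backtracks, giving characters back until
the rest of the pattern also fits, so the split should land on the last
"-digits.suffix" in the name.

```perl
use strict;
use warnings;

# Hypothetical helper: split "base-N.suffix" into (base, N, suffix).
# Returns an empty list when the name does not fit the pattern.
sub split_name {
    my ($name) = @_;
    # The bracketed class is greedy, but the engine backtracks until
    # "-digits.suffix" also matches at the end of the string.
    if ( $name =~ /^([a-z0-9_.-]+)-(\d+)\.(raw|out)$/i ) {
        return ( $1, $2, $3 );
    }
    return;
}

for my $name ( 'foo-1.raw', 'foo-121.raw-1.raw', 'notes.txt' ) {
    my @parts = split_name($name);
    print "$name => ", ( @parts ? join( ' | ', @parts ) : 'no match' ), "\n";
}
```

Anchoring with ^ and $ and requiring at least one digit (\d+ rather than \d*)
is a design choice here, meant to keep the pattern from matching files it
shouldn't.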
Keith
Petty details:
My cheesy programs can be found at http://www.keithl.com/ndir (good) and
http://www.keithl.com/ndir1 (bad). Please do not read them if you are
pregnant or have a weak heart. I imagine Randal could do the same
job with a one-liner. The examples in the email above are somewhat
simplified. In real life there are multiple basenames, and two sets
of output files with -nnn.raw and -nnn.out appended, so (raw) becomes
(raw|out) . Also, I prepend only enough zeros to each sequence of
files in a directory to make them all come out right, so no monsters
like -000023.raw, unless there are actually files with number parts
that long. I do two passes, the first time to find the longest number
for each basename.
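
The two passes described above might look roughly like the following. This is
a sketch under stated assumptions, not the code in ndir: the helper name
padded_names is hypothetical, the .raw and .out files of one basename are
assumed to share a width, and the function only plans the renames rather than
touching any files.

```perl
use strict;
use warnings;

# Hypothetical helper: map each name that needs renaming to its
# zero-padded replacement. Names that are already wide enough are omitted.
sub padded_names {
    my @names = @_;
    my $re = qr/^(.+)-(\d+)\.(raw|out)$/i;

    # Pass 1: find the longest number part seen for each basename.
    my %width;
    for my $name (@names) {
        next unless $name =~ $re;
        my ( $base, $len ) = ( $1, length $2 );
        $width{$base} = $len
            if !exists $width{$base} || $len > $width{$base};
    }

    # Pass 2: pad every number to its basename's width.
    my %rename;
    for my $name (@names) {
        next unless $name =~ $re;
        my ( $base, $num, $suffix ) = ( $1, $2, $3 );
        my $new = sprintf '%s-%0*d.%s', $base, $width{$base}, $num, $suffix;
        $rename{$name} = $new if $new ne $name;
    }
    return %rename;
}

my %plan = padded_names(
    qw(foo-1.raw foo-10.raw foo-100.raw bar-2.out bar-7.out) );
print "$_ --> $plan{$_}\n" for sort keys %plan;
```

Because the width comes from the data, a sequence that never gets past 9
is left alone, and nothing is padded further than its longest sibling.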
--
Keith Lofstrom keithl at keithl.com Voice (503)-520-1993
KLIC --- Keith Lofstrom Integrated Circuits --- "Your Ideas in Silicon"
Design Contracting in Bipolar and CMOS - Analog, Digital, and Scan ICs