LPM: regex snippets
Frank Price
fprice at mis.net
Thu Nov 11 01:18:43 CST 1999
Hi lexpm! I have been dealing with some thorny regexes (thorny for
me, that is) and thought I'd share some. Sorry this is kind of long.
Interested in all comments ...
First some background: this script is a crontab filter; it takes a
crontab entry (crontab is the *nix facility for automatic job
scheduling) and presents it in a more human readable format. A
typical entry looks like this:
1,11,21,31,41,51 1-5,17-23 * * 2 /usr/local/bin/blah
Which means "run /usr/local/bin/blah every Tuesday at 1, 11, 21, 31, 41,
and 51 minutes past each hour from 1 through 5 am and from 5 through 11 pm".
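To make the field layout concrete, here is a minimal sketch of pulling such an entry apart. The sub name parse_cron_entry and the hash keys are made up for illustration, and it assumes a plain entry line with no comments or environment settings.

```perl
use strict;
use warnings;

# Hypothetical helper: split one crontab line into its five time fields
# plus the command. Assumes a plain entry (no comments, no env settings).
sub parse_cron_entry {
    my ($entry) = @_;
    # A pattern of ' ' splits on runs of whitespace; the limit of 6 keeps
    # any spaces inside the command itself intact.
    my ($min, $hour, $mday, $month, $wday, $command) = split ' ', $entry, 6;
    return (
        minute  => $min,
        hour    => $hour,
        mday    => $mday,
        month   => $month,
        wday    => $wday,
        command => $command,
    );
}

my %field = parse_cron_entry('1,11,21,31,41,51 1-5,17-23 * * 2 /usr/local/bin/blah');
print "$_: $field{$_}\n" for qw(minute hour mday month wday command);
```

The minute and hour fields are the ones that need the comma and range handling described in the tasks that follow.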
Task 1) Take a string, which may contain commas and also ranges, and
return a list of all the numbers. Ex: for "1-5,9,12" it
should yield (1,2,3,4,5,9,12).
Code 1)
@list = splitcommas($entry);
sub splitcommas {
    my ($string) = @_;
    if ( $string =~ s/(\d+)(-)(\d+)/join(',', ($1 .. $3))/e ) {
        splitcommas($string);
    } elsif ( $string =~ /,/ ) { # if at least one comma, split on it
        split(',', $string);
    } else { ($string); } # handles no commas or dash case; i.e. single num
}
Comment 1)
The main work is done with the substitute in the "if". This pattern
says "if you find two sets of integers joined by a dash, replace it
with the inclusive range of those numbers joined by commas". Then we
recurse on this function to split the commas. The cool thing about using
the /e modifier is that the right hand side can be a perl
expression. This let me just replace the ranges by a comma
separated string, and it kept the entries in order. Took a while to
get this one :-)
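For comparison, here is a rough non-recursive sketch of the same task: split on commas first, then expand any piece that looks like a range. The name expand_field is hypothetical, and this is just one alternative, not a claim that it beats the s///e version.

```perl
use strict;
use warnings;

# Hypothetical non-recursive variant: split on commas first, then
# expand each "N-M" piece with the range operator.
sub expand_field {
    my ($string) = @_;
    my @result;
    for my $part (split /,/, $string) {
        if ($part =~ /^(\d+)-(\d+)$/) {
            push @result, $1 .. $2;    # inclusive range, e.g. 1-5 gives 1,2,3,4,5
        } else {
            push @result, $part;       # plain single number
        }
    }
    return @result;
}

print join(',', expand_field('1-5,9,12')), "\n";    # prints 1,2,3,4,5,9,12
```

Since each piece is handled where it sits, the entries stay in order here too.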
Task 2) Pad all single digits in comma separated string with leading
zero. So "1,3,5,10,12" yields "01,03,05,10,12".
Code 2)
$string =~ s/(,|^)(\d)(?=(?:,|$))/${1}0$2/g;
Comment 2)
Still not sure /exactly/ what's going on here! The lhs says "match
either a comma or start-of-line; then a single digit; then just look
ahead to see if the next character is either a comma or
end-of-line." Parens around the first two make it remember the
match. Then the rhs says "replace that with (comma or start-line)
followed by 0 followed by the digit." The key (I think) is that the
look ahead is what they call zero-width, so it doesn't consume the
trailing comma. That matters with /g: the comma after one digit is
still there to be matched as the leading comma of the next one. It's
also why I don't have to put the trailing comma/end-string back in.
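A small experiment shows what the look ahead buys. This sketch runs the pattern from Code 2 next to a variant that consumes the trailing comma and puts it back as $3. The sub names pad_lookahead and pad_consuming are made up for the comparison.

```perl
use strict;
use warnings;

# The pattern from Code 2: the trailing comma/end is only peeked at.
sub pad_lookahead {
    my ($s) = @_;
    $s =~ s/(,|^)(\d)(?=(?:,|$))/${1}0$2/g;
    return $s;
}

# Variant for comparison: the trailing comma/end is consumed and
# written back as $3, so the next match cannot start on that comma.
sub pad_consuming {
    my ($s) = @_;
    $s =~ s/(,|^)(\d)(,|$)/${1}0$2$3/g;
    return $s;
}

print pad_lookahead('1,3,5,10,12'), "\n";    # prints 01,03,05,10,12
print pad_consuming('1,3,5,10,12'), "\n";    # prints 01,3,05,10,12
```

In the consuming variant the comma after "1" is used up by the first match, so the "3" no longer has a leading comma to match against and gets skipped.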
Another easier way would be to split on commas, pad each number with
s/^(\d)$/0$1/, and then join again with commas. TMTOWTDI...
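Spelled out, the split/pad/join route might look like the sketch below. The sub name pad_fields is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: split on commas, pad lone digits, join again.
sub pad_fields {
    my ($string) = @_;
    my @nums = split /,/, $string;
    s/^(\d)$/0$1/ for @nums;    # only single-digit entries get a leading zero
    return join ',', @nums;
}

print pad_fields('1,3,5,10,12'), "\n";    # prints 01,03,05,10,12
```

For purely numeric entries, sprintf('%02d', $_) inside a map would do the same padding.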
Task 3) Take a list of numbers and change each to the correct ordinal
representation. Ex. (1,11,21) yields (1st, 11th, 21st).
Code 3)
foreach $day (@days) {
if ( $day =~ /^1$/ || $day =~ /[^1]1$/ ) { $day .= "st" }
elsif ( $day =~ /^2$/ || $day =~ /[^1]2$/ ) { $day .= "nd" }
elsif ( $day =~ /^3$/ || $day =~ /[^1]3$/ ) { $day .= "rd" }
else { $day .= "th" }
}
Comment 3)
This is the one I mentioned at the meeting. Someone suggested a
hash and that's a good solution in this case; but maybe not if the
range gets bigger than 30 numbers! I would have liked to put the
match into one regex but thought it might slow it down. Here the
logic is "if the number is a 1, or ends in a non-1 and then a 1,
add an "st" to it." So on for 2 and 3; everything else falls thru
to the "th" case. The order of the two disjuncts doesn't actually
matter, since they can never both match the same number.
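As for putting the match into one regex, here is one possible sketch using a negative look-behind. The sub name ordinal is made up, and I haven't timed it against the if/elsif chain, so treat it as an illustration rather than a recommendation.

```perl
use strict;
use warnings;

# Hypothetical single-regex version: a final 1, 2 or 3 that is NOT
# preceded by a 1 picks st/nd/rd; everything else (including 11, 12,
# 13) gets "th".
sub ordinal {
    my ($n) = @_;
    my @suffix = qw(th st nd rd);
    return $n =~ /(?<!1)([123])$/ ? $n . $suffix[$1] : $n . 'th';
}

print join(', ', map { ordinal($_) } (1, 11, 21)), "\n";    # prints 1st, 11th, 21st
```

The (?<!1) look-behind is zero-width like the look ahead in Code 2, and it also succeeds when there is no preceding character at all, which covers the single-digit case.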
Thanks for listening, and please tell me if you see better ways to do
any of this!
-Frank.
--
Frank Price
fprice at mis.net
sub splitcommas {
    # take string with possible commas and ranges (-)
    # returns a list consisting of the elements
    # E.g., hours entry = "0-6,9,12,15,17,19-23"
    # gives (0,1,2,3,4,5,6,9,12,15,17,19,20,21,22,23);
    my ($string) = @_;
    # This pattern says "if you find two sets of integers joined by a
    # dash, replace it with the inclusive range of those numbers joined
    # by commas". Then we recurse on this function to split the commas
    if ( $string =~ s/(\d+)(-)(\d+)/join(',', ($1 .. $3))/e ) {
        splitcommas($string);
    } elsif ( $string =~ /,/ ) { # if at least one comma, split on it
        split(',', $string);
    } else { ($string); } # handles no commas or dash case; i.e. single num
}