[Chicago-talk] Help with Regex
Steven Lembark
lembark at wrkhors.com
Sun Apr 2 20:12:35 PDT 2023
On Sat, 24 Dec 2022 10:59:14 -0500
"Richard Reina" <richard at rushlogistics.com> wrote:
Been crazy for a while, haven't kept up on email.
Step 1: Break your parsing up into chunks: handle the most
specific portions of the input first and take the more
variable parts last.
You have a regex looking for:
<variable stuff> [, ]+ <always two letters> [, ]+ [\d-]
If the variable stuff were at the start then a regex would
make sense; when the variable stuff is at the end it's
less work to split it apart on separators, take the state &
zip as-is, and then re-combine whatever's left to get the
city:
    my ( $city, $state, $zip )
    = do
    {
        my @wordz = split /\s+/, $line;
        s{,$}{} for @wordz;

        # what's left are the city, split into words,
        # followed by a state and a zip.

        my $z = pop @wordz;
        my $s = pop @wordz;

        ( join( ' ' => @wordz ), $s, $z )
    };
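For anyone who wants to try the split approach, here is a minimal runnable sketch of it. The sub name split_address and the sample line are made up for illustration, and it assumes the fields are separated by whitespace (with optional trailing commas), not by commas alone. It uses split ' ' rather than split /\s+/ so that leading whitespace doesn't produce an empty first word.

```perl
use strict;
use warnings;

# Hypothetical helper, not from the original post.
sub split_address
{
    my ( $line ) = @_;

    # split ' ' skips leading whitespace; split /\s+/ would
    # leave an empty first field in that case.
    my @wordz = split ' ', $line;

    # drop a trailing comma from each word.
    s{,$}{} for @wordz;

    # the last two words are the zip and the state,
    # whatever remains is the city.
    my $zip   = pop @wordz;
    my $state = pop @wordz;

    ( join( ' ' => @wordz ), $state, $zip )
}

my ( $city, $state, $zip ) = split_address( 'Des Plaines, IL 60016-1234' );

print "$city|$state|$zip\n";
```

A line like "Chicago,IL,60601" with no spaces would need a different split pattern, say /[,\s]+/.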
That, or use separate regexen to snag the state and zip up
front, then use them as anchors for whatever's before them:
# make sure there isn't any extraneous cruft at the
# line end or extra/non-space separators.
$line =~ s{\s+ $}{}x;
$line =~ s{\s+}{ }g;
# at that point the state and zip are pretty specific:
my ( $state, $zip ) = $line =~ m{ (\w{2}) [, ]+ ([\d-]+) $}x;
my ( $city ) = $line =~ m{ (.+?) [, ]+ $state .+? $zip $}x;
or even:
    my ( $state, $zip )
    = $line
    =~ m
    {
        (\w{2})                 # two char state
        [, ]+                   # separators
        (\d{5} (?:-\d{4})?)     # zip, optional +4
        $
    }x;
my ( $city ) = $line =~ m{^ (.+) [, ]+ $state [, ]+ $zip $}x;
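Putting the anchor approach together, a runnable sketch might look like the following. The sub name parse_address is made up, and the state pattern is narrowed to two letters here as an assumption about the data. The \Q...\E quoting is a precaution so that nothing in the captured state or zip gets treated as regex syntax when it's interpolated.

```perl
use strict;
use warnings;

# Hypothetical helper, not from the original post.
sub parse_address
{
    my ( $line ) = @_;

    # strip trailing whitespace, squeeze the rest to single spaces.
    $line =~ s{\s+ $}{}x;
    $line =~ s{\s+}{ }g;

    # state and zip are the specific part: grab them first.
    my ( $state, $zip )
    = $line =~ m{ \b ([A-Za-z]{2}) [, ]+ ( \d{5} (?:-\d{4})? ) $}x
    or return;

    # then use them as anchors for the city.
    my ( $city )
    = $line =~ m{^ (.+?) [, ]+ \Q$state\E [, ]+ \Q$zip\E $}x
    or return;

    ( $city, $state, $zip )
}

my @found = parse_address( "Des   Plaines,  IL 60016-1234  \n" );

print join( '|' => @found ), "\n";
```

The sub returns an empty list when either match fails, so the caller can check for that before using the values.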
The last approach is stripping the state and zip off and taking
whatever's left as the city (probably faster):
my ( $state, $zip ) = $line =~ m{ (\w{2}) [, ]+ ([\d-]+) $}x; # as above
( my $city = $line ) # copy the line
=~ s{ [, ]+ $state [, ]+ $zip $}{}x; # strip the state & zip.
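If you are on Perl 5.14 or later, the copy-then-strip can also be written with the /r modifier, which returns the modified copy and leaves the original alone. A small sketch, with the sub name city_of and the sample values made up for illustration:

```perl
use strict;
use warnings;
use v5.14;      # s///r and say need 5.14+

# Hypothetical helper, not from the original post.
sub city_of
{
    my ( $line, $state, $zip ) = @_;

    # /r returns the stripped copy; $line itself is unchanged.
    # If nothing matches, the line comes back as-is.
    $line =~ s{ [, ]+ \Q$state\E [, ]+ \Q$zip\E $}{}xr
}

say city_of( 'Oak Park, IL 60302', 'IL', '60302' );
```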
--
Steven Lembark
Workhorse Computing
lembark at wrkhors.com
+1 888 359 3508