[Chicago-talk] Help with Regex

Steven Lembark lembark at wrkhors.com
Sun Apr 2 20:12:35 PDT 2023


On Sat, 24 Dec 2022 10:59:14 -0500
"Richard Reina" <richard at rushlogistics.com> wrote:

Been crazy for a while, haven't kept up on email. 

Step 1: Break your parsing up into chunks for the most-
specific portions of the input and take the more variable
parts last.

You have a regex looking for:

    <variable stuff> [, ]+ <always two letters> [, ]+ [\d-]+

If the variable stuff were at the start then a regex would
make sense; when the variable stuff is at the end it's
less work to split it apart on separators, take the state &
zip as-is, and then re-combine whatever's left to get the
city:

    my ( $city, $state, $zip )
    = do
    {
        my @wordz   = split /\s+/, $line;
        s{,$}{} for @wordz;
        
        # what's left are the city, split into words,
        # followed by a state and a zip.

        my $z   = pop @wordz;
        my $s   = pop @wordz;

        ( join( ' ' => @wordz ), $s, $z )
    };
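
For anyone who wants to try the split-and-pop idea on its own, here
is a minimal sketch wrapped in a sub. The name split_address() and
the sample lines are made up for illustration, and it assumes the
state and zip are always the last two whitespace-separated words:

```perl
use strict;
use warnings;

# split_address() is a hypothetical name; the sample lines below are
# invented, not taken from the original question.
sub split_address
{
    my ( $line ) = @_;

    # split ' ' (rather than /\s+/) also skips leading whitespace.
    my @wordz = split ' ', $line;
    s{,$}{} for @wordz;                 # drop trailing commas

    # need at least one city word plus a state and a zip.
    @wordz >= 3 or return;

    my $zip   = pop @wordz;
    my $state = pop @wordz;

    ( join( ' ' => @wordz ), $state, $zip )
}

for my $line ( 'Chicago, IL 60601', 'New York, NY 10001-1234' )
{
    my ( $city, $state, $zip ) = split_address( $line )
    or next;

    print "$city | $state | $zip\n";
}
```

This only works if there is whitespace after the comma; input like
"Chicago,IL 60601" would need the separators handled another way.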

That, or use separate regexen to snag the state and zip up
front, then use them as anchors for whatever's before them:

    # make sure there isn't any extraneous cruft at the
    # line end or extra/non-space separators.

    $line   =~ s{\s+ $}{}x;
    $line   =~ s{ \s+ }{ }xg;

    # at that point the state and zip are pretty specific:

    my ( $state, $zip   ) = $line =~ m{ (\w{2}) [, ]+ ([\d-]+) $}x;
    my ( $city          ) = $line =~ m{ (.+?) [, ]+ $state .+? $zip $}x;

or even:

    my ( $state, $zip   ) 
    = $line 
    =~ m
    {
        (\w{2})                 # two char state
        [, ]+                   # separators
        (\d{5} (?:-\d{4})?) $   # zip with optional +4
    }x;

    my ( $city ) = $line =~ m{^ (.+?) [, ]+ $state [, ]+ $zip $}x;
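
Put together as something runnable, the anchor approach might look
like the sketch below. The sub name parse_city_state_zip() and the
sample line are made up, the zip pattern assumes US 5-digit or ZIP+4
codes, and the \Q...\E quoting is an extra precaution rather than
something the snippets above require:

```perl
use strict;
use warnings;

# parse_city_state_zip() is a hypothetical name; the zip pattern
# assumes US 5-digit or ZIP+4 codes.
sub parse_city_state_zip
{
    my ( $line ) = @_;

    # squeeze whitespace so that single spaces are the only padding.
    $line =~ s{^ \s+ }{}x;
    $line =~ s{ \s+ $}{}x;
    $line =~ s{ \s+ }{ }xg;

    my ( $state, $zip )
    = $line
    =~ m
    {
        \b ([A-Za-z]{2})            # two-letter state
        [, ]+                       # separators
        ( \d{5} (?: -\d{4} )? ) $   # zip, optional +4
    }x
    or return;

    # \Q...\E quotes the interpolated values so they match literally.
    my ( $city )
    = $line =~ m{^ (.+?) [, ]+ \Q$state\E [, ]+ \Q$zip\E $}x
    or return;

    ( $city, $state, $zip )
}

my @found = parse_city_state_zip( "  St. Louis,  MO   63101-1234  " );

print @found ? join( ' | ' => @found ) : 'no match', "\n";
```

The non-greedy (.+?) matters here: a greedy (.+) would keep the
comma after the city whenever the separator is a comma plus a space.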

The last approach is to strip the state and zip off and take
whatever's left as the city (probably faster):

    my ( $state, $zip )                     # same capture as above
    = $line =~ m{ (\w{2}) [, ]+ (\d{5} (?:-\d{4})?) $}x;

    ( my $city = $line )                    # copy the line
    =~ s{ [, ]+ $state [, ]+ $zip $}{}x;    # strip the state & zip.
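
Whether stripping really is faster depends on the input and the perl
in use, so it is worth measuring. A rough sketch using the core
Benchmark module follows; the sub names and the sample line are made
up, and no particular result is implied:

```perl
use strict;
use warnings;
use Benchmark qw( cmpthese );

# Made-up sample line; both sub names are hypothetical.
my $line = 'Salt Lake City, UT 84101-0001';

# shared state & zip capture, assuming US 5-digit or ZIP+4 codes.
my $state_zip = qr{ \b ([A-Za-z]{2}) [, ]+ ( \d{5} (?: -\d{4} )? ) $}x;

# city via a second anchored match.
sub city_by_match
{
    my ( $line ) = @_;
    my ( $state, $zip ) = $line =~ $state_zip or return;
    my ( $city ) = $line =~ m{^ (.+?) [, ]+ \Q$state\E [, ]+ \Q$zip\E $}x;

    ( $city, $state, $zip )
}

# city via stripping the state & zip off a copy of the line.
sub city_by_strip
{
    my ( $line ) = @_;
    my ( $state, $zip ) = $line =~ $state_zip or return;

    ( my $city = $line ) =~ s{ [, ]+ \Q$state\E [, ]+ \Q$zip\E $}{}x;

    ( $city, $state, $zip )
}

# run each variant for at least one CPU second and print a comparison.
cmpthese
(
    -1,
    {
        match => sub { my @r = city_by_match( $line ) },
        strip => sub { my @r = city_by_strip( $line ) },
    }
);
```

Both subs should return the same three values for the same line, so
the comparison is only about speed.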


-- 
Steven Lembark
Workhorse Computing
lembark at wrkhors.com
+1 888 359 3508

