From lembark at wrkhors.com  Sun Apr  2 20:12:35 2023
From: lembark at wrkhors.com (Steven Lembark)
Date: Sun, 2 Apr 2023 23:12:35 -0400
Subject: [Chicago-talk] Help with Regex
In-Reply-To: <1671897554.m4po1g1pmo04s8ok@hostingemail.digitalspace.net>
References: <1671897554.m4po1g1pmo04s8ok@hostingemail.digitalspace.net>
Message-ID: <20230402231235.43b7f1fd.lembark@wrkhors.com>

On Sat, 24 Dec 2022 10:59:14 -0500
"Richard Reina" wrote:

Been crazy for a while, haven't kept up on email.

Step 1: Break your parsing up into chunks for the most-specific
portions of the input and take the more variable parts last.

You have a regex looking for:

    [, ]+ [, ]+ [\d-]

If the variable stuff were at the start then a regex would make
sense; when the variable stuff is at the end it's less work to
split it apart on separators, take the state & zip as-is, and
then re-combine whatever's left to get the city:

    my ( $city, $state, $zip )
    = do
    {
        my @wordz = split /\s+/, $line;
        s{,$}{} for @wordz;

        # what's left is the city, split into words,
        # followed by a state and a zip.

        my $z = pop @wordz;
        my $s = pop @wordz;

        ( join( ' ' => @wordz ), $s, $z )
    };

That, or use separate regexen to snag the state and zip up front,
then use them as anchors for whatever's before them:

    # make sure there isn't any extraneous cruft at the
    # line end or extra/non-space separators.

    $line =~ s{ \s+ $}{}x;
    $line =~ s{ \s+ }{ }gx;

    # at that point the state and zip are pretty specific:

    my ( $state, $zip )
    = $line =~ m{ (\w{2}) [, ]+ ([\d-]+) $}x;

    my ( $city )
    = $line =~ m{ (.+?) [, ]+ $state .+? $zip $}x;

or even:

    my ( $state, $zip )
    = $line =~ m{
        (\w{2})                 # two char state
        [, ]+                   # separators
        (\d{5} (?:-\d{4})?)     # zip, optional +4
        $
    }x;

    my ( $city )
    = $line =~ m{^ (.+?) [, ]+ $state [, ]+ $zip $}x;

The last approach is stripping the state and zip off and taking
whatever's left as the city (probably faster):

    # $state and $zip extracted as above.

    ( my $city = $line )                    # copy the line
    =~ s{ [, ]+ $state [, ]+ $zip $}{}x;    # strip the state & zip.
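
For reference, the strip-from-the-end approach can be put together
as a complete script. This is only a minimal sketch under stated
assumptions: the sub name split_city_state_zip and the sample
addresses are made up for illustration, and it assumes a two-letter
state and a five-digit zip with an optional four-digit extension.

```perl
use strict;
use warnings;

# Hypothetical helper (the name is an assumption, not from the thread).
# Returns ( city, state, zip ), or an empty list if the line does not
# end in a state and a zip.
sub split_city_state_zip
{
    my ( $line ) = @_;

    # normalize whitespace before matching.
    $line =~ s{^ \s+ }{}x;
    $line =~ s{ \s+ $}{}x;
    $line =~ s{ \s+ }{ }gx;

    # the state and zip are the most specific parts: take them first.
    my ( $state, $zip )
    = $line =~ m{ \b ([A-Za-z]{2}) [, ]+ (\d{5} (?:-\d{4})?) $}x
    or return;

    # copy the line, then strip the state & zip off the end;
    # whatever is left is taken as the city.
    ( my $city = $line )
    =~ s{ [, ]+ \Q$state\E [, ]+ \Q$zip\E $}{}x;

    return ( $city, $state, $zip );
}

for my $line ( 'Chicago, IL 60601', 'St. Louis, MO, 63101-1234', 'no zip here' )
{
    if( my ( $city, $state, $zip ) = split_city_state_zip( $line ) )
    {
        print "$city | $state | $zip\n";
    }
    else
    {
        print "no match: $line\n";
    }
}
```

Returning an empty list on failure lets the caller test the list
assignment directly, as the loop above does.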
-- 
Steven Lembark
Workhorse Computing
lembark at wrkhors.com
+1 888 359 3508

From richard at rushlogistics.com  Mon Apr  3 09:11:36 2023
From: richard at rushlogistics.com (Richard Reina)
Date: Mon, 03 Apr 2023 12:11:36 -0400
Subject: [Chicago-talk] Help with Regex
In-Reply-To: <20230402231235.43b7f1fd.lembark@wrkhors.com>
References: <1671897554.m4po1g1pmo04s8ok@hostingemail.digitalspace.net>
 <20230402231235.43b7f1fd.lembark@wrkhors.com>
Message-ID: <1680538296.921oflfhiscggock@hostingemail.digitalspace.net>

I ended up doing this. Seems to be working.

#######################################
sub Check4CityStZip {  ##### INTERNAL SUB
#######################################

    my ($address) = @_;
    my ($city, $state, $zip);

    my $regex = qr/
        ^
        ([^,]+)
        ,\s
        ([A-Z]{2}),?
        \s
        (\d{5}(?:-?\d{4})?)
        $
    /x;

    if ($address =~ m/$regex/) {
        ($city, $state, $zip) = ($1, $2, $3);
        print "FOUND CITY STATE ZIP: $city, $state $zip\n";
        #print "I think I found a city state zip: $address\n";

        return ($city, $state, $zip);
    } else {
        return 'NOPE';
    }

##################
} #EOS sub Check4CityStZip
##################

_______________________________________________
Chicago-talk mailing list
Chicago-talk at pm.org
https://mail.pm.org/mailman/listinfo/chicago-talk
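
One caveat with returning the string 'NOPE' on failure: a caller
that assigns the result to three variables gets 'NOPE' in the first
one, and a list assignment used as a condition still counts as true
because one element came back. A possible variation, sketched here
under assumptions (the sub name parse_city_st_zip is hypothetical,
and the comma after the state is treated as optional), is to return
the captures directly so that a failed match yields an empty list:

```perl
use strict;
use warnings;

# Same shape as the pattern posted in the thread: city up to the
# first comma, two capital letters for the state, five-digit zip
# with an optional four-digit extension.
my $city_st_zip = qr{
    ^
    ([^,]+)                     # city: everything before the first comma
    , \s
    ([A-Z]{2}) ,?               # state; trailing comma optional (assumed)
    \s
    (\d{5} (?:-?\d{4})?)        # zip, optional +4 with or without a dash
    $
}x;

# Hypothetical name. In list context this returns ( city, state, zip )
# on success and an empty list on failure.
sub parse_city_st_zip
{
    my ( $address ) = @_;

    return $address =~ $city_st_zip;
}

for my $address ( 'Chicago, IL 60601', 'Evanston, IL, 602011234', 'Chicago IL 60601' )
{
    if( my ( $city, $state, $zip ) = parse_city_st_zip( $address ) )
    {
        print "found: $city, $state $zip\n";
    }
    else
    {
        print "no match: $address\n";
    }
}
```

The sub has to be called in list context for this to work; in
scalar context the match returns only true or false.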