[Chicago-talk] Help with Regex

Richard Reina richard at rushlogistics.com
Mon Apr 3 09:11:36 PDT 2023


I ended up doing this. Seems to be working.

 
#######################################
sub Check4CityStZip {  ##### INTERNAL SUB
#######################################

    my ($address) = @_;

    my $regex = qr/
        ^
        ([^,]+)                 # city: everything before the first comma
        ,\s
        ([A-Z]{2}),?            # two-letter state, optional trailing comma
        \s
        (\d{5}(?:-?\d{4})?)     # ZIP or ZIP+4 (hyphen optional)
        $
    /x;

    if ($address =~ $regex) {

        my ($city, $state, $zip) = ($1, $2, $3);
        print "FOUND CITY STATE ZIP: $city, $state $zip\n";
        #print "I think I found a city state zip: $address\n";

        return ($city, $state, $zip);

    } else {

        # Note: callers get a one-element list here, not three values.
        return 'NOPE';

    }

##################
} #EOS sub Check4CityStZip
##################
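
A caller has to tell the three-value success case apart from the single 'NOPE' string. Here is a minimal sketch of one way around that, assuming it is acceptable to return an empty list on failure instead. The sub name parse_city_st_zip and the sample addresses are made up for illustration; the pattern is the same one as above.

```perl
use strict;
use warnings;

# Same pattern as Check4CityStZip, compiled once.
my $CITY_ST_ZIP = qr/
    ^
    ([^,]+)                 # city: everything before the first comma
    ,\s
    ([A-Z]{2}),?            # two-letter state, optional trailing comma
    \s
    (\d{5}(?:-?\d{4})?)     # ZIP or ZIP+4, hyphen optional
    $
/x;

# Hypothetical variant: returns an empty list on failure, not 'NOPE'.
sub parse_city_st_zip {
    my ($address) = @_;
    return $address =~ $CITY_ST_ZIP ? ( $1, $2, $3 ) : ();
}

# Made-up sample inputs.
for my $address ( 'Chicago, IL 60601', 'St. Louis, MO, 63101-1234', '123 Main St' ) {
    if ( my ( $city, $state, $zip ) = parse_city_st_zip($address) ) {
        print "city=$city state=$state zip=$zip\n";
    }
    else {
        print "no match: $address\n";
    }
}
```

The list assignment inside the if counts the returned elements, so an empty list reads as false and no sentinel string is needed.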





On Sun, 2 Apr 2023 23:12:35 -0400, Steven Lembark <lembark at wrkhors.com> wrote:

On Sat, 24 Dec 2022 10:59:14 -0500
"Richard Reina" wrote:

Been crazy for a while, haven't kept up on email.

Step 1: Break your parsing up into chunks for the most-
specific portions of the input and take the more variable
parts last.

You have a regex looking for:

[, ]+ [, ]+ [\d-]

If the variable stuff were at the start then a regex would
make sense; when the variable stuff is at the end it's
less work to split it apart on separators, take the state &
zip as-is, and then re-combine whatever's left to get the
city:

my ( $city, $state, $zip )
= do
{
    my @wordz = split /\s+/, $line;
    s{,$}{} for @wordz;

    # what's left is the city, split into words,
    # followed by a state and a zip.

    my $z = pop @wordz;
    my $s = pop @wordz;

    ( join( ' ' => @wordz ), $s, $z )
};
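
The split-and-pop idea above can be sketched as a small standalone sub. The name split_city_st_zip and the sample line are made up, and the guard for lines with fewer than three words is an added assumption. It does not validate that the last two words really look like a state and a zip.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the split-and-pop approach.
sub split_city_st_zip {
    my ($line) = @_;

    my @words = split ' ', $line;   # split on whitespace, skipping leading space
    s{,$}{} for @words;             # drop a trailing comma from each word

    return if @words < 3;           # need at least city, state, zip

    my $zip   = pop @words;
    my $state = pop @words;

    # Whatever is left is the city, possibly several words.
    return ( join( ' ', @words ), $state, $zip );
}

my ( $city, $state, $zip ) = split_city_st_zip('Des Plaines, IL 60016');
print "$city | $state | $zip\n";
```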

that or use separate regexen to snag the state and zip up
front then use them as anchors for whatever's before them:

# make sure there isn't any extraneous cruft at the
# line end or extra/non-space separators.

$line =~ s{\s+ $}{}x;
$line =~ s{\s+}{ }g;

# at that point the state and zip are pretty specific:

my ( $state, $zip ) = $line =~ m{ (\w{2}) [, ]+ ([\d-]+) $}x;
my ( $city ) = $line =~ m{ (.+?) [, ]+ $state .+? $zip $}x;

or even:

my ( $state, $zip )
= $line
=~ m
{
    (\w{2})                 # two char state
    [, ]+                   # separators
    (\d{5} (?:-\d{4})?)     # zip, optional +4
    $
}x;

my ( $city ) = $line =~ m{^ (.+) [, ]+ $state [, ]+ $zip $}x;
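
A runnable sketch of the two-pass idea, using a made-up sample line. Two things here are additions rather than part of the approach as described: the check that the first match succeeded before $state and $zip are reused, and \Q...\E around the interpolated values so they are matched literally.

```perl
use strict;
use warnings;

my $line = 'Oak Park, IL 60302-1234';   # made-up sample

# Pass 1: the state and zip are the specific part, so grab them first.
my ( $state, $zip ) = $line =~ m{
    ([A-Z]{2})              # two-letter state
    [, ]+                   # separators
    (\d{5} (?:-\d{4})?)     # ZIP or ZIP+4
    $
}x;

# Pass 2: use them as anchors for whatever comes before.
my $city;
if ( defined $state ) {
    # \Q...\E quotes the interpolated values so they match literally.
    ( $city ) = $line =~ m{^ (.+?) [, ]+ \Q$state\E [, ]+ \Q$zip\E $}x;
}

print defined $city ? "$city | $state | $zip\n" : "no match\n";
```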

last approach is stripping the state and zip off and taking
whatever's left as the city (probably faster):

# $state and $zip extracted as above

( my $city = $line )                    # copy the line
=~ s{ [, ]+ $state [, ]+ $zip $}{}x;    # strip the state & zip.
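
Put together, the strip-it-off approach might look like the sketch below. The sample line is made up, and the die on a failed match and the \Q...\E quoting are assumptions added for safety.

```perl
use strict;
use warnings;

my $line = 'Evanston, IL 60201';    # made-up sample

my ( $state, $zip ) = $line =~ m{ ([A-Z]{2}) [, ]+ (\d{5}(?:-\d{4})?) $}x;
die "no state/zip found in '$line'\n" unless defined $state;

# Copy the line, then strip the trailing state and zip from the copy.
( my $city = $line ) =~ s{ [, ]+ \Q$state\E [, ]+ \Q$zip\E $}{}x;

print "$city | $state | $zip\n";
```

On Perl 5.14 or later, the /r modifier on s{}{} returns the modified copy directly, which avoids the copy-then-modify idiom.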


--
Steven Lembark
Workhorse Computing
lembark at wrkhors.com
+1 888 359 3508
 

