[Pdx-pm] regex & phone numbers...

Tkil tkil at scrye.com
Thu Jul 25 14:28:49 CDT 2002


>>>>> "Kari" == Kari Chisholm <karic at lclark.edu> writes:

Kari> I want to convert the actual seven- or ten-digit phone number part to
Kari> just xxx-xxx-xxxx.  I also want to leave alone anything that comes
Kari> after that - which is obviously the tough part.  The logic should be
Kari> basically this: just process through the number left to right,
Kari> grabbing the first seven or ten numbers, then reformat those and tack
Kari> on whatever's left.  The challenge is figuring out when it's a
Kari> seven-digit or a ten-digit number.

Kari> I've conceptualized any number of highly complex and idiotic
Kari> ways of doing this.  I'm just wondering if there's a simpler
Kari> regex approach to this...  Any ideas?

Get a list of all the formats you think you need to worry about, write
a set of regexps that can handle all of them, then return an error for
any input that matches none of them.  The list is the important part:
it keeps you from making assumptions that might trip you up.

[I unintentionally did this to GBARR's Date::Parse::str2time function.
Going through a few hundred thousand mail messages, I found about 0.1% that had
bogus date strings that it couldn't parse gracefully.  Bit of a stress
test there.  And a sign of over-zealous error checking: str2time
rejected New Zealand Daylight Saving Time, because UTC+1300 is
"obviously" a bogus time zone...]

A straightforward version might be:

| #!/usr/bin/perl -w
| 
| use strict;
| 
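| # Takes a raw phone number and a default area code.  Returns the number
| # as "xxx-xxx-xxxx" followed by whatever trailed it, or nothing if the
| # input matches none of the known formats.
| sub normalize_phone_number ( $ $ )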
| {
|     my ($in, $default_ac) = @_;
| 
|     # abbreviations
|     my $d3 = '(\d{3})';
|     my $d4 = '(\d{4})';
| 
|     # is it already sane?
|     $in =~ /^    $d3    [\.\-\s] $d3 [\.\-\s] $d4 \s* (.*)/x
|       and return length($4) ? "$1-$2-$3 $4" : "$1-$2-$3";
| 
|     # area code in parens
|     $in =~ /^ \( $d3 \) [\-\s]   $d3 [\-\s]   $d4 \s* (.*)/x
|       and return length($4) ? "$1-$2-$3 $4" : "$1-$2-$3";
| 
|     # missing area code; fall back to the caller's default
|     $in =~ /^                    $d3 [\.\-\s] $d4 \s* (.*)/x
|       and return length($3) ? "$default_ac-$1-$2 $3" : "$default_ac-$1-$2";
| 
|     return;
| }
| 
| while (my $in = <DATA>)
| {
|     chomp $in;
|     if (my $out = normalize_phone_number $in, '503')
|     {
|         printf "%-30s => %s\n", $in, $out;
|     }
|     else
|     {
|         print "$in: couldn't parse!\n";
|     }
| }
| 
| __END__
| (503) 123-4567
| 503.123.4567
| (503)-123-4567
| 123-4567
| 503 123 4567
| 503-123-4567 ext. 89
| 123-4567 ext. 89
| 503-123-4567-mom's house
| 858-123-4239 x23
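
The list of formats can also be kept as data instead of as a chain of
matches, which makes adding a newly discovered format a one-line change.
What follows is only a minimal sketch, assuming the same three formats
as above; the names normalize_phone, @formats and $sep are made up for
illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical separator class and format table (illustrative names).
my $sep = qr/[.\-\s]/;

# Each entry: [ pattern, true if the pattern captures an area code ].
# Formats are tried in order; the first match wins.
my @formats = (
    [ qr/^ (\d{3}) $sep (\d{3}) $sep (\d{4}) \s* (.*)/x,           1 ],
    [ qr/^ \( (\d{3}) \) [\-\s] (\d{3}) [\-\s] (\d{4}) \s* (.*)/x, 1 ],
    [ qr/^ (\d{3}) $sep (\d{4}) \s* (.*)/x,                        0 ],
);

sub normalize_phone {
    my ($in, $default_ac) = @_;

    for my $format (@formats) {
        my ($re, $has_ac) = @$format;

        # List assignment is false when the pattern does not match.
        my @parts = $in =~ $re or next;

        my $ac = $has_ac ? shift @parts : $default_ac;
        my ($exchange, $line, $rest) = @parts;

        my $out = "$ac-$exchange-$line";
        $out .= " $rest" if length $rest;    # keep whatever trailed the number
        return $out;
    }

    return;    # no known format matched; the caller reports the error
}

for my $in ('(503) 123-4567', '123-4567 ext. 89', '5031234567') {
    my $out = normalize_phone($in, '503');
    print defined $out ? "$in => $out\n" : "$in: couldn't parse!\n";
}
# (503) 123-4567 => 503-123-4567
# 123-4567 ext. 89 => 503-123-4567 ext. 89
# 5031234567: couldn't parse!
```

The last input has no separators at all, so it is rejected rather than
guessed at; supporting it would mean adding another row to the table.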

t.


