SPUG: RE / Split Question

Fred Morris m3047 at inwa.net
Thu Jul 31 10:01:59 CDT 2003


John Krahn wrote:
>On Wednesday 30 July 2003 16:54, Orr, Chuck (NOC) wrote:
>>
>> Please help with the following dilemma:
>>
>>      I am being given a glob of data from a web page that I need to
>> fix with perl.  It comes in as $blob looking like this:
>>
>> 425 501 sttlwa01t 425 712 sttlwa01t tacwa02t 425 337 tacwa02t ...
>>
>> I need to break this up so the word characters associated with the
>> numbers stay with their numbers.  Ideally, I would have an array like
>> this:
>>
>> 425 501 sttlwa01t
>> 425 712 sttlwa01t tacwa02t
>> 425 337 tacwa02t
>>
>> As you can see, I am not assured of the number of words that will
>> follow each set of numbers.  Could you please suggest a split or some
>> other tool that will turn the glob into the fix?
>>
>> $new_array = [ split /(?=[A-Z]\s\d)/,$scalar ];
>>
>> Which is as close as we got, does not work.  It keeps the split
>> characters, but in a funky way that I cannot deal with.  It also will
>> always miss the last chunk of the glob.
>
>
>How about this?
>
>$ perl -le'
>$glob = "425 501 sttlwa01t 425 712 sttlwa01t tacwa02t 425 337 tacwa02t ";
>
>@array = $glob =~ /( \b\d+ \s+ \d+ (?:\s+ \D\w*)+ )/xg;
>
>print for @array;
>'
>425 501 sttlwa01t
>425 712 sttlwa01t tacwa02t
>425 337 tacwa02t

This gets my vote, as it almost entirely avoids Content Imposition Disorder.
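
For anyone wondering why the lookahead split in the original question comes out "funky", here is a minimal sketch. The /i modifier is my assumption (the sample data is lowercase, so [A-Z] alone never matches), and the second split is only one possible alternative, not something anyone in the thread proposed. The lookahead fires just before the last letter of a word, so that letter ends up on the wrong side of the cut.

```perl
use strict;
use warnings;

# Sample data from the original question.
my $blob = "425 501 sttlwa01t 425 712 sttlwa01t tacwa02t 425 337 tacwa02t ";

# The attempted split, with /i added (assumption) so [A-Z] can match
# the lowercase sample. It cuts before the final letter of each word
# that is followed by whitespace and a digit.
my @attempt = split /(?=[A-Z]\s\d)/i, $blob;

# Alternative sketch: trim trailing whitespace, then split on the
# whitespace that precedes "digits, whitespace, digits, whitespace".
(my $trimmed = $blob) =~ s/\s+\z//;
my @fixed = split /\s+(?=\d+\s+\d+\s)/, $trimmed;

print "[$_]\n" for @attempt;
# [425 501 sttlwa01]
# [t 425 712 sttlwa01t tacwa02]
# [t 425 337 tacwa02t ]

print "[$_]\n" for @fixed;
# [425 501 sttlwa01t]
# [425 712 sttlwa01t tacwa02t]
# [425 337 tacwa02t]
```

The match-based approach quoted above still reads more clearly to me, since it says what a record looks like rather than where the gaps are.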


I note several things about the problem and the context in which it was stated.

1. it is "..data from a web page".

2. all records start with 425

3. the poster posted from AT&T Wireless


Presumably (3) implies Bellevue, WA.

The area code for Bellevue, WA. is 425.

There is no standard 425 web server error.


I think it's reasonable to infer that (1) is a snark.

I also think it's reasonable to infer that the implicit records consist of
an area code, an exchange, plus some string of codes, possibly cell relays.
But that's just CID setting in. If the poster didn't see fit to provide
this much content in their initial post, why should I impose it?

(If it's data from a web page, it's likely that the area code and exchange
come in as single-valued parameters and that the cells are a multivalued
parameter.)

If I were getting paid to make such a judgement call, the expression I'd use
would be:

  @array = $glob =~ /( \b\d{3} \s+ \d{3} (?:\s+ \D\w*)+ )/xg;

Which imposes more content than the original solution (or does it?). Even
though it's an imposition of content, I suspect your notion of the cell
identifiers being legal identifiers is bang on.
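
To see where the two expressions actually part ways, here is a small sketch. The sample string is from the original question; the malformed string with a four-digit first field is hypothetical, made up only to show the difference. On the sample data both patterns return the same three records; on the malformed input the \d+ version still accepts the record and the \d{3} version rejects it.

```perl
use strict;
use warnings;

# Sample data from the original question.
my $glob = "425 501 sttlwa01t 425 712 sttlwa01t tacwa02t 425 337 tacwa02t ";

# Hypothetical malformed input: a four-digit first field.
my $odd = "4250 501 sttlwa01t";

# The pattern from the quoted solution, and the three-digit variant.
my $any_digits   = qr/( \b\d+    \s+ \d+    (?:\s+ \D\w*)+ )/x;
my $three_digits = qr/( \b\d{3}  \s+ \d{3}  (?:\s+ \D\w*)+ )/x;

# Return every record the pattern captures from the text.
sub records {
    my ($re, $text) = @_;
    return $text =~ /$re/g;
}

my @a_good = records($any_digits,   $glob);
my @b_good = records($three_digits, $glob);
my @a_odd  = records($any_digits,   $odd);
my @b_odd  = records($three_digits, $odd);

print scalar(@a_good), " ", scalar(@b_good), "\n";   # 3 3
print scalar(@a_odd),  " ", scalar(@b_odd),  "\n";   # 1 0
```

Whether silently dropping the odd record is better than silently keeping it depends on what the data really is, which is exactly the content nobody has supplied.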


But, only the depp knows for sure! (Oh what do we do with broken data? What
do we do with broken data? What do we do with broken data earlie in the
mornin'!)

--

Fred Morris
m3047 at inwa.net




