[tpm] Regex assistance

Thu Aug 11 08:52:52 PDT 2016

On 08/11/2016 11:00 AM, Olaf Alders wrote:
>
>> On Aug 11, 2016, at 10:37 AM, Chris Jones <cj at enersave.ca> wrote:
>>
>>
>> Hello Perl Mongers,
>>
>> I am looking for assistance with a regex. I have a bunch of strings in the form:
>>
>> "01.03.16,,Studio one, Space 22,1         500,500,01.051,,"
>> or
>> ",01.03.16,,Studio one, Space 22,1         500,500,01.051,"
>> or
>> ",01.03.16,,Studio one, Space 22,1         500,500,01.051,,"
>> or
>> ",01.03.16,,Studio one, Space 22, ,01.051,,"
>>
>> So the middle section can be one or more comma separated strings.
>>
>> I am trying to match and return the first non-blank pattern and the last non-blank pattern
>> 01.03.16 and 01.051 – these numbering formats are always the same: xx.xx.xx and yy.yyy
>>
>> So far I have a regex that matches the first pattern:
>>
>> "([0-9]{2})([\.])([0-9]{2})([\.])([0-9]{2})"
>>
>> in any of the above examples.
>>
>> I am stuck after that.
>> Any insights appreciated!
>
> I know you're looking for a regex, but you can do this with a split as well, which may be easier to read.
>
> use List::AllUtils qw( first );
>
> my $foo = "01.03.16,,Studio one, Space 22,1         500,500,01.051,,";
> my @foo = split m{,}, $foo;
>
> my $first = first { /\S/ } @foo;           # first field with a non-blank character
> my $last  = first { /\S/ } reverse @foo;   # last such field
>
> Having said that, it looks like you're maybe parsing a CSV file, in which case just using a CSV parser from CPAN would help catch any corner cases.
>
> Olaf

I would go one step further than Olaf.  It looks to me that you're 
trying to parse *badly formatted* CSV data.  Each column in well-defined 
CSV data should have a specific real-world meaning.  In your sample 
data, the pattern '(\d{2}\.){2}\d{2}' can appear in either the first or 
second column -- which means that the real-world meaning of either of 
those columns is, at best, ambiguous.
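
To make the ambiguity concrete, here is a rough sketch that reports which 
column holds the date-like field in two of the sample records.  The helper 
name date_column is made up for illustration, and the sketch assumes the 
surrounding double quotes in the samples are not part of the data.

```perl
use strict;
use warnings;

# Two sample records copied from the original question.
my @records = (
    '01.03.16,,Studio one, Space 22,1         500,500,01.051,,',
    ',01.03.16,,Studio one, Space 22,1         500,500,01.051,',
);

# Hypothetical helper: return the zero-based index of the first field
# that consists entirely of the xx.xx.xx pattern, or -1 if none does.
sub date_column {
    my ($record) = @_;
    my @fields = split /,/, $record;
    for my $i (0 .. $#fields) {
        return $i if $fields[$i] =~ /\A(?:\d{2}\.){2}\d{2}\z/;
    }
    return -1;
}

for my $n (0 .. $#records) {
    printf "record %d: date-like field in column %d\n",
        $n, date_column($records[$n]);
}
```

The first record reports column 0 and the second reports column 1, so a 
fixed column number cannot be relied on for this field.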

If you're going to need to parse this data over the long haul, you'd 
probably be better off getting the data provider to clean up the data. 
IMO unless and until you are getting really clean data, trying to 
compose a single killer regex is a waste of effort.

If this is a one-off, then you can take Olaf's suggestions.  Or you could 
pass each record through a series of regexes which trim the extraneous 
fields from the beginning and end of the strings, then use other regexes 
to capture what you need.  That would be slower, but perhaps more 
self-documenting.
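
A minimal sketch of that series-of-regexes idea, under some assumptions: 
the surrounding double quotes are not part of the data, the xx.xx.xx value 
is always the first non-blank field, and the yy.yyy value is always the 
last one.  The sub name first_and_last is invented for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: trim the record, then capture the two values.
sub first_and_last {
    my ($record) = @_;

    # Steps 1 and 2: strip leading and trailing commas and whitespace.
    (my $trimmed = $record) =~ s/\A[\s,]+//;
    $trimmed =~ s/[\s,]+\z//;

    # Step 3: xx.xx.xx must now be at the very start, followed by a comma.
    my ($first) = $trimmed =~ /\A(\d{2}\.\d{2}\.\d{2})(?=,)/;

    # Step 4: yy.yyy must now be at the very end, preceded by a comma.
    my ($last) = $trimmed =~ /,(\d{2}\.\d{3})\z/;

    return ($first, $last);    # either may be undef if the record is malformed
}

# The four sample records from the original question.
my @samples = (
    '01.03.16,,Studio one, Space 22,1         500,500,01.051,,',
    ',01.03.16,,Studio one, Space 22,1         500,500,01.051,',
    ',01.03.16,,Studio one, Space 22,1         500,500,01.051,,',
    ',01.03.16,,Studio one, Space 22, ,01.051,,',
);

for my $record (@samples) {
    my ($first, $last) = first_and_last($record);
    print "$first and $last\n";
}
```

Each step does one small job, so a record that fails to parse can be 
traced to the specific step where it went wrong.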

Thank you very much.
Jim Keenan