[Philadelphia-pm] selective splitting?

Thu Nov 17 18:40:12 PST 2016

Hi Morgan,

I totes agree re: peer review!

Lookaround assertions are what I'd reach for first for your problem, too, but I think they fall short:

my $v = '20161116172606Z;accepted-terms-of-use via CAS;192.168.1.5;Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/602.2.14 (KHTML, like Gecko) Version/10.0.1 Safari/602.2.14';
my @naive_parts = split /;/, $v;
my @parts = split /(?<!\(.+);(?!.+\))/, $v;
map { print "$_\n" } @parts;

If you run that, it'll say 

  Variable length lookbehind not implemented in regex m/(?<!\(.+);(?!.+\))/

So my understanding is that the RE engine can't validate a variable width look-behind assertion, though I don't know why.
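
To make the fixed-width versus variable-width distinction concrete, here is a small sketch. The patterns are kept in strings so they get compiled at run time and the failure can be caught with eval; the variable names are only for illustration, and the exact wording of the error differs between perl versions.

```perl
use strict;
use warnings;

# Patterns held in strings so they are compiled at run time, inside eval.
my $fixed    = '(?<!\();';      # look-behind of exactly one character
my $variable = '(?<!\(.+);';    # look-behind of unbounded length

my $fixed_ok    = eval { my $re = qr/$fixed/;    1 } ? 1 : 0;
my $variable_ok = eval { my $re = qr/$variable/; 1 } ? 1 : 0;
my $error       = $@;   # message from the failed compile, if any

print "fixed-width look-behind compiles:    $fixed_ok\n";
print "variable-width look-behind compiles: $variable_ok\n";
print "perl said: $error" if $error;
```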

Workarounds people have come up with are using the '\K' escape (see perldoc perlre), or reversing the string and doing a look-ahead instead!

I've never used the '\K' method and don't understand it. Reversing the string won't work for you because you want both look-ahead /and/ look-behind in the same regex.
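
That said, a look-ahead on its own may be enough for this particular input: a semicolon sits inside parens when the next paren character after it is a ')'. Below is a minimal sketch of that idea, assuming the parens are balanced and never nested (it misfires on something like "(a ; (b) c)"). The variable names are mine.

```perl
use strict;
use warnings;

my $v = '20161116172606Z;accepted-terms-of-use via CAS;192.168.1.5;Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/602.2.14 (KHTML, like Gecko) Version/10.0.1 Safari/602.2.14';

# Split on a semicolon unless a ')' shows up before any '(' after it.
# Only look-ahead is used, and look-ahead is allowed to be variable width.
my @parts = split /;(?![^()]*\))/, $v;

print "$_\n" for @parts;
```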

Given all of that, my brain wants to treat this as a three-step process like a compiler might.

1) using either another regex or the range operator[s], substitute a placeholder for all the semicolons that are inside parens
2) perform your split with a dead simple split regex, /;/
3) replace the placeholders with semicolons on each part after it's been split
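
In compact form, those steps might look something like the sketch below. It assumes a NUL byte never occurs in the data (so it is safe as the placeholder) and that parens are not nested; both are my assumptions, not anything your log format guarantees.

```perl
use strict;
use warnings;

my $v = '20161116172606Z;accepted-terms-of-use via CAS;192.168.1.5;Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/602.2.14 (KHTML, like Gecko) Version/10.0.1 Safari/602.2.14';

# 1) Hide semicolons inside each (...) group behind a NUL placeholder.
(my $masked = $v) =~ s{(\([^()]*\))}{ (my $group = $1) =~ tr/;/\0/; $group }ge;

# 2) Split on the semicolons that are left.
my @parts = split /;/, $masked;

# 3) Put the hidden semicolons back in every part.
tr/\0/;/ for @parts;

print "$_\n" for @parts;
```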

See attached sample code!

Cheers,
Nate

PS Nice meeting you all on Monday!

On Thu, Nov 17, 2016 at 08:40:37PM -0500, Morgan Jones wrote:
> mjd’s talk Monday has me thinking about peer review and how helpful it can be.  So here goes.  I can certainly work around this but as a learning experience I’m wondering if someone has a straightforward answer. Can I split on only instances of a character that is not surrounded by in this case parentheses?
> 
> I have a semicolon separated string that contains a date, a string, an ip address and a user agent string.  The catch is the user agent string contains a semicolon however it’s between parentheses.  So what I want is to split on semicolons that are not surrounded by parentheses.
> 
> For example:
> $v = ‘20161116172606Z;accepted-terms-of-use via CAS;192.168.1.5;Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/602.2.14 (KHTML, like Gecko) Version/10.0.1 Safari/602.2.14’;
> 
> It seems to me I should be able to split like this:
> my ($date, $ignore, $ip, $agent) = split /[^\(]+[^\;]*\;[^\)]*[^\)]+/, $v;
> 
> From a little reading I may need to use look aheads which are new to me.  Here’s an attempt at that that is of course not working:
> my ($date, $ignore, $ip, $agent) = 
> 	    	split /(?<!()
>                        \;
>                        (?!))/x, $v;
> 
> 
> Does anyone have a suggestion or see what I’m missing?
> 
> thanks,
> 
> -morgan
> _______________________________________________
> Philadelphia-pm mailing list
> Philadelphia-pm at pm.org
> http://mail.pm.org/mailman/listinfo/philadelphia-pm
-------------- next part --------------
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

my $v = '20161116172606Z;accepted-terms-of-use via CAS;192.168.1.5;Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/602.2.14 (KHTML, like Gecko) Version/10.0.1 Safari/602.2.14';

my @parens_parts = $v =~ m/(\(.+?\))/g;
print Dumper \@parens_parts;
print "\n";

my $fake_semicolon = '#SEMICOLON#';
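for my $orig (@parens_parts) {
    next unless $orig =~ m/;/;
    # Build a copy of the parenthesized chunk with its semicolons masked.
    (my $masked = $orig) =~ s/;/$fake_semicolon/g;
    # \Q...\E matters here: $orig contains literal parens, which would
    # otherwise be read as a capture group and leave doubled parens behind.
    $v =~ s/\Q$orig\E/$masked/;
}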
# Now $v has your fake_semicolon in place of all the troublesome semicolons:
print "$v\n\n";

# Now splitting is trivial!
my @naive_parts = split /;/, $v;

# We just have to remember to undo the placeholders:
for my $part (@naive_parts) {
    $part =~ s/$fake_semicolon/;/g;
    print "$part\n";
}