[Philadelphia-pm] selective splitting?

Brian Duggan bduggan at matatu.org
Fri Nov 18 04:43:13 PST 2016


Hi All,

I'll go ahead and throw in a Perl 6 solution:

    $ cat split.pl
    #!/usr/bin/env perl6

    my $v = '20161116172606Z;accepted-terms-of-use via CAS;192.168.1.5;Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/602.2.14 (KHTML, like Gecko) Version/10.0.1 Safari/602.2.14';

    grammar log {
        token TOP { [ <outer> | <balanced> ]+ %% ';' }
        token outer { <-[;()]>+ }
        token inner { <-[()]>+ }
        token balanced { [ <outer>? '(' <inner> ')' <outer>? ] + }
    }

    say log.parse($v);

And the output:

    $ perl6 split.pl
    「20161116172606Z;accepted-terms-of-use via CAS;192.168.1.5;Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/602.2.14 (KHTML, like Gecko) Version/10.0.1 Safari/602.2.14」
    outer => 「20161116172606Z」
    outer => 「accepted-terms-of-use via CAS」
    outer => 「192.168.1.5」
    balanced => 「Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/602.2.14 (KHTML, like Gecko) Version/10.0.1 Safari/602.2.14」
        outer => 「Mozilla/5.0 」
        inner => 「Macintosh; Intel Mac OS X 10_12_1」
        outer => 「 AppleWebKit/602.2.14 」
        inner => 「KHTML, like Gecko」
        outer => 「 Version/10.0.1 Safari/602.2.14」
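
For comparison, here is a rough Perl 5 sketch of two ways to get the same four fields. It is only an illustration, not the attached morgan.pl.txt: the sub names fields_by_matching and fields_by_placeholder are made up for this example. It assumes the parentheses are balanced and not nested, and that the input never contains a NUL byte (used here as the placeholder). The first sub matches fields instead of splitting, much like the grammar above; the second follows the placeholder steps described further down the thread.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my $v = '20161116172606Z;accepted-terms-of-use via CAS;192.168.1.5;Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/602.2.14 (KHTML, like Gecko) Version/10.0.1 Safari/602.2.14';

# Approach 1: match fields rather than split on separators.
# A field is one or more of: a character that is neither ';' nor '(',
# or a whole parenthesized group. Assumes balanced, non-nested parens.
# Note that empty fields (";;") are skipped by this pattern.
sub fields_by_matching {
    my ($line) = @_;
    return $line =~ /( (?: [^;(] | \( [^)]* \) )+ )/gx;
}

# Approach 2: hide the semicolons inside parens, split, then restore them.
sub fields_by_placeholder {
    my ($line) = @_;
    my $mark = "\x00";    # assumption: never occurs in the input

    # 1) replace ';' with the placeholder inside each parenthesized group
    my $masked = $line;
    $masked =~ s{(\([^()]*\))}{ my $group = $1; $group =~ s/;/$mark/g; $group }ge;

    # 2) split on the remaining (top-level) semicolons
    my @parts = split /;/, $masked;

    # 3) put the hidden semicolons back in each part
    s/\Q$mark\E/;/g for @parts;

    return @parts;
}

print "$_\n" for fields_by_matching($v);
```

Calling fields_by_placeholder($v) should give the same list. The matching version avoids lookbehind entirely, which sidesteps the variable-length lookbehind error quoted below.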

Brian



On Friday, November 18, Morgan Jones wrote: 
> Nate,
> 
> That’s an elegant and simple solution, thanks.  It’s also much more readable than what I was working on.  I’ll integrate it tomorrow.
> 
> -morgan
> 
> 
> > On Nov 17, 2016, at 21:40, Nate Smith <nate at perlhack.com> wrote:
> > 
> > 
> > Hi Morgan,
> > 
> > I totes agree re: peer review!
> > 
> > Lookaround assertions are what I'd reach for first for your problem, too, but I think they fall short:
> > 
> > my $v = '20161116172606Z;accepted-terms-of-use via CAS;192.168.1.5;Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/602.2.14 (KHTML, like Gecko) Version/10.0.1 Safari/602.2.14';
> > my @naive_parts = split /;/, $v;
> > my @parts = split /(?<!\(.+);(?!.+\))/, $v;
> > map { print "$_\n" } @parts;
> > 
> > If you run that, it'll say 
> > 
> >  Variable length lookbehind not implemented in regex m/(?<!\(.+);(?!.+\))/
> > 
> > So my understanding is that the RE engine can't validate a variable width look-behind assertion, though I don't know why.
> > 
> > Workarounds people have come up with are using the '\K' escape (see perldoc perlre), or reversing the string and doing a look-ahead instead!
> > 
> > I've never used the '\K' method and don't understand it. Reversing the string won't work for you because you want both look-ahead /and/ look-behind in the same regex.
> > 
> > Given all of that, my brain wants to treat this as a three-step process like a compiler might.
> > 
> > 1) using either another regex or the range operator[s], substitute a placeholder for all the semicolons that are inside parens
> > 2) perform your split with a dead simple split regex, /;/
> > 3) replace the placeholders with semicolons on each part after it's been split
> > 
> > See attached sample code!
> > 
> > Cheers,
> > Nate
> > 
> > PS Nice meeting you all on Monday!
> > 
> > On Thu, Nov 17, 2016 at 08:40:37PM -0500, Morgan Jones wrote:
> >> mjd’s talk Monday has me thinking about peer review and how helpful it can be.  So here goes.  I can certainly work around this, but as a learning experience I’m wondering if someone has a straightforward answer. Can I split only on instances of a character that are not surrounded by, in this case, parentheses?
> >> 
> >> I have a semicolon-separated string that contains a date, a string, an IP address and a user agent string.  The catch is that the user agent string contains a semicolon; however, it’s between parentheses.  So what I want is to split on semicolons that are not surrounded by parentheses.
> >> 
> >> For example:
> >> $v = '20161116172606Z;accepted-terms-of-use via CAS;192.168.1.5;Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/602.2.14 (KHTML, like Gecko) Version/10.0.1 Safari/602.2.14';
> >> 
> >> It seems to me I should be able to split like this:
> >> my ($date, $ignore, $ip, $agent) = split /[^\(]+[^\;]*\;[^\)]*[^\)]+/, $v;
> >> 
> >> From a little reading, I may need to use lookaheads, which are new to me.  Here’s an attempt at that, which is of course not working:
> >> my ($date, $ignore, $ip, $agent) = 
> >> 	    	split /(?<!()
> >>                       \;
> >>                       (?!))/x, $v;
> >> 
> >> 
> >> Does anyone have a suggestion or see what I’m missing?
> >> 
> >> thanks,
> >> 
> >> -morgan
> >> _______________________________________________
> >> Philadelphia-pm mailing list
> >> Philadelphia-pm at pm.org
> >> http://mail.pm.org/mailman/listinfo/philadelphia-pm
> > <morgan.pl.txt>
> 