DCPM: splitting on lookaheads - verbose

Fri Oct 31 07:37:34 CST 2003

> sub border;

Pre-declare the subroutine so it doesn't clog the top of my code.

> { local $/; $string = <DATA> }

I use a block here to make sure that the $/ variable only retains its
new value for the duration of the read. If I tried to do this in the
main scope, I would have to remember the old value of $/, set it,
read, and then set it back. That is exactly what local does for me,
and the block marks the point where the old value is restored.

This is one of the "Seven Useful Uses of Local", a Perl Journal
article which can be found at: http://perl.plover.com/local.html

The $/ variable is the "input record separator", which effectively
means "that which the input is split on". By default this is newline,
and hence we get lines from files when we use it in its default
state. 

Setting $/ to undef, which is the effect of localizing it without
assigning a value, makes the whole contents of the file come in at
once: since there is no record separator, the first record is the
whole file.
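
A minimal sketch of the slurp idiom, in case it helps to see it in
isolation. The in-memory filehandle and the variable names ($text,
$slurped, $separator_restored) are my own stand-ins for DATA and are
not part of the script above:

```perl
use strict;
use warnings;

# Hypothetical sample text standing in for the __DATA__ section.
my $text = "fred 1\ndsa\nfred 2\nsdaf\n";

# Open an in-memory filehandle on the string (stands in for DATA).
open my $fh, '<', \$text or die "cannot open in-memory handle: $!";

my $slurped;
{
    local $/;            # $/ is undef inside this block only
    $slurped = <$fh>;    # a single read returns everything
}
close $fh;

# Outside the block, $/ should be back to its default of "\n".
my $separator_restored = ( defined $/ && $/ eq "\n" ) ? 1 : 0;

print "read ", length($slurped), " characters in one go\n";
print "separator restored: $separator_restored\n";
```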

You'll find another special case (paragraph splitting) in perlvar.
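
As a rough illustration of that paragraph mode (setting $/ to the
empty string), here is a sketch that reads blank-line-separated
chunks. The sample text and the names $text and @paragraphs are
assumptions made up for the example:

```perl
use strict;
use warnings;

# Hypothetical text: two paragraphs separated by blank lines.
my $text = "para one, line 1\npara one, line 2\n\n\npara two\n";

open my $fh, '<', \$text or die "cannot open in-memory handle: $!";

my @paragraphs;
{
    local $/ = "";          # empty string selects paragraph mode
    @paragraphs = <$fh>;    # each element is one paragraph
}
close $fh;

print scalar(@paragraphs), " paragraphs read\n";
```

Runs of blank lines count as a single separator here, which is what
distinguishes this from setting $/ to "\n\n".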

DATA is a special file handle which, as you can probably see, reads
from that which follows __DATA__. This allowed me to embed my data in
the example. How very BASIC.

> print border('=',$string);

Pretty print.

> @records = split (/(?=^fred)/m, $string);

Here, I want to split the string up, but I don't want to lose the
thing which starts each record (fred).

If I'd used:

split (/fred/, $string);

I'd have got:

----------
----------
----------
 1
dsa
dsa
----------
----------
 2
sdaf
dsa
fsda
dsa
----------

Here, not only do I lose 'fred' but I also get an empty record at the
beginning, since in this context it's a separator, i.e. the thing that
strictly goes between records.

So, I use a "zero-width positive look-ahead assertion". That is to say
that the regex engine looks for it, but doesn't include it in the
match. This is why it's retained.

So, why don't I get an empty record at the beginning still? Well,
split only produces an empty leading field when the match at the start
of the string is of positive width. See perldoc -f split.
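
To put the two behaviours side by side, here is a small sketch. The
shortened sample string and the array names @consumed and @kept are
mine, chosen only for illustration:

```perl
use strict;
use warnings;

# Hypothetical, shortened version of the data used in the post.
my $string = "fred 1\ndsa\nfred 2\nsdaf\n";

# Plain pattern: 'fred' is consumed, and the positive-width match at
# the very start of the string yields an empty leading field.
my @consumed = split /fred/, $string;

# Look-ahead: the match is zero-width, so 'fred' stays in each record
# and no empty leading field is produced.
my @kept = split /(?=fred)/, $string;

print scalar(@consumed), " fields with /fred/\n";
print scalar(@kept),     " fields with /(?=fred)/\n";
print "[$_]\n" for @kept;
```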

OK, so, why the ^ ? Well, clearly, if my data was:

__DATA__
fred 1
dsa
dsa
fred 2
sdaf
dsa fred
fsda
dsa

I'd get from: split (/(?=fred)/, $string);

----------
fred 1
dsa
dsa
----------
----------
fred 2
sdaf
dsa ----------
----------
fred
fsda
dsa
----------

This is clearly not what I want, so I need to match fred at the
beginning of the line.

However, split (/(?=^fred)/, $string); gives:

----------
fred 1
dsa
dsa
fred 2
sdaf
dsa fred
fsda
dsa
----------

So, the final /m gives me the "multi-line match" I need: ^ now matches
at the beginning of every line, not just the beginning of the string.
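
A quick way to check the effect of /m for yourself is to count the
records each version returns. This is only a sketch; the sample string
and the names @without_m and @with_m are assumptions for the example:

```perl
use strict;
use warnings;

# Hypothetical data with one 'fred' that is not at the start of a line.
my $string = "fred 1\ndsa\nfred 2\nsdaf\ndsa fred\nfsda\n";

# Without /m, ^ only matches at the start of the whole string.
my @without_m = split /(?=^fred)/, $string;

# With /m, ^ also matches just after each embedded newline.
my @with_m = split /(?=^fred)/m, $string;

print scalar(@without_m), " record(s) without /m\n";
print scalar(@with_m),    " record(s) with /m\n";
```

The mid-line 'fred' stays inside the second record in both cases,
because it never sits at the beginning of a line.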

> print map { border('-',$_) } @records;

Pretty print again.

> exit;

Habit.

>  sub border {
>   
>   my ($mark,$string) = @_;
>   
>   $mark x 10,"\n",$string,$mark x 10,"\n";
> }

Though simple, this looks cack in the code, hence the sub.

> __DATA__
> fred 1
> dsa
> dsa
> fred 2
> sdaf
> dsa fred
> fsda
> dsa

The data, amended with an extra fred.

Questions are, of course, always welcome.

Steve

PS AFAIK zero-width look-ahead assertions are a Perl extension to
   regular expressions, though other languages have since borrowed
   them.