[San-Diego-pm] Recompiling global substitution RE?

Sun Mar 21 01:19:00 PDT 2010

Reuben Settergren <ruberad at gmail.com> writes:
> [...] is there any way I can force the *left* side to compile
> multiple times? (Even within the same /g substitution?) I tried a
> few constructs using the \G (start matching from previous position)
> zero-width assertion, but only managed to get myself into infinite
> loops.

It's doable, but it gets ugly to do it flexibly.

First, a simple (hardcoded) proof of concept:

| #!/usr/bin/perl
| 
| use warnings;
| use strict;
| 
| my $orig = "foo1 foo2 foo2 foo3 foo4 foo2 foo5";
| print "orig: $orig\n";
| 
| my $done = '';
| my $n = 1;
| while ( $orig =~ m!\G(.*?)(foo$n)!gc )
| {
|     $done .= $1;
|     $done .= 'bar' . $n if defined $2;
|     ++$n;
| }
| 
| print "done: $done\n";
| 
| exit 0;

(Online at: http://scrye.com/~tkil/perl/simple-recomp-replace.plx )

Output:

| $ ./simple-recomp-replace.plx 
| orig: foo1 foo2 foo2 foo3 foo4 foo2 foo5
| done: bar1 bar2 foo2 bar3 bar4 foo2 bar5

Making that generic... I probably didn't do a great job.  It might be
cleaner if it's implemented as a class; I don't know.  Here's what I
came up with:

| #!/usr/bin/perl
| 
| use warnings;
| use strict;
| 
| =over
| 
| =item my $done = recomp_replace $orig, $sub_ref;
| 
| This function provides a relatively generic way to do a global search
| and replace while allowing the matching regex to vary.
| 
| =over
| 
| =item $orig
| 
| This is the original string.
| 
| =item $sub_ref
| 
| This is a callback which is fed each matching portion of the $orig
| string.  Its return value is a list of two items: first, the text to
| substitute into the final string; and second, the regex to use to
| match the next chunk.
| 
| A special case is when this is called without any arguments; this is
| used to obtain the starting regex (and the return text is discarded).
| 
| =back
| 
| Example:
| 
|   my $orig = "foo1 foo2 foo2 foo3 foo4 foo2 foo5";
|   print "orig: $orig\n";
| 
|   my $n = 0;
|   sub numbered_foo
|   {
|       my ( $chunk ) = @_;
|       return ( "bar" . $n++, "foo$n" );
|   }
| 
|   my $done = recomp_replace $orig, \&numbered_foo;
|   print "done: $done\n";
| 
| This generates the following output:
| 
|   orig: foo1 foo2 foo2 foo3 foo4 foo2 foo5 baz
|   done: bar1 bar2 foo2 bar3 bar4 foo2 bar5 baz
| 
| For a more complicated example, consider a mini-language that reads in
| a mixed list of words and simple directives.  This time, the output
| first (spacing has been manually adjusted for clarity):
| 
|   orig: one=>alpha two one   three one   two=>beta one two  three
|   done: alpha      two alpha three alpha beta      one beta three
| 
| And here's the engine behind it:
| 
|   my $map_re = qr/ ( \w+ ) => ( \w+ ) /x;
|   my $last_re = $map_re;
|   my ( $src, $dest );
|   sub selective_map
|   {
|       my ( $chunk ) = @_;
|       my $out;
|       if ( ! defined $chunk )
|       {
|           $out = '';
|       }
|       elsif ( defined $src && $chunk eq $src )
|       {
|           $out = $dest;
|       }
|       elsif ( $chunk =~ /^$map_re$/ )
|       {
|           ( $src, $dest ) = ( $1, $2 );
|           $last_re = qr/ $map_re | $src /x;
|           $out = $dest;
|       }
|       return ( $out, $last_re );
|   }
| 
| =back
| 
| =cut
| 
| sub recomp_replace
| {
|     my ( $orig, $block ) = @_;
| 
|     my $done = '';
|     my ( $chunk, $re ) = $block->();
|     # print "re='$re'\n";
|     my $pos = 0;
|     while ( $orig =~ m! \G ( .*? ) ( $re ) !gcx )
|     {
|         $done .= $1;
|         ( $chunk, $re ) = $block->( $2 );
|         # print "chunk='$chunk', re='$re'\n";
|         $done .= $chunk;
|         $pos = $+[0];
|     }
|     $done .= substr $orig, $pos;
|     return $done;
| }
| 
| # first example
| {
|     my $orig = "foo1 foo2 foo2 foo3 foo4 foo2 foo5 baz";
|     print "orig: $orig\n";
| 
|     my $n = 0;
|     sub numbered_foo
|     {
|         my ( $chunk ) = @_;
|         return ( "bar" . $n++, "foo$n" );
|     }
| 
|     my $done = recomp_replace $orig, \&numbered_foo;
|     print "done: $done\n";
| }
| 
| # second example
| {
|     my $orig = "one=>alpha two one three one two=>beta one two three";
|     print "orig: $orig\n";
| 
|     my $map_re = qr/ ( \w+ ) => ( \w+ ) /x;
|     my $last_re = $map_re;
|     my ( $src, $dest );
|     sub selective_map
|     {
|         my ( $chunk ) = @_;
|         my $out;
|         if ( ! defined $chunk )
|         {
|             $out = '';
|         }
|         elsif ( defined $src && $chunk eq $src )
|         {
|             $out = $dest;
|         }
|         elsif ( $chunk =~ /^$map_re$/ )
|         {
|             ( $src, $dest ) = ( $1, $2 );
|             $last_re = qr/ $map_re | $src /x;
|             $out = $dest;
|         }
|         return ( $out, $last_re );
|     }
| 
|     my $done = recomp_replace $orig, \&selective_map;
|     print "done: $done\n";
| }
| 
| exit 0;

(Online at:  http://scrye.com/~tkil/perl/recomp-replace.plx )

Hm.  Just realized that the second version will prolly explode if the
regex returned from the callback ever matches zero characters (since
my "get stuff before the RE" is non-greedy...)  Oh well.  Don't do
that, then.

"Share and Enjoy",
t.