SPUG: using -l, benchmarking perl, $1 overhead, etc.

Sun Feb 22 01:03:00 CST 2004

On Sat, Feb 21, 2004 at 02:45:45PM -0800, "Aaron W. West" <tallpeak at hotmail.com> wrote:
> Is there a perl variable to control "-l"? It just seems unnatural to rely on
> #perl -l to set the option.

I almost always use -l in one-liners (for convenience), and almost
never otherwise.  -l does two things; one is adding an implicit chomp
to the line reading loop created by -n or -p, for which there is no
variable substitute.  The other is setting $\, which you can do
manually as well.  It's setting $\ that is likely to mess up any
module/function that prints and doesn't expect $\ to be set.
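To make the two halves of -l concrete, here is a minimal sketch of doing each by hand. The in-memory filehandle and the variable names ($out, $fh, @lines) are only there for illustration; they are not part of anything -l itself does.

```perl
use strict;
use warnings;

# Collect everything printed into a string through an in-memory filehandle.
my $out = '';
open my $fh, '>', \$out or die "cannot open in-memory handle: $!";

{
    local $\ = "\n";    # the output half of -l: print appends $\
    print $fh "a";
    print $fh "b";
}
print $fh "c";          # $\ is undef again here, so nothing is appended
close $fh;

# The input half of -l, done by hand: strip the line ending after reading.
my @lines = ("one\n", "two\n");
chomp @lines;

print $out, "\n";
print "@lines\n";
```

Because the assignment is wrapped in local, code called after the block sees $\ unset again, which is what a module that prints would normally expect.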

Sometimes -l will persist in a script that started out as a one-liner,
but by the time it's grown to a couple dozen lines I usually end up
ripping it out.

> I suppose the solution is the use of local for $/ or $\ in those
> modules/functions that need it, and designing all modules to be careful not
> to be sensitive to the -l setting. I'll admit, it gets mildly tedious trying
> to remember to put "\n" at the end of every "print", but what if I want
> somewhere to print without a "\n". I know I can use local in a block:
> 
> $ perl -le '{local ($\)=""; print "a"; print 1;} print 2; print 3'
> 
> a12
> 
> 3
> 
> hmm...
> 
> Maybe I'm just an ol' C programmer used to doing things the hard way, and
> maybe after forgetting chomp or \n a few more times I'll reform and decide
> he's right...

Well, since you are an ol' C programmer, I will note that switching to
printf instead of print causes $\ not to be automatically used:

$ perl -le 'printf "a"; printf "1"; print 2; print 3'
a12
3

> Globals such as $/ always have seemed "wrong" to me. The language "should"
> tie such attributes to the filehandles, or an object. eg:
> 
> open IN,"<myfile";
> 
> IN->record_separator="\n"; # or something like that

Not sure why $\ wasn't made per filehandle (like $., $%, $=, $^, $|),
but it's too late to change it now.  It would have made sense to me.

> How could anyone write a safe multithreaded app in Perl, with global
> variables in different states in different parts of the application? Of
> course, Perl wasn't really designed for multithreading, initially. But it
> supports threading, in recent versions. I imagine perl blocks other threads
> in many situations where globals are used (putting a mutex lock around the
> block or a portion of it), or creates thread-local versions of those
> variables ($1, $2, etc), and saves/restores state when switching threads.

With 5005threads, globals were shared between threads; with ithreads,
everything is thread-local.

> (Okay, everyone can stop here and ignore my ramblings about performance...)
> 
> Perl sure has a lot of ways to do things, eg: (parsing dates):
> 
> 1)
> 
> $purchase_stamp =~ /(\d{4})[^\d]*(\d\d)[^\d]*(\d\d)/ || $error=1;
> 
> $purchase_int = $1 . $2 . $3;
> 
> 2)
> 
> #($pyr,$pmo,$pda)=unpack("A4xA2xA2",$purchase_stamp);
> 
> #$purchase_int="$pyr$pmo$pda";
> 
> 3)
> 
> #$purchase_int = substr($purchase_stamp,0,4) .
> 
> # substr($purchase_stamp,5,2) . substr($purchase_stamp,8,2);
> 
> I decided on (1) the regex approach (slowest), since it would work on dates
> which lack delimiters, should we happen across any.
> 
> Timing shows substr is faster than unpack which is faster than regex.

Those aren't really equivalent.  Your regex is looking for a matching
substring, not checking the whole string.  Try:
/^(\d{4})\D*(\d\d)\D*(\d\d)\z/ (note the ^ and \z anchors, and \D,
which is equivalent to [^\d]).  Also, the regex is the one to use if
you need error checking.
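A small sketch of the difference the anchors make. The test strings and the names $loose, $strict, and %verdict are made up for illustration:

```perl
use strict;
use warnings;

my $loose  = qr/(\d{4})[^\d]*(\d\d)[^\d]*(\d\d)/;   # finds a match anywhere
my $strict = qr/^(\d{4})\D*(\d\d)\D*(\d\d)\z/;      # must cover the whole string

my %verdict;
for my $stamp ("2004-02-21", "20040221", "x2004-02-21 10:30", "2004-2-21") {
    my $l = ($stamp =~ $loose)  ? 1 : 0;
    my $s = ($stamp =~ $strict) ? 1 : 0;
    $verdict{$stamp} = [$l, $s];
    printf "%-18s loose: %-3s strict: %s\n",
        $stamp, ($l ? "yes" : "no"), ($s ? "yes" : "no");
}
```

The string with leading and trailing junk is accepted by the unanchored pattern and rejected by the anchored one; the other three get the same answer from both.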

> An example timed command:

Use perl's Benchmark module instead:
=======================================
use Benchmark qw/timethese cmpthese/;
use warnings;
use strict;

my $purchase_stamp = "2004-02-21";

my %test;

$test{regex} = sub {
    my $purchase_int;
    if ($purchase_stamp =~ /(\d{4})[^\d]*(\d\d)[^\d]*(\d\d)/) {
        $purchase_int = $1.$2.$3;
    }
    $purchase_int;
};

$test{regex2} = sub {
    my $purchase_int;
    if ($purchase_stamp =~ /^(\d{4})\D*(\d\d)\D*(\d\d)\z/) {
        $purchase_int = $1.$2.$3;
    }
    $purchase_int;
};

$test{transliterate} = sub {
    my $purchase_int;
    ($purchase_int = $purchase_stamp) =~ tr/0-9//cd;
    $purchase_int;
};

$test{unpck} = sub {
    my $purchase_int;
    my ($pyr,$pmo,$pda)=unpack("A4xA2xA2",$purchase_stamp);
    $purchase_int = $pyr.$pmo.$pda; 
    $purchase_int;
};

$test{unpck2} = sub {
    my $purchase_int;
    $purchase_int = join "", unpack("A4xA2xA2",$purchase_stamp);
    $purchase_int;
};

$test{sbstr} = sub {
    my $purchase_int;
    $purchase_int = substr($purchase_stamp,0,4) .
                    substr($purchase_stamp,5,2) .
                    substr($purchase_stamp,8,2);
    $purchase_int;
};

# test for correct results:

for (keys %test) {
    $test{$_}->() == 20040221 or die "bad result for $_\n";
}

my $results = timethese(-5, \%test);
cmpthese($results);
====================================================
My output:
Benchmark: running regex, regex2, sbstr, transliterate, unpck, unpck2 for at least 5 CPU seconds...
     regex:  4 wallclock secs ( 5.16 usr +  0.00 sys =  5.16 CPU) @ 104373.98/s (n=538361)
    regex2:  6 wallclock secs ( 5.01 usr +  0.00 sys =  5.01 CPU) @ 114378.87/s (n=572695)
     sbstr:  6 wallclock secs ( 5.33 usr +  0.00 sys =  5.33 CPU) @ 562793.73/s (n=2998565)
transliterate:  7 wallclock secs ( 6.34 usr +  0.00 sys =  6.34 CPU) @ 987529.03/s (n=6258959)
     unpck:  6 wallclock secs ( 5.19 usr +  0.00 sys =  5.19 CPU) @ 69612.76/s (n=361151)
    unpck2:  6 wallclock secs ( 5.40 usr +  0.00 sys =  5.40 CPU) @ 80283.62/s (n=433371)
                  Rate    unpck   unpck2    regex  regex2    sbstr transliterate
unpck          69613/s       --     -13%     -33%    -39%     -88%          -93%
unpck2         80284/s      15%       --     -23%    -30%     -86%          -92%
regex         104374/s      50%      30%       --     -9%     -81%          -89%
regex2        114379/s      64%      42%      10%      --     -80%          -88%
sbstr         562794/s     708%     601%     439%    392%       --          -43%
transliterate 987529/s    1319%    1130%     846%    763%      75%            --
==================================================

When benchmarking, make sure you are using lexicals and/or global variables
just as your real code would--it can make a significant difference.  There
are also often significant differences in benchmarks between perl versions
or between threaded and non-threaded perls.
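As a rough sketch of how to check the lexical versus global point on your own perl, the same substr code can be timed against each kind of variable. The names $global_stamp and $lexical_stamp are invented for this example, and no particular outcome is claimed; the numbers depend on the perl version and build.

```perl
use strict;
use warnings;
use Benchmark qw/cmpthese/;

our $global_stamp  = "2004-02-21";   # package (global) variable
my  $lexical_stamp = "2004-02-21";   # file-scoped lexical

my %test = (
    global => sub {
        my $r = substr($global_stamp, 0, 4) .
                substr($global_stamp, 5, 2) .
                substr($global_stamp, 8, 2);
        $r;
    },
    lexical => sub {
        my $r = substr($lexical_stamp, 0, 4) .
                substr($lexical_stamp, 5, 2) .
                substr($lexical_stamp, 8, 2);
        $r;
    },
);

# Check both subs agree before timing them.
for my $name (sort keys %test) {
    $test{$name}->() == 20040221 or die "bad result for $name\n";
}

# Run each sub for at least 1 CPU second and print a comparison table.
cmpthese(-1, \%test);
```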

> More interesting results:
> 
> 1.081us $x =~ /[^\d]*(\d{4}).(\d\d).(\d\d)/;
> 
> 1.612us $x =~ /[^\d]*(\d{4}).(\d\d).(\d\d)/;($a,$b,$c)=("2004","12","29");
> 
> 2.613us $x =~ /[^\d]*(\d{4}).(\d\d).(\d\d)/;($a,$b,$c)=($1,$2,$3)
> 
> It seems that retrieving captured matches takes some time. Perhaps each
> retrieval calls substr, but the difference is actually slower than 3
> substr's and assignments (first command), so maybe perl is doing something
> like optimizing out the whole (meaningless) regex. Assignment to a variable
> from any of $1, $2 or $3 takes 0.66 usec, so I imagine. Assignment of
> ($a,$b,$c)=("2004","12","29") takes much less time than
> ($a,$b,$c)=($1,$2,$3), so the actual assignment is relatively time-consuming
> for some reason.

All magic variables have to go through extra code when their values are read.
They also need to do an extra copy of the value (and may have to allocate a
buffer to put it in).
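If reading $1, $2, $3 turns out to matter, one alternative worth trying is taking the captures straight from the match in list context. This is only a sketch with made-up variable names, and whether it is actually faster on a given perl is something to measure, not assume.

```perl
use strict;
use warnings;

my $x = "2004-12-29";

# Reading the magic variables after a successful match.
my ($yr1, $mo1, $da1);
if ($x =~ /^(\d{4})\D*(\d\d)\D*(\d\d)\z/) {
    ($yr1, $mo1, $da1) = ($1, $2, $3);
}

# A match in list context returns the captures directly, so $1, $2, $3
# are never read; on failure the list is empty and all three stay undef.
my ($yr2, $mo2, $da2) = $x =~ /^(\d{4})\D*(\d\d)\D*(\d\d)\z/;

print "$yr1$mo1$da1\n";
print "$yr2$mo2$da2\n";
```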

From reading through your post, I get the sense you would be a prime
candidate for joining http://perlmonks.org