[Chicago-talk] @ARGV while(<>)

Steven Lembark lembark at wrkhors.com
Sun Jan 13 16:12:15 PST 2008


> You're doing repeated concatenation onto a string, which gets bigger
each time through the loop.
> This might mean a lot of realloc's and copying of the string in
memory.  (I may be wrong, I haven't
> studied perl internals).
>
> Try putting the lines into an array and then join'ing them at the very
end.
>
> while (<>)
> {
>      push (@lines, $_) unless (/espf\[/);
> }
> my $body = join("", @lines);

Perl strings are C buffers with a start pointer, a current
length, and an allocated size. Eating into the start simply
updates the start pointer; cutting off the end reduces the
length. Growing past the allocated size, though, forces a
re-allocation and possibly a copy.

So, yes, $a .= $b will become expensive.

Catch: The array case is not free either, since you
end up having to re-allocate the array as it grows.
Not quite as much copying, but still likely
proportional to the number of lines in the file.
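For a rough comparison (not from the original post), something
like this with the core Benchmark module will show the relative
cost; the input is faked with 10_000 identical lines:

```perl
#!/usr/bin/perl
# Compare repeated string concatenation against push-then-join.
# The input lines are made up; real files will vary.
use strict;
use warnings;
use Benchmark qw( cmpthese );

my @input = ( "a line of text\n" ) x 10_000;

cmpthese( -1,
{
    concat => sub
    {
        my $body = '';
        $body .= $_ for @input;
    },

    join_up => sub
    {
        my @lines;
        push @lines, $_ for @input;
        my $body = join '', @lines;
    },
});
```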

Ways to get around this include:

- Presize the array and grow it in chunks
  as it progresses. This reduces the
  number of copy operations to a manageable
  level (e.g., 1_000-line chunks):

    my @bufferz = ();

    $#bufferz   = 1_000;

    my $i = 0;

    while( <ARGV> )
    {
        # clean up $_;

        $bufferz[ $i ] = $_;

        if( ++$i > $#bufferz )
        {
            $#bufferz   += 1_000;
        }
    }

    $#bufferz = $i - 1;     # trim the unused tail
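Put together as a self-contained sketch (the @input array stands
in for <ARGV>, and the chunk size is shrunk to 4 so the growth
path actually runs):

```perl
#!/usr/bin/perl
# Chunked pre-extension of an array, then a single join at the end.
# CHUNK and the fake input are illustrative, not tuned values.
use strict;
use warnings;
use constant CHUNK => 4;

my @input = map { "line $_\n" } 1 .. 10;

my @bufferz;
$#bufferz = CHUNK;              # pre-extend the array

my $i = 0;

for ( @input )
{
    $bufferz[ $i ] = $_;

    if( ++$i > $#bufferz )
    {
        $#bufferz += CHUNK;     # grow in chunks, not per-line
    }
}

$#bufferz = $i - 1;             # trim the unused tail

my $body = join '', @bufferz;
print $body;
```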

- Write the stuff out as it is processed and
  read it back with a single array/slurp read:

    open my $tmpfile, '+>', "/var/tmp/$$.tmp"
    or die ...;

    while( <ARGV> )
    {
        ...

        print $tmpfile $_;
    }

    seek $tmpfile, 0, 0;

    my @linz    = <$tmpfile>;

    ...

    # or my $linz = do { local $/; <$tmpfile> }
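A variation on that approach (my sketch, not from the post) uses
the core File::Temp module instead of a hand-rolled /var/tmp
path, so the temp file is unlinked automatically; the @input
array here stands in for the <ARGV> loop:

```perl
#!/usr/bin/perl
# Write filtered lines to an anonymous temp file, then slurp the
# whole thing back in one read. Input lines are made up.
use strict;
use warnings;
use File::Temp;

my @input =
(
    "keep this line\n",
    "espf[ drop this one\n",
    "keep this one too\n",
);

# anonymous temp file, removed when $tmpfile goes out of scope
my $tmpfile = File::Temp->new;

for ( @input )
{
    next if /espf\[/;           # same filter as the original loop

    print {$tmpfile} $_;
}

seek $tmpfile, 0, 0;

# single slurp read of the whole file
my $body = do { local $/; <$tmpfile> };

print $body;
```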

-- 
Steven Lembark                                            85-09 90th St.
Workhorse Computing                                 Woodhaven, NY, 11421
lembark at wrkhors.com                                      +1 888 359 3508

