LPM: File reading performance

llang at baywestpaper.com
Wed Jan 19 07:38:50 CST 2000


Avoid the joining step altogether.  Slurp it.

          undef $/;           # undefine the input record separator
          $whole = <FILE>;    # <FILE> now returns the entire file at once

(Explained in much more detail in the perlvar man page...)
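
A variation, not from the reply itself: localizing $/ inside a bare
block confines the slurp behavior to that block, so line-oriented
reads elsewhere in the program are unaffected.

          {
                    local $/;           # record separator is undef only inside this block
                    $whole = <FILE>;    # slurp the entire file
          }                             # $/ is restored automatically here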


Loren Lang                    Phone: 606-734-0538 x326
Network Administrator         Fax:   606-734-8210
Bay West Paper Corporation    Email: llang at baywestpaper.com

"There is no greater burden than great potential." - Charles Schultz







I got impatient with the time required to read a big (7 MB) file into a
scalar variable, so I did a little benchmarking on my code.  (I'm exposing
myself before my peers, so be gentle...)

My code started out as:

           while (<FILE>) {
                     $whole .= $_;      # append each line to the growing string
           }

When I started this project, I was reading 7 KB files, and didn't have a
care in the world.  (I'm sure the Perl wizards out there can see where this
is going.)

Obviously, that's a lot of string manipulation.  I klutzed around with join
and came up with the following.  (Okay...I just admitted it...I've never
used join before.  There...it's out.)


           @content = <FILE>;
           $whole = join ("", @content);

I figured it would probably be faster, with fewer string manipulations, but
maybe not, because of the array manipulations.  Before I benchmarked the
two, I assumed the file access would be the same for either, and that it
would dwarf any memory-based processing...thus making it irrelevant which
approach I used.

Wrong!  The concatenation method took around 90 seconds over several
trials.  The join method clocked in at 5-6 seconds a pop.  That's a *huge*
difference!

I am guessing that a 20-fold performance boost could only come from file
access issues.  I suspect that @content = <FILE> actually reads the whole
file in one read (or perhaps a few reads with a large buffer), whereas
while (<FILE>) issues a separate read to the OS for each of the 40489
lines of the file.
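
One way to probe the single-read theory (a sketch, not part of the
original post; $fn is the filename variable from the benchmark code
below) is to bypass line-oriented I/O entirely with sysread:

           open FILE, $fn or die "Can't open $fn: $!";
           my $size = -s FILE;                       # file size in bytes
           my $got  = sysread(FILE, $whole, $size);  # a single unbuffered read
           die "read failed: $!" unless defined $got;
           die "short read ($got of $size bytes)" unless $got == $size;
           close FILE;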

I'm also assuming file caching is not coming into play, as I repeated
several trials in different orders with consistent results.

Is there a "conventional" method to slurp an entire file into a string?  It
sure would be nice to get another 20-fold performance increase!  (greedy, I
know...)  I also realize I'm doubling the amount of memory needed to hold
my file in memory.  I guess I could make @content a my variable within a
small block.  This would release it as soon as the block ended, right?
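
For what it's worth, the block-scoped version described above might
look like this (a sketch; note that Perl generally keeps the freed
memory in its own pool for reuse rather than handing it back to the
OS):

           {
                     my @content = <FILE>;           # lexical; released at block exit
                     $whole = join("", @content);
           }
           # @content is out of scope here, so its memory can be reused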


...later on...

I actually need to get this file read in as a single line, with \n's
removed.  I compared:

           chomp @content;
           $whole = join ("", @content);

with:

           $whole = join ("", @content);
           $whole =~ s/\n//g;

With a resolution of one second, I could not determine whether either was
faster than the other.  Given that either takes less than a second to
handle 40489 lines, I guess it doesn't matter too much.  I chose the chomp
method in the end, just because it seems more hygienic.
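
The standard Benchmark module (not used in the post) can resolve a
sub-second difference like this one; a sketch, assuming @content
already holds the file's lines:

           use Benchmark qw(timethese);

           timethese(20, {
                     chomp_join => sub { my @c = @content;            # copy, so chomp
                                         chomp @c;                    # doesn't destroy
                                         my $w = join("", @c); },     # the original
                     join_subst => sub { my $w = join("", @content);
                                         $w =~ s/\n//g; },
           });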


Any pros care to comment on any of my methods or assumptions?

-dave


ps.  FWIW, below is my home-brew benchmarking code...nothing to brag about,
I know.  I tried running the two loops in the same program, but the latter
of the two benefited from caching, so I had to split them up into two
different programs.




$fn = shift;
$whole = "";
$start = time();
open FILE, $fn or die "Can't open $fn: $!";
while (<FILE>) {
           $whole .= $_;
}
close FILE;
print "length of file is ", length($whole), " bytes.\n";
print "Took " . (time - $start) . " seconds to read file appending to string.\n";



$fn = shift;
$start = time();
open FILE, $fn or die "Can't open $fn: $!";
@content = <FILE>;
close FILE;
$whole = join("", @content);
print scalar(@content), " Lines\n";
print "length of file is ", length($whole), " bytes.\n";
print "Took " . (time - $start) . " seconds to read file into array and join.\n";






--
David Hempy
Internet Database Administrator
Kentucky Educational Television









