LPM: File reading performance
llang at baywestpaper.com
Wed Jan 19 07:38:50 CST 2000
Avoid the joining step altogether. Slurp it.
undef $/;
$whole=<FILE>;
(Explained in much more detail in the perlvar man page...)
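For example, wrapped in a block so the change to $/ doesn't leak into the
rest of the program (untested sketch; $fn stands in for your filename):

my $whole;
{
    local $/;                 # no record separator...
    open FILE, $fn or die "Can't open $fn: $!";
    $whole = <FILE>;          # ...so <FILE> returns the entire file in one go
    close FILE;
}

Dropped into your timing harness below, that replaces both the while loop
and the join.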
Loren Lang Phone: 606-734-0538 x326
Network Administrator Fax: 606-734-8210
Bay West Paper Corporation email: llang at baywestpaper.com
"There is no greater burden than great potential." - Charles Schultz
I got impatient with the time required to read in a big (7 Mb) file into a
scalar variable, so I did a little benchmarking on my code. (I'm exposing
myself before my peers, so be gentle...)
My code started out as:
while (<FILE>) {
    $whole .= $_;
}
When I started this project, I was reading 7 Kb files, and didn't have a
care in the world. (I'm sure the Perl wizards out there can see where this
is going.)
Obviously, that's a lot of string manipulation. I klutzed around with join
and came up with the following. (Okay...I just admitted it...I've never
used join before. There...it's out.)
@content = <FILE>;
$whole = join ("", @content);
I figured it would probably be faster, with fewer string manipulations, but
maybe not because of the array manipulations. Before I benchmarked the
two, I assumed the file access would be the same for either, and that would
dwarf any memory-based processing...thus making it irrelevant which
approach I used.
Wrong! The concatenation method took around 90 seconds over several
trials. The join method clocked in at 5-6 seconds a pop. That's a *huge*
difference!
I am guessing that a 20-fold performance boost could only come from file
access issues. I suspect that @content = <FILE> actually reads the whole
file in one read (or perhaps a few reads with a large buffer), whereas
while (<FILE>) issues a separate read to the OS for each of the 40489
lines of the file.
I'm also assuming file caching is not coming into play, as I repeated
several trials in different orders with consistent results.
Is there a "conventional" method to slurp an entire file into a string? It
sure would be nice to get another 20-fold performance increase! (greedy, I
know...) I also realize I'm doubling the amount of memory needed to hold
my file in memory. I guess I could make @content a my variable within a
small block. This would release it as soon as the block ended, right?
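Something like this is what I have in mind (sketch only; assumes FILE is
already open):

my $whole;
{
    my @content = <FILE>;
    $whole = join "", @content;
}   # @content goes out of scope here

I gather the memory isn't necessarily handed back to the OS at that point,
but at least Perl can recycle it.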
...later on...
I actually need to get this file read in as a single line, with \n's
removed. I compared:
chomp @content;
$whole = join ("", @content);
with:
$whole = join ("", @content);
$whole =~ s/\n//g;
With a resolution of one second, I could not determine if either is faster
than the other. Given that either takes less than a second to handle 40489
lines, I guess it doesn't matter too much. I chose the chomp method in the
end, just because it seems more hygienic.
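If I wanted finer resolution than one second, I suppose the standard
Benchmark module would settle it. Something like this (untested; assumes
@content already holds the lines, and 100 is an arbitrary iteration count):

use Benchmark qw(timethese);

timethese(100, {
    chomp_join => sub {
        my @c = @content;        # copy, since chomp modifies in place
        chomp @c;
        my $w = join "", @c;
    },
    join_regex => sub {
        my $w = join "", @content;
        $w =~ s/\n//g;
    },
});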
Any pros care to comment on any of my methods or assumptions?
-dave
ps. FWIW, below is my home-brew benchmarking code...nothing to brag about,
I know. I tried running the two loops in the same program, but the latter
of the two benefited from caching, so I had to split them up into two
different programs.
$fn = shift;
$whole = "";
$start= time();
open FILE, $fn or die "Can't open $fn.";
while (<FILE>) {
    $whole .= $_;
}
close FILE;
print "length of file is ", length($whole), " bytes.\n";
print "Took " . (time - $start) . " seconds to read file appending to string.\n";
$fn = shift;
$start= time();
open FILE, $fn or die "Can't open $fn.";
@content = <FILE>;
close FILE;
$whole = join ("", @content);
print scalar(@content), " Lines\n";
print "length of file is ", length($whole), " bytes.\n";
print "Took " . (time - $start) . " seconds to read file into array and
join. \n";
--
David Hempy
Internet Database Administrator
Kentucky Educational Television