LPM: File reading performance

David Hempy hempy at ket.org
Wed Jan 19 01:20:32 CST 2000

I got impatient with the time required to read a big (7 MB) file into a
scalar variable, so I did a little benchmarking on my code.  (I'm exposing
myself before my peers, so be gentle...)

My code started out as:

	while (<FILE>) {
		$whole .= $_;
	}

When I started this project, I was reading 7 KB files and didn't have a
care in the world.  (I'm sure the Perl wizards out there can see where this
is going.)

Obviously, that's a lot of string manipulation.  I klutzed around with join
and came up with the following.  (Okay...I just admitted it...I've never
used join before.  There...it's out.)

	@content = <FILE>;
	$whole = join ("", @content);

I figured it would probably be faster, with fewer string manipulations, but
maybe not because of the array manipulations.  Before I benchmarked the
two, I assumed the file access would be the same for either, and that it
would dwarf any memory-based processing...thus making it irrelevant which
approach I used.

Wrong!  The concatenation method took around 90 seconds over several
trials.  The join method clocked in at 5-6 seconds a pop.  That's a *huge*
difference!

I am guessing that a 20-fold performance boost could only come from file
access issues.  I suspect that @content = <FILE> actually reads the whole
file in one read (or perhaps a few reads with a large buffer), whereas
while (<FILE>) is issuing a separate read to the OS for each of the 40489
lines of the file.
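
One way to probe the guess above would be to time a single explicit read of
the whole file and compare it against both loops.  The following is only a
minimal sketch: read_in_one_call is a hypothetical helper name (not
something from the code in this post), and Perl's own I/O buffering may
still split the request into several OS-level reads, so this shows the
Perl-side call and nothing more.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): fetch a whole file
# with one explicit read() call sized from the file's length.
sub read_in_one_call {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Can't open $path: $!";

    # -s gives the file size in bytes; fall back to 0 for an empty file.
    my $size = (-s $fh) || 0;

    my $buffer = '';
    # read() returns the number of characters read, 0 at end-of-file,
    # or undef on error.
    my $got = read($fh, $buffer, $size);
    defined $got or die "Can't read $path: $!";

    close($fh);
    return $buffer;
}
```

If this takes about as long as the join version, the time in the slow
version is probably going somewhere other than the reads themselves.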

I'm also assuming file caching is not coming into play, as I repeated
several trials in different orders with consistent results.

Is there a "conventional" method to slurp an entire file into a string?  It
sure would be nice to get another 20-fold performance increase!  (greedy, I
know...)  I also realize I'm doubling the amount of memory needed to hold
my file in memory.  I guess I could make @content a my variable within a
small block.  This would release it as soon as the block ended, right?
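
For reference, a commonly used idiom for this is to undefine the input
record separator $/ so that a single <$fh> returns the entire file.  A
minimal sketch, assuming a hypothetical helper name slurp_file and a
lexical filehandle rather than the bareword FILE used elsewhere in this
post:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): read an entire file
# into one scalar with a single readline call.
sub slurp_file {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Can't open $path: $!";

    # With $/ undefined, <$fh> returns everything up to end-of-file as
    # one string.  local restores the old value when the sub returns.
    local $/;
    my $content = <$fh>;

    close($fh);
    return defined $content ? $content : '';
}
```

Since no intermediate array is built, the doubling of memory does not
arise here.  As for the my question: a lexical does go out of scope at the
closing brace of its block, though as far as I know Perl tends to keep that
memory for its own reuse rather than hand it back to the OS, so any savings
are worth measuring rather than assuming.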

...later on...

I actually need to get this file read in as a single line, with \n's
removed.  I compared:

	chomp @content;
	$whole = join ("", @content);

versus

	$whole = join ("", @content);
	$whole =~ s/\n//g;

With a resolution of one second, I could not determine if either is faster
than the other.  Given that either takes less than a second to handle 40489
lines, I guess it doesn't matter too much.  I chose the chomp method in the
end, just because it seems more hygienic.
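
To get below one-second resolution, the standard Benchmark module can run
each variant many times and report relative rates.  A minimal sketch under
stated assumptions: the line data is synthetic (only the count of 40489
comes from this post), the sub names are made up for illustration, and the
chomp variant works on a copy of the array because chomp modifies its
argument in place, which adds some overhead of its own.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Synthetic stand-in for the file's lines; the content is invented,
# only the line count matches the post.
my @lines = map { "line $_ of some text\n" } 1 .. 40489;

# Variant 1: strip newlines from a copy of the array, then join.
sub chomp_then_join {
    my @copy = @lines;    # copy so repeated runs see unchomped lines
    chomp @copy;
    return join('', @copy);
}

# Variant 2: join first, then delete newlines from the big string.
sub join_then_subst {
    my $whole = join('', @lines);
    $whole =~ s/\n//g;
    return $whole;
}

# Run each variant 50 times and print a comparison table.
cmpthese(50, {
    chomp_then_join => \&chomp_then_join,
    join_then_subst => \&join_then_subst,
});
```

The table lists each variant's rate and the percentage difference between
them; the numbers will vary with the machine and the Perl version.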

Any pros care to comment on any of my methods or assumptions?


ps.  FWIW, below is my home-brew benchmarking code...nothing to brag about,
I know.  I tried running the two loops in the same program, but the latter
of the two benefited from caching, so I had to split them up into two
different programs.

# Benchmark 1: build the string by appending one line at a time.
$fn = shift;
$whole = "";
$start = time();
open FILE, $fn or die "Can't open $fn: $!";
while (<FILE>) {
	$whole .= $_;
}
close FILE;
print "length of file is ", length($whole), " bytes.\n";
print "Took " . (time() - $start) . " seconds to read file appending to string.\n";

# Benchmark 2: read all lines into an array, then join them.
$fn = shift;
$start = time();
open FILE, $fn or die "Can't open $fn: $!";
@content = <FILE>;
close FILE;
$whole = join ("", @content);
print scalar(@content), " Lines\n";
print "length of file is ", length($whole), " bytes.\n";
print "Took " . (time() - $start) . " seconds to read file into array and join.\n";

David Hempy
Internet Database Administrator
Kentucky Educational Television
