LPM: Sorting Big Files
Mik Firestone
fireston at lexmark.com
Wed Dec 15 10:27:02 CST 1999
How about doing it this way:
1. Scan the file once, creating a hash keyed by the byte offset of the
record in the file and the values would be the country code.
2. Sort the hash first by country code and then location in the file.
3. Use seeks to access each record in sorted order - it will come out by
country code, and the secondary sort should make reading faster as you
will not have to backtrack through the file.
In quick code fashion, it would look something like this:
%index = ();
open FILE, "file";
$/ = \1300; # Works for perl 5.005 or better
while ( $rec = <FILE> ) {
if ( $rec =~ /some clever regex to extract and save the (country_code)/ ) {
$index{ tell( <FILE> ) } = $1;
}
else {
$index{ tell( <FILE> ) } = 'zzzz'; # Force these to be last
}
}
unless ( seek(FILE,0,0) ) {
die "Couldn't rewind the file";
}
for $idx ( sort { $index{$a} cmp $index{$b} ||
$a <=> $b } keys %index ) {
unless ( seek( FILE,$idx,0 ) ) {
die "Couldn't seek to position $idx";
}
$record = <FILE>;
# do whatever to record and then print it out
}
Note: when I set the input record seperator ( $/ = \1300 ), I am telling the
read operator to read 1300 bytes. This is a new feature of perl 5.005.
If you are running earlier versions of perl, you will need to use sysread
instead.
This code is *untested* and is offered as is, I am not responsible if it
causes global warming or anything else, blah blah blah blah blah.
HTH,
Mik
--
Mik Firestone fireston at lexmark.com
When I become an Evil Overlord:
My Legions of Terror will be trained in basic marksmanship. Any who cannot
learn to hit a man-sized target at 10 meters will be used for target practice.
More information about the Lexington-pm
mailing list