LPM: Sorting Big Files

Mik Firestone fireston at lexmark.com
Wed Dec 15 10:27:02 CST 1999


How about doing it this way:
  1. Scan the file once, creating a hash keyed by the byte offset of the
     record in the file and the values would be the country code.
  2. Sort the hash first by country code and then location in the file.
  3. Use seeks to access each record in sorted order - it will come out by
     country code, and the secondary sort should make reading faster as you
     will not have to backtrack through the file.

In quick code fashion, it would look something like this:

%index = ();
open FILE, "file";
$/ = \1300;       # Works for perl 5.005 or better
while ( $rec = <FILE> ) {
    if ( $rec =~ /some clever regex to extract and save the (country_code)/ ) {
      $index{ tell( <FILE> ) } = $1;
    }
    else {
      $index{ tell( <FILE> ) } = 'zzzz';  # Force these to be last
    }
}

unless ( seek(FILE,0,0) ) {
    die "Couldn't rewind the file";
}

for $idx ( sort { $index{$a} cmp $index{$b} ||
		  $a <=> $b } keys %index ) {
    unless ( seek( FILE,$idx,0 ) ) {
	die "Couldn't seek to position $idx";
    }
    $record = <FILE>;

    # do whatever to record and then print it out
}

Note: when I set the input record seperator ( $/ = \1300 ), I am telling the
    read operator to read 1300 bytes.  This is a new feature of perl 5.005.
    If you are running earlier versions of perl, you will need to use sysread
    instead.
    
    This code is *untested* and is offered as is, I am not responsible if it
    causes global warming or anything else, blah blah blah blah blah.

HTH,
Mik

-- 
Mik Firestone fireston at lexmark.com
When I become an Evil Overlord:
My Legions of Terror will be trained in basic marksmanship. Any who cannot
learn to hit a man-sized target at 10 meters will be used for target practice.




More information about the Lexington-pm mailing list