LPM: Sorting Big Files

Joe Hourcle oneiros at dcr.net
Wed Dec 15 09:50:42 CST 1999



On Wed, 15 Dec 1999, Rich Bowen wrote:

> Janine Ladick wrote:
> > 
> > Hello, List!
> > 
> > Does anyone have a smooth way of sorting large text files?  (By
> > "large" I mean "in excess of 1GB" and by "text files" I mean "flat
> > ASCII database files.")
> 
> Step 1: Get 2GB of RAM
> Step 2: Read the file into an array and call sort() on it.
> Step 3: Go read a good book
> 
> OK, all kidding aside ...
> 
> Uri Guttman and Larry Rosler presented a paper at the Perl Conference
> about sorting. There is nothing in that paper about huge text files, but
> they might have some insight into this.
> 
> I know that there are sorts that don't require that you have the whole
> data set in memory at a time. I suspect that they are slower than
> (insert favorite slow thing), but I guess that sorting a gig of data is
> going to be slow no matter what.
> 
> I think that my first step would be to run through the data file and
> insert each entry into a real database, and then select it back out of
> the database in sorted order. My suspicion is that that would be the
> most efficient thing to do, even if you don't use the database for storing
> the data. A million records is not very many when talking about a real
> database, but it's a boatload when you're just trying to deal with text
> files.
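
For illustration, a minimal sketch of that load-then-select approach
might look like this, assuming DBI with DBD::SQLite and pipe-delimited
records with the country in the first field (the DSN, table layout, and
split() pattern are all placeholders):

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:SQLite:dbname=records.db', '', '',
                           { RaiseError => 1, AutoCommit => 1 });
    $dbh->do('CREATE TABLE records (country TEXT, line TEXT)');

    # One big transaction makes the bulk insert much faster.
    $dbh->begin_work;
    my $ins = $dbh->prepare('INSERT INTO records VALUES (?, ?)');
    open my $in, '<', 'input.txt' or die "input.txt: $!";
    while (my $line = <$in>) {
        chomp $line;
        my ($country) = split /\|/, $line;   # country assumed to be field 1
        $ins->execute($country, $line);
    }
    close $in;
    $dbh->commit;

    # Let the database do the sorting.
    my $sel = $dbh->prepare('SELECT line FROM records ORDER BY country');
    $sel->execute;
    open my $out, '>', 'sorted.txt' or die "sorted.txt: $!";
    while (my ($line) = $sel->fetchrow_array) {
        print $out "$line\n";
    }
    close $out;
    $dbh->disconnect;

With a gigabyte of rows, adding an index on the country column before
the SELECT should keep the ORDER BY from having to re-sort everything.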

Well, I've never done this before, but it would seem to me that if you're
only sorting by country, and no other field, then you only need enough
memory to hold two copies of the list of countries at once...

If I had more time, I'd write up the code, but here's the procedure I
would try (a rough sketch follows below):

	make a list of all of the countries in the file
	sort the list of countries
	for each country in the list
		copy the matching records from the input file to the output file


Of course, this will require (size of file)x2 in drive space, but HD space
is cheaper than memory these days.
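
A rough sketch of that procedure in Perl, assuming pipe-delimited
records with the country in the first field (the file names and the
split() pattern are placeholders):

    use strict;
    use warnings;

    my ($infile, $outfile) = ('input.txt', 'sorted.txt');

    # Pass 1: make a list of all of the countries in the file.
    my %seen;
    open my $in, '<', $infile or die "$infile: $!";
    while (my $line = <$in>) {
        chomp $line;
        my ($country) = split /\|/, $line;   # country assumed to be field 1
        $seen{$country} = 1;
    }
    close $in;

    # Sort the (much smaller) list of countries.
    my @countries = sort keys %seen;

    # For each country, copy the matching records to the output file.
    open my $out, '>', $outfile or die "$outfile: $!";
    for my $country (@countries) {
        open my $pass, '<', $infile or die "$infile: $!";
        while (my $line = <$pass>) {
            chomp $line;
            my ($c) = split /\|/, $line;
            print $out "$line\n" if $c eq $country;
        }
        close $pass;
    }
    close $out;

Note that this re-reads the input once per distinct country, so it
trades run time for memory; it's only reasonable when the list of
countries is short.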

-Joe



