LPM: Sorting Big Files
Joe Hourcle
oneiros at dcr.net
Wed Dec 15 09:50:42 CST 1999
On Wed, 15 Dec 1999, Rich Bowen wrote:
> Janine Ladick wrote:
> >
> > Hello, List!
> >
> > Does anyone have a smooth way of sorting large text files? (By
> > "large" I mean "in excess of 1GB" and by "text files" I mean "flat
> > ASCII database files.")
>
> Step 1: Get 2GB of RAM
> Step 2: Read the file into an array and call sort() on it.
> Step 3: Go read a good book
>
> OK, all kidding aside ...
>
> Uri Guttman and Larry Rosler presented a paper at the Perl Conference
> about sorting. There is nothing in that paper about huge text files, but
> they might have some insight into this.
>
> I know that there are sorts that don't require that you have the whole
> data set in memory at once. I suspect that they are slower than (insert
> favorite slow thing), but I guess that sorting a gig of data is going to
> be slow no matter what.
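Those out-of-core sorts are usually external merge sorts: sort chunks that
fit in memory, then merge the sorted chunks a line at a time. As a rough
sketch (the file names here are made up), the merge step only ever holds one
line from each input in memory:

```perl
use strict;

# Merge two already-sorted text files into one sorted output,
# holding only one line from each input in memory at a time.
sub merge_sorted {
    my ($file_a, $file_b, $out_file) = @_;
    open(my $fh_a, '<', $file_a)    or die "can't open $file_a: $!";
    open(my $fh_b, '<', $file_b)    or die "can't open $file_b: $!";
    open(my $out,  '>', $out_file)  or die "can't open $out_file: $!";

    my $line_a = <$fh_a>;
    my $line_b = <$fh_b>;
    while (defined $line_a && defined $line_b) {
        if ($line_a le $line_b) {    # plain string comparison
            print $out $line_a;
            $line_a = <$fh_a>;
        } else {
            print $out $line_b;
            $line_b = <$fh_b>;
        }
    }
    # one file is exhausted; drain whatever is left in the other
    while (defined $line_a) { print $out $line_a; $line_a = <$fh_a>; }
    while (defined $line_b) { print $out $line_b; $line_b = <$fh_b>; }
    close $fh_a; close $fh_b; close $out;
}
```

Repeat the merge pairwise until one file remains. The Unix sort(1) command
does essentially this on temporary files when the input won't fit in memory,
so it may be worth trying before writing any Perl at all.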
>
> I think that my first step would be to run through the data file and
> insert each entry into a real database, and then select it back out of
> the database in sorted order. My suspicion is that that would be the
> most efficient thing to do, even if you don't use the database for
> storing the data afterward. A million records is not very many when
> you're talking about a real database, but it's a boatload when you're
> just trying to deal with text files.
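To make that concrete, here's a rough sketch of the load-then-select idea
using DBI. The SQLite driver, the pipe-delimited layout, and all the file,
table, and column names are my own assumptions -- the real schema would come
from whatever the flat file actually looks like:

```perl
use strict;
use DBI;   # assumes the DBD::SQLite driver is installed

# Load a pipe-delimited flat file into a throwaway database table,
# then write it back out ordered by the first field.
sub sort_via_db {
    my ($in_file, $out_file) = @_;
    my $dbh = DBI->connect('dbi:SQLite:dbname=scratch.db', '', '',
                           { RaiseError => 1 });
    $dbh->do('DROP TABLE IF EXISTS records');
    $dbh->do('CREATE TABLE records (country TEXT, line TEXT)');

    my $ins = $dbh->prepare('INSERT INTO records VALUES (?, ?)');
    open(my $in, '<', $in_file) or die "can't open $in_file: $!";
    while (my $line = <$in>) {
        my ($country) = split /\|/, $line;
        $ins->execute($country, $line);
    }
    close $in;

    # the database does the sorting; we just stream the rows back out
    my $sth = $dbh->prepare('SELECT line FROM records ORDER BY country');
    $sth->execute;
    open(my $out, '>', $out_file) or die "can't open $out_file: $!";
    while (my ($line) = $sth->fetchrow_array) {
        print $out $line;
    }
    close $out;
    $dbh->disconnect;
}
```

The nice part is that the select streams rows back one at a time, so the
sorted result never has to fit in memory on the Perl side either.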
Well, I've never done this before, but it seems to me that if you're
only sorting by country, and no other field, then you only need enough
memory to hold two copies of the list of countries at once...
If I had more time, I'd write up the code, but here's the procedure I
would try:
    make a list of all of the countries in the file
    sort the list of countries
    for each country in the list:
        copy the matching records from the input file to the output file
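A quick sketch of those steps in Perl, assuming pipe-delimited records with
the country in the first field (the actual record layout wasn't given):

```perl
use strict;

# Two-pass sort on one field: remember only the distinct countries,
# then copy the matching records out for each country in turn.
sub sort_by_country {
    my ($in_file, $out_file) = @_;

    # pass 1: collect the list of countries (the only thing kept in RAM)
    my %seen;
    open(my $in, '<', $in_file) or die "can't open $in_file: $!";
    while (my $line = <$in>) {
        my ($country) = split /\|/, $line;
        $seen{$country}++;
    }
    close $in;

    # pass 2: one sweep over the input file per country
    open(my $out, '>', $out_file) or die "can't open $out_file: $!";
    for my $country (sort keys %seen) {
        open(my $again, '<', $in_file) or die "can't reopen $in_file: $!";
        while (my $line = <$again>) {
            my ($c) = split /\|/, $line;
            print $out $line if $c eq $country;
        }
        close $again;
    }
    close $out;
}
```

Note the tradeoff: the original order within each country is preserved, but
the input gets re-read once per distinct country, so this only wins when the
number of countries is small compared to the number of records.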
Of course, this will require twice the size of the file in drive space, but
HD space is cheaper than memory these days.
-Joe
More information about the Lexington-pm
mailing list