SPUG: Sorting a big file

Aaron W. West tallpeak at hotmail.com
Fri Aug 27 16:52:21 CDT 2004


At my prior workplace (I won't mention the name, but the stock symbol is
AMZN), the internal FAQ noted that GNU sort is more efficient than Perl's
sort, so it should be used instead for large files. Of course, that's
already been mentioned in this thread, so I'm just replying to add:

List::SkipList ... is recommended for medium-sized sorting tasks. Its
scaling is very close to O(n) in practice: in the benchmark below, each
4x increase in n costs only about 4-5x the elapsed time.

$ perl sklist.pl
n=    15 elapsed:  0.001, factor= 0.00
n=    62 elapsed:  0.001, factor= 1.00
n=   250 elapsed:  0.006, factor= 6.00
n=  1000 elapsed:  0.028, factor= 4.67
n=  4000 elapsed:  0.124, factor= 4.43
n= 16000 elapsed:  0.628, factor= 5.06
n= 64000 elapsed:  2.886, factor= 4.60
n=256000 elapsed: 11.859, factor= 4.11

See http://search.cpan.org/~rrwo/
-->
http://search.cpan.org/~rrwo/List-SkipList-0.73_01/
    02 Aug 2004 ** DEVELOPER RELEASE **
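For the ~142MB file that started this thread, a minimal GNU sort
invocation could look like the sketch below (-S, -T, and -o are standard
GNU sort options; the file names and sizes are made up for illustration):

```shell
# Stand-in for the real ~142MB input file.
printf '14\n12\n16\n13\n' > bigfile.txt

# LC_ALL=C forces plain byte comparison, much faster than locale-aware
# collation, and fine when any repeatable order will do.
# -S caps the in-memory buffer; beyond it, sort spills to temp files.
# -T picks where those temp files go (use a disk with free space).
LC_ALL=C sort -S 64M -T /tmp bigfile.txt -o bigfile.sorted

cat bigfile.sorted   # 12, 13, 14, 16 -- one per line
```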


----- Original Message ----- 
From: "Kurt Buff" <KBuff at zetron.com>
To: "'Dan Ebert'" <mathin at mathin.com>
Cc: <spug-list at mail.pm.org>
Sent: Friday, August 27, 2004 2:33 PM
Subject: RE: SPUG: Sorting a big file


When you do the merge, you compare across the various files.

If, for instance, you break it into 3 files a, b, and c, you'll sort each
file, take a line from a, compare it to the next line from b and c, and
write the correct one to the result file.

lather, rinse, repeat.
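That compare-across-files step is exactly what GNU sort's -m (merge) mode
does; a small sketch, assuming each input file is already sorted:

```shell
# Three already-sorted chunk files, standing in for a, b, c above.
printf '1\n4\n7\n' > a.sorted
printf '2\n5\n8\n' > b.sorted
printf '3\n6\n9\n' > c.sorted

# sort -m only merges: it repeatedly takes the smallest front line
# among the inputs and writes it out, never re-sorting a whole file.
sort -m a.sorted b.sorted c.sorted > result.txt

cat result.txt   # 1 through 9, one per line
```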

-----Original Message-----
From: spug-list-bounces at mail.pm.org
[mailto:spug-list-bounces at mail.pm.org]On Behalf Of Dan Ebert
Sent: Friday, August 27, 2004 14:00
Cc: spug-list at mail.pm.org
Subject: Re: SPUG: Sorting a big file



I had thought of splitting the file, but I don't think that alone would
work if a later section had lines that really should be at the top of the
sort.

SPLIT DATA:
12
14
16
13

34
54
21
10

would create a file:

12
13
14
16
10
21
34
54

which really isn't sorted.
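(The missing step is a merge of the sorted chunks; a sketch using GNU
sort's -m option on the example data above:)

```shell
# The two chunks from the example, each sorted on its own.
printf '12\n14\n16\n13\n' | sort > chunk1   # 12 13 14 16
printf '34\n54\n21\n10\n' | sort > chunk2   # 10 21 34 54

# Plain concatenation gives the unsorted result shown above:
cat chunk1 chunk2

# Merging the sorted chunks produces a fully sorted file:
sort -m chunk1 chunk2   # 10 12 13 14 16 21 34 54
```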

It looks like the UNIX sort command works on the whole file, though.
I didn't know that command.  Thanks to everyone who pointed it out to me.

Dan.
----------------------------------------------------------
Immigration is the sincerest form of flattery.
- Unknown
----------------------------------------------------------


On Fri, 27 Aug 2004, Brian Hatch wrote:

>
>
> >
> > I have a large file (~1 million lines, ~142MB) which I need to sort (any
> > order is fine, just so the lines are in a repeatable order).
> >
> > Just using perl's 'sort' on the file read into an array eats up all the
> > RAM and swap on my box and crashes.  I'm also trying tying the file as
> > an array, but it looks like that is also going to use up all the
> > memory.  Does anyone know some other methods I could try?
>
> #!/bin/sh
> FILE=whatever
>
> split $FILE section.
> for file in section.*
> do
>     sort $file > $file.sorted
> done
> sort -m section.*.sorted > sorted-version
>
> rm section.*
>
>
> --
> Brian Hatch                  "Londo, do you know where you are?"
>    Systems and               "Either in Medlab, or in Hell.
>    Security Engineer          Either way, the decor needs work."
> http://www.ifokr.org/bri/
>
> Every message PGP signed
>

_____________________________________________________________
Seattle Perl Users Group Mailing List
POST TO: spug-list at mail.pm.org  http://spugwiki.perlocity.org
ACCOUNT CONFIG: http://mail.pm.org/mailman/listinfo/spug-list
MEETINGS: 3rd Tuesdays, Location Unknown
WEB PAGE: http://www.seattleperl.org



