[mplspm]: In a sorted list [better]

Dion Almaer dion at almaer.com
Fri Mar 22 14:31:45 CST 2002


Why not use "uniq" (or even "sort -u") ?

D

> -----Original Message-----
> From: owner-mpls at pm.org [mailto:owner-mpls at pm.org]On Behalf Of Josh Aas
> Sent: Friday, March 22, 2002 11:57 AM
> To: mpls at pm.org
> Subject: Re: [mplspm]: In a sorted list [better]
> 
> 
> Thanks a lot guys! Truthfully, I found a much better algorithm for my bigger
> problem that does not involve a binary search, but it has been fun learning
> about it. Surprisingly, it does not take that much time or even memory to
> sort 2000000 strings. It makes no sense to me, but it works perfectly. For
> the curious, my bigger problem was to take all the words in a 16 MB text
> file (there was one word on each line) and remove all of the duplicates. I
> wasn't using my head the first time around. Here's my app, that does all the
> sorting and writes to file. I'm surprised that this works so quickly and
> without fail... Anyone know how that sort routine finishes in about 6
> seconds when given 2 million strings? When I used the algorithm that
> involved binary searches, my app would have taken about 20 hours (as opposed
> to seconds) to complete. Go perl!
> -Josh
> 
> #!/usr/bin/perl
> 
> print "Loading file...\n";
> 
> open (DATA, "/Users/Josh/Big_List.txt") || &CgiDie ("Cannot open
> wordlist.");
> my @data = <DATA>;
> close (DATA);
> 
> print "Loading file done.\n";
> 
> print "Cleaning...\n";
> foreach (@data) {
>     chomp;
> }
> print "Done cleaning.\n";
> 
> print "Sorting...\n";
> @data = sort @data;
> print "Done sorting.\n";
> 
> $wordcount = scalar(@data);
> print "There are ";
> print $wordcount;
> print " words in the file.\n";
> $duplicates = 0;
> $unique = 0;
> 
> for ($i = 0; $i < $wordcount; $i++) {
>     $newdata[$unique] = $data[$i];
>     $unique++;
>     $b = 1;
>     while (1) {
>         if ($data[$i] eq $data[$i + $b]) {
>             $b++;
>         }
>         else {
>             $i += $b;
>             $duplicates += $b - 1;
>             last;
>         }
>     }
> }
> 
> open (ADDFILE,">>/Users/josh/Good_Wordlist.txt") || &CgiDie ("Cannot open
> write to edited wordlist.");
> flock(ADDFILE, LOCK_EX);
> foreach (@newdata) {
>     print ADDFILE "$_\n";
> }
> close (ADDFILE);
> 
> print "Duplicates removed: ";
> print $duplicates;
> print "\n";
> print "Done!\n";
> 
> exit;
> 
> 
> 
> --------------------------------------------------
> Minneapolis Perl Mongers mailing list
> 
> To unsubscribe, send mail to majordomo at pm.org
> with "unsubscribe mpls" in the body of the message.
> 


--------------------------------------------------
Minneapolis Perl Mongers mailing list

To unsubscribe, send mail to majordomo at pm.org
with "unsubscribe mpls" in the body of the message.



More information about the Mpls-pm mailing list