[mplspm]: In a sorted list [better]

Josh Aas josha at mac.com
Fri Mar 22 12:57:06 CST 2002


Thanks a lot guys! Truthfully, I found a much better algorithm for my bigger
problem that does not involve a binary search, but it has been fun learning
about it. Surprisingly, it does not take that much time or even memory to
sort 2000000 strings. It makes no sense to me, but it works perfectly. For
the curious, my bigger problem was to take all the words in a 16 MB text
file (there was one word on each line) and remove all of the duplicates. I
wasn't using my head the first time around. Here's my app, that does all the
sorting and writes to file. I'm surprised that this works so quickly and
without fail... Anyone know how that sort routine finishes in about 6
seconds when given 2 million strings? When I used the algorithm that
involved binary searches, my app would have taken about 20 hours (as opposed
to seconds) to complete. Go perl!
-Josh

#!/usr/bin/perl

print "Loading file...\n";

open (DATA, "/Users/Josh/Big_List.txt") || &CgiDie ("Cannot open
wordlist.");
my @data = <DATA>;
close (DATA);

print "Loading file done.\n";

print "Cleaning...\n";
foreach (@data) {
    chomp;
}
print "Done cleaning.\n";

print "Sorting...\n";
@data = sort @data;
print "Done sorting.\n";

$wordcount = scalar(@data);
print "There are ";
print $wordcount;
print " words in the file.\n";
$duplicates = 0;
$unique = 0;

for ($i = 0; $i < $wordcount; $i++) {
    $newdata[$unique] = $data[$i];
    $unique++;
    $b = 1;
    while (1) {
        if ($data[$i] eq $data[$i + $b]) {
            $b++;
        }
        else {
            $i += $b;
            $duplicates += $b - 1;
            last;
        }
    }
}

open (ADDFILE,">>/Users/josh/Good_Wordlist.txt") || &CgiDie ("Cannot open
write to edited wordlist.");
flock(ADDFILE, LOCK_EX);
foreach (@newdata) {
    print ADDFILE "$_\n";
}
close (ADDFILE);

print "Duplicates removed: ";
print $duplicates;
print "\n";
print "Done!\n";

exit;



--------------------------------------------------
Minneapolis Perl Mongers mailing list

To unsubscribe, send mail to majordomo at pm.org
with "unsubscribe mpls" in the body of the message.



More information about the Mpls-pm mailing list