[mplspm]: In a sorted list [better]
Dion Almaer
dion at almaer.com
Fri Mar 22 14:31:45 CST 2002
Why not use "uniq" (or even "sort -u") ?
D
> -----Original Message-----
> From: owner-mpls at pm.org [mailto:owner-mpls at pm.org]On Behalf Of Josh Aas
> Sent: Friday, March 22, 2002 11:57 AM
> To: mpls at pm.org
> Subject: Re: [mplspm]: In a sorted list [better]
>
>
> Thanks a lot guys! Truthfully, I found a much better algorithm for my bigger
> problem that does not involve a binary search, but it has been fun learning
> about it. Surprisingly, it does not take that much time or even memory to
> sort 2000000 strings. It makes no sense to me, but it works perfectly. For
> the curious, my bigger problem was to take all the words in a 16 MB text
> file (there was one word on each line) and remove all of the duplicates. I
> wasn't using my head the first time around. Here's my app, that does all the
> sorting and writes to file. I'm surprised that this works so quickly and
> without fail... Anyone know how that sort routine finishes in about 6
> seconds when given 2 million strings? When I used the algorithm that
> involved binary searches, my app would have taken about 20 hours (as opposed
> to seconds) to complete. Go perl!
> -Josh
>
> #!/usr/bin/perl
>
> print "Loading file...\n";
>
> open (DATA, "/Users/Josh/Big_List.txt") || &CgiDie ("Cannot open
> wordlist.");
> my @data = <DATA>;
> close (DATA);
>
> print "Loading file done.\n";
>
> print "Cleaning...\n";
> foreach (@data) {
> chomp;
> }
> print "Done cleaning.\n";
>
> print "Sorting...\n";
> @data = sort @data;
> print "Done sorting.\n";
>
> $wordcount = scalar(@data);
> print "There are ";
> print $wordcount;
> print " words in the file.\n";
> $duplicates = 0;
> $unique = 0;
>
> for ($i = 0; $i < $wordcount; $i++) {
> $newdata[$unique] = $data[$i];
> $unique++;
> $b = 1;
> while (1) {
> if ($data[$i] eq $data[$i + $b]) {
> $b++;
> }
> else {
> $i += $b;
> $duplicates += $b - 1;
> last;
> }
> }
> }
>
> open (ADDFILE,">>/Users/josh/Good_Wordlist.txt") || &CgiDie ("Cannot open
> write to edited wordlist.");
> flock(ADDFILE, LOCK_EX);
> foreach (@newdata) {
> print ADDFILE "$_\n";
> }
> close (ADDFILE);
>
> print "Duplicates removed: ";
> print $duplicates;
> print "\n";
> print "Done!\n";
>
> exit;
>
>
>
> --------------------------------------------------
> Minneapolis Perl Mongers mailing list
>
> To unsubscribe, send mail to majordomo at pm.org
> with "unsubscribe mpls" in the body of the message.
>
--------------------------------------------------
Minneapolis Perl Mongers mailing list
To unsubscribe, send mail to majordomo at pm.org
with "unsubscribe mpls" in the body of the message.
More information about the Mpls-pm
mailing list