[pm-h] maildir - remove duplicate messages

G. Wade Johnson gwadej at anomaly.org
Sun Mar 28 07:06:04 PDT 2010


On Sun, 28 Mar 2010 11:28:29 +0100
"Russell L. Harris" <rlharris at oplink.net> wrote:

> I have on the order of 10 GB of mail files.
> 
> Most of the files are in maildir format; a few are in mbox format.
> 
> The system is Debian GNU/Linux.
> 
> I would like to eliminate duplicate messages.  There appear to be, on 
> the average, perhaps four or five copies of each message.
> 
> I also would like to sort the messages on the To: and From: fields, 
> saving only certain matches.
> 
> I have been searching with Google for "maildir delete duplicate
> perl", but I have not yet found a script which looks promising.
> 
> Is there a good standard approach, script, or application for this
> problem?

I would probably take a multi-step approach. I would look for a module
on CPAN that reads the maildir format (for example,
Email::Folder::Maildir, which I found from search.cpan.org).

I would use that to match the To and From fields and remove any that I
didn't want.
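
A rough sketch of that filtering step, assuming Email::Folder is
installed from CPAN and that, as I understand it, it picks the maildir
reader when given a directory. The sub names (wanted,
matching_messages) and the example.org pattern are made up for
illustration. This only selects messages; it does not track file
paths, so actually removing the rest would need more work.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# True if the message's From: and To: headers match the given patterns.
# $msg is anything with an Email::Simple-style header() method.
sub wanted {
    my ($msg, $from_re, $to_re) = @_;
    my $from = $msg->header('From');
    my $to   = $msg->header('To');
    $from = '' unless defined $from;    # treat a missing header as empty
    $to   = '' unless defined $to;
    return ($from =~ $from_re && $to =~ $to_re) ? 1 : 0;
}

# Return the messages in a maildir whose headers match both patterns.
sub matching_messages {
    my ($maildir, $from_re, $to_re) = @_;
    require Email::Folder;    # CPAN module, loaded only when needed
    my $folder = Email::Folder->new($maildir);
    return grep { wanted($_, $from_re, $to_re) } $folder->messages;
}

# Example run: list mail from a (hypothetical) domain to anyone.
if (!caller && @ARGV) {
    my ($maildir) = @ARGV;
    for my $msg (matching_messages($maildir, qr/\@example\.org/i, qr/./)) {
        printf "%s -> %s\n", $msg->header('From'), $msg->header('To');
    }
}
```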

The best way to find duplicates is probably through the use of a
message digest and a hash. Walk the messages, passing each through
Digest::SHA1 or Digest::MD5 and use the result as the key to a hash.

If the digest already exists in the hash, the message is a duplicate,
so delete it. If not, add the digest to the hash.
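
A rough sketch of the digest-and-hash step, assuming a standard
maildir layout (one file per message under cur/ and new/) and using
the core Digest::SHA module in SHA-1 mode in place of Digest::SHA1.
The sub name find_duplicates and the --delete flag are made up for
illustration. Hashing the whole file only catches byte-identical
copies; copies that arrived by different routes may differ in their
Received: headers and would need a different key, such as the
Message-ID.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::SHA;

# Return the paths of messages whose content was already seen.
# The first file with a given digest is kept; later ones are reported.
sub find_duplicates {
    my @maildirs = @_;
    my (%seen, @dupes);
    for my $maildir (@maildirs) {
        for my $sub (qw(cur new)) {
            opendir(my $dh, "$maildir/$sub") or next;
            for my $name (sort readdir $dh) {
                my $file = "$maildir/$sub/$name";
                next unless -f $file;    # skips . and .. as well
                my $digest = Digest::SHA->new(1)->addfile($file)->hexdigest;
                push @dupes, $file if $seen{$digest}++;
            }
            closedir $dh;
        }
    }
    return @dupes;
}

# Dry run by default: print duplicates. With --delete, also unlink them.
sub main {
    my @args = @_;
    my $delete = @args && $args[0] eq '--delete';
    shift @args if $delete;
    for my $file (find_duplicates(@args)) {
        print "$file\n";
        if ($delete) {
            unlink $file or warn "cannot remove $file: $!\n";
        }
    }
}

main(@ARGV) unless caller;
```

Running it first without --delete and checking the list seems the
safer order, since unlink cannot be undone.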

Admittedly, that's just an outline of an approach, but it should get
you started.

G. Wade
-- 
The purpose of software engineering is to control complexity, not to
create it.                                          -- Dr. Pamela Zave

