[sf-perl] shuffling large numbers of image files

Rich Morin rdm at cfcl.com
Mon Oct 16 09:57:47 PDT 2006

I'm helping a client organize a collection of image files which are
currently stored on hundreds of (floppy, Jaz, Zip) disks.  Although
the files may contain useful metadata, they are largely opaque to
mechanized analysis (e.g., keyword search is OUT).

The client is using Mac OS X, so a wide range of tools (e.g., iPhoto,
Perl scripts, Ruby on Rails, Spotlight) can be applied.  I'm hoping
for some suggestions on what kinds of tools might be most useful.

The first step is to capture the files on a single (500 GB) disk, in
a manner that retains the origin information (i.e., media type and
external inscriptions).  This will result in a tree of the form:

      A Big Project, 2001
        Big Project #1
        Big Project #2
      A Quick Sketch, 2000

The next step is to create a re-organized tree that facilitates
use of the material.  In all likelihood, this will be organized
by project within year, as:

      A Quick Sketch, 2000
      A Big Project, 2001
        Big Project #1
        Big Project #2
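FWIW, a quick Perl sketch could preview that project-within-year
ordering, assuming the "Name, YYYY" folder-naming convention shown
above (folders without a year suffix just sort last, by name):

```perl
use strict;
use warnings;

# Sort "Project Name, YYYY" folder names by their trailing year,
# falling back to name order when no year suffix is present.
sub by_year {
    return sort {
        my ($ya) = $a =~ /,\s*(\d{4})\s*$/;
        my ($yb) = $b =~ /,\s*(\d{4})\s*$/;
        (defined $ya ? $ya : 9999) <=> (defined $yb ? $yb : 9999)
            or $a cmp $b;
    } @_;
}
```

Feeding it the two example folders yields "A Quick Sketch, 2000"
before "A Big Project, 2001", as in the tree above.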

My suggestion for this step is to create a copy of the Old_Media
tree (Tmp_Image) and a target directory (Old_Image).  Then, drag
images and/or folders from Tmp_Image to Old_Image (or Trash).

Eventually, Tmp_Image will contain nothing of interest, so it can
be discarded.  At this point, I'm recommending that the client copy
Old_Image to Cur_Image and then use iPhoto to further organize and
annotate the images.

FWIW, the OS X Finder will display thumbnails of selected files.
Double-clicking a file brings up Preview, which can pan and zoom
through the image.  Finally, Get Info and Spotlight allow access
to file metadata.

It is very likely that the Old_Media tree will contain identical
(sub-trees of) image files.  In most cases, this will result from
successive backups of projects, copying of folders and files, etc.
Although some file and folder names may change, most will not.

Even ignoring the file names, the image files can be matched up by
their content.  For example, I can create an MD5 checksum for each
file, look for matching checksums, and then (as a safety net) do a
bit-for-bit comparison of putative duplicates.
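Here's a minimal sketch of that checksum pass, using File::Find and
Digest::MD5 (both ship with Perl).  It only *groups* putative
duplicates; the bit-for-bit safety check would still follow:

```perl
use strict;
use warnings;
use File::Find;
use Digest::MD5;

# Walk a tree and group files by the MD5 checksum of their contents.
# Returns a hashref of checksum => [ paths ], keeping only checksums
# shared by two or more files -- the putative duplicates.
sub find_duplicates {
    my ($root) = @_;
    my %by_sum;
    find(sub {
        return unless -f $_;
        open my $fh, '<', $_ or return;
        binmode $fh;
        my $sum = Digest::MD5->new->addfile($fh)->hexdigest;
        close $fh;
        push @{ $by_sum{$sum} }, $File::Find::name;
    }, $root);
    delete @by_sum{ grep { @{ $by_sum{$_} } < 2 } keys %by_sum };
    return \%by_sum;
}
```

Running this over Old_Media and iterating over the resulting groups
would feed the comparison step directly.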

However, a report of all duplicate files might well swamp the user
in data.  It would be better to identify and present duplicate (or
evolving) folders and let the user determine which one(s) to save.
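One way to lift the comparison from files to folders is to roll the
file checksums up into a per-folder signature.  A sketch (names and
helper are my own, not an established API):

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Checksum a single file's contents.
sub md5_file {
    my ($path) = @_;
    open my $fh, '<', $path or return '';
    binmode $fh;
    my $sum = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $sum;
}

# Combine a folder's file checksums (and subfolder signatures) into
# one signature.  The parts are sorted before hashing, so two folders
# with identical contents match even if the file names differ.
sub folder_signature {
    my ($dir) = @_;
    opendir my $dh, $dir or return '';
    my @parts;
    for my $name (grep { !/^\.\.?$/ } readdir $dh) {
        my $path = "$dir/$name";
        push @parts, -d $path ? folder_signature($path)
                              : md5_file($path);
    }
    closedir $dh;
    return md5_hex(join "\n", sort @parts);
}
```

Folders sharing a signature could then be reported as a single group,
which keeps the report at the folder level rather than drowning the
user in per-file matches.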

Although the identification is a bit tricky, I'm sure that I can
handle it.  The hard part, however, is deciding exactly what
information to present and how to present it.  Suggestions?

Also, any other ideas on approaches and/or tools are solicited.

http://www.cfcl.com/rdm            Rich Morin
http://www.cfcl.com/rdm/resume     rdm at cfcl.com
http://www.cfcl.com/rdm/weblog     +1 650-873-7841

Technical editing and writing, programming, and web development
