[DFW.pm] what is a hard link, and what should my deduper do with them?

Tom Metro tmetro+dfw-pm at gmail.com
Mon Dec 30 14:28:56 PST 2013


Tommy Butler wrote:
> ...other hard links should be considered, as already
> stated in the rules, "files already deduped".
>
>       SCENARIO:
>
> The three files below have identical content:
> /foo/bar/baz.txt -> ( inode 12345 )
> /foo/car/daz.txt -> ( inode 12345 )
> /foo/far/gaz.txt -> ( inode 67890 )
>
>
>       OUTCOME:
>
> /foo/far/gaz.txt should be reported as a duplicate of /foo/bar/baz.txt
> because /foo/bar/baz.txt comes before /foo/car/daz.txt in a sort and
> because /foo/car/daz.txt is a hard link.

So then the output might look like:
/foo/bar/baz.txt /foo/far/gaz.txt

while /foo/car/daz.txt is simply eliminated from consideration and not 
output at all?
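
For concreteness, here is a minimal sketch of that "eliminate the later aliases" rule, written in Python rather than Perl purely for illustration. It assumes two names are hard links to the same file when they share a device and inode number; the names file_identity and drop_aliases are made up for this example and are not from Tommy's rules.

```python
import os

def file_identity(path):
    """Return (device, inode), which all hard links to one file share."""
    st = os.lstat(path)
    return (st.st_dev, st.st_ino)

def drop_aliases(paths, identity=file_identity):
    """Keep only the first name, in sorted order, for each underlying file.

    `identity` is injectable so the rule can be tried without touching disk.
    """
    seen = set()
    kept = []
    for path in sorted(paths):
        key = identity(path)
        if key not in seen:
            seen.add(key)
            kept.append(path)
    return kept
```

Fed the three paths from the scenario, this keeps /foo/bar/baz.txt and /foo/far/gaz.txt and silently drops /foo/car/daz.txt.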

The problem with this approach, if you are striving for a useful tool 
and not just a programming exercise, is that you don't know which of the 
aliases is the name most familiar to the user who will be reviewing the 
report.

Another possibility is to report hard links in a way that visually 
groups them together: any place one member of a set of hard links would 
appear in the output, you replace it with the whole group:
(/foo/bar/baz.txt /foo/car/daz.txt) /foo/far/gaz.txt

(With members of the group being sub-sorted asciibetically, and the 
first member of the group being used as the key when sorting the overall 
list of duplicates.)
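
A rough sketch of that formatting rule, again in Python for illustration only. It assumes the duplicate set has already been split into alias groups (one list of names per inode); format_duplicate_line is a hypothetical name. Python's default string sort compares by code point, which matches asciibetical order for ASCII names.

```python
def format_duplicate_line(alias_groups):
    """Render one set of duplicates, parenthesizing groups of hard links.

    alias_groups: list of lists; each inner list holds every name that
    refers to the same underlying file.
    """
    # Sub-sort the members of each group, then order the groups by
    # their first member.
    groups = sorted((sorted(g) for g in alias_groups), key=lambda g: g[0])
    parts = []
    for group in groups:
        if len(group) > 1:
            parts.append("(" + " ".join(group) + ")")
        else:
            parts.append(group[0])
    return " ".join(parts)
```

Called with [["/foo/far/gaz.txt"], ["/foo/car/daz.txt", "/foo/bar/baz.txt"]], it returns the line shown above.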

But this is still not quite ideal. It implies that you ignore sets of 
hard links that don't also have a duplicate file. Chances are good that 
a user interested in duplicates also wants to know what hard links 
(aliases) exist.

Plus, most characters you might choose for grouping could themselves 
appear in a file name, although the same is true of the space delimiter.

So instead, you could simply produce a report of hard links at the end, 
and any place a file that has multiple aliases appears in the duplicate 
report, you always show its asciibetically first name:

Duplicates:
/foo/bar/baz.txt /foo/far/gaz.txt
...

Aliases:
/foo/bar/baz.txt /foo/car/daz.txt
...
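
One way the two reports might be built, sketched in Python for illustration. It assumes each file has already been reduced to a (path, identity, digest) record, where identity stands for the device/inode pair and digest for a content hash. Both fields, and the name build_reports, are assumptions for this example rather than anything specified in the thread.

```python
from collections import defaultdict

def build_reports(records):
    """Return (duplicates, aliases) from (path, identity, digest) records.

    duplicates: lists of representative names with identical content
                but different identities.
    aliases:    lists of names that are hard links to the same file.
    """
    names_by_identity = defaultdict(list)
    digest_of = {}
    for path, identity, digest in records:
        names_by_identity[identity].append(path)
        digest_of[identity] = digest

    reps_by_digest = defaultdict(list)
    aliases = []
    for identity, names in names_by_identity.items():
        names.sort()
        # The first name in sorted order stands in for the whole alias group.
        reps_by_digest[digest_of[identity]].append(names[0])
        if len(names) > 1:
            aliases.append(names)

    duplicates = sorted(
        sorted(reps) for reps in reps_by_digest.values() if len(reps) > 1
    )
    return duplicates, sorted(aliases)
```

With the three files from the scenario, duplicates comes back as one line pairing /foo/bar/baz.txt with /foo/far/gaz.txt, and aliases as one line pairing /foo/bar/baz.txt with /foo/car/daz.txt. A set of hard links with no separate duplicate still shows up in the alias report.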

  -Tom

-- 
Tom Metro
The Perl Shop, Newton, MA, USA
"Predictable On-demand Perl Consulting."
http://www.theperlshop.com/

