[DFW.pm] what is a hard link, and what should my deduper do with them?

Tommy Butler dfwpm at internetalias.net
Mon Dec 30 15:26:32 PST 2013


On 12/30/2013 04:28 PM, Tom Metro wrote:
> Tommy Butler wrote:
>> ...other hard links should be considered, as already
>> stated in the rules, "files already deduped".
>>
>>       SCENARIO:
>>
>> The three files below have identical content:
>> /foo/bar/baz.txt -> ( inode 12345 )
>> /foo/car/daz.txt -> ( inode 12345 )
>> /foo/far/gaz.txt -> ( inode 67890 )
>>
>>
>>       OUTCOME:
>>
>> /foo/far/gaz.txt should be reported as a duplicate of /foo/bar/baz.txt
>> because /foo/bar/baz.txt comes before /foo/car/daz.txt in a sort and
>> because /foo/car/daz.txt is a hard link.
>
> So then the output might look like:
> /foo/bar/baz.txt /foo/far/gaz.txt
YES! :)

> while /foo/car/daz.txt is simply eliminated from consideration and not
> output at all?
Yep.
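
For anyone who wants to see that rule spelled out, here is a minimal
sketch in Python.  It is not a reference implementation and not part of
the contest rules.  The function name find_duplicates, the use of
SHA-256, and reading each file whole into memory are all assumptions
made for illustration.  Paths that share a (device, inode) pair are
treated as hard links, and only the first of them in sort order is
considered.

```python
import hashlib
import os
from collections import defaultdict


def find_duplicates(paths):
    """Return one list of paths per group of files with identical content.

    Paths sharing a (device, inode) pair are hard links to one file, so
    only the first such path in sorted order is considered; the others
    count as "files already deduped" and are not reported.
    """
    seen_inodes = set()
    by_digest = defaultdict(list)
    for path in sorted(paths):
        st = os.stat(path)
        key = (st.st_dev, st.st_ino)
        if key in seen_inodes:
            continue  # hard link to a file we have already considered
        seen_inodes.add(key)
        # Assumption: files are small enough to read in one go.
        with open(path, "rb") as fh:
            digest = hashlib.sha256(fh.read()).hexdigest()
        by_digest[digest].append(path)
    # Only groups with more than one distinct file are duplicates.
    return [group for group in by_digest.values() if len(group) > 1]
```

Run against the three files in the scenario, this should yield a single
group holding /foo/bar/baz.txt and /foo/far/gaz.txt, with
/foo/car/daz.txt left out.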

> The problem with this approach, if you are striving for a useful tool
> and not just a programming exercise, is that you don't know which of
> the aliases is the name most familiar to the user who will be
> reviewing the report.
And just when I've about finalized the output spec and created an output
file to put up on the git repo for diffing ... this.  :-)

You are right, insofar as we are working to develop a useful tool and
not throw-away code to use once for a competition.  I won't pick
nits over the likelihood of a real-world scenario where hard links exist
in Joe User's music collection.  However, for the sake of simplicity,
we're not going to require contestants to go this extra mile at this
time.  Everyone is free to implement an output format that reports hard
link groupings, and to do so for unredeemable "bonus" points.  Anyone
who, like me, wants a useful tool when they are done with their code
should strive to make it as robust and feature-ful as possible
without sacrificing too much performance.  After all, there is a
prize category for code that provides the most comprehensive feature set.
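
As a rough idea of what reporting hard link groupings could involve,
here is a small Python sketch.  The name hard_link_groups and the shape
of its return value are hypothetical; no output format for this bonus
feature has been specified.  It only collects the path names that point
at the same (device, inode) pair, so a report could list every alias
rather than just the first one in sort order.

```python
import os
from collections import defaultdict


def hard_link_groups(paths):
    """Return one sorted list of names per file that has several names."""
    by_inode = defaultdict(list)
    for path in sorted(paths):
        st = os.stat(path)
        # Same device and inode number means the same underlying file.
        by_inode[(st.st_dev, st.st_ino)].append(path)
    return [names for names in by_inode.values() if len(names) > 1]
```

Each returned group lists the aliases in sort order, so the first entry
matches the name the basic report would have used.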

As promised recently, a finalized "expected" output format (against
which the output of each contestant's code will be diff'd) is
forthcoming.  It seems like a glaring oversight that this wasn't part of
the original rules specification.  I don't fault myself too much for
this, given that the output format is so simple and that, up to
this point, everyone has been expected to code against the problem and
not the output.  Watch for that email later this evening.

Thanks Tom, and thanks all!

--Tommy Butler