[DFW.pm] hard links are not dupes!

Tommy Butler dfwpm at internetalias.net
Mon Dec 23 11:28:57 PST 2013


We've had a lot of list sign-ups lately in connection with the
hackathon, so this is just a reminder of things already discussed at our
last meeting, which you may have missed.

  * You are deduping an ext4 filesystem and that's all we're saying
    about it.  You have server access so if you want to poke it, you can ;-)
  * /*The test data you are working with is peppered randomly with hard
    and soft links.*/
  * Hard links to files are not duplicates.  They point to the same
    underlying storage and should therefore be treated as files
    already de-duped.  We are aware that technically each hard link _is_
    a file name in its own right, but it is only metadata (a directory
    entry pointing at the same inode), not additional storage.
    You'll have to decide how to optimize for this.  It's a very tricky
    tradeoff.  Don't base your decision on how frequently links occur in
    the test dataset; the final dataset will not be identical and is not
    even guaranteed to be similar.
  * Symlinks are also NOT duplicates.
  * Your code is indeed going to face /directory/ symlinks as well as
    file links, so you'll need to take care not to get stuck in
    directory recursion loops.
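
For anyone who wants to see the last three points in code, here is a
minimal sketch, not a reference solution and not necessarily in the
language you'll use.  It assumes Python 3 on a POSIX filesystem such as
ext4, and the function names (scan_unique_files, build_demo_tree) are
made up for illustration.  The idea: never follow symlinks while
walking, so directory symlinks cannot cause loops, and group regular
files by (device, inode) so that hard links collapse into one entry.

```python
import os
import stat
import tempfile


def scan_unique_files(root):
    """Walk root without following symlinks.

    Returns {(st_dev, st_ino): [paths]}; every path in one list is a
    hard link to the same underlying storage.
    """
    by_inode = {}
    stack = [root]
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                # lstat semantics: inspect the link itself, not its target
                st = entry.stat(follow_symlinks=False)
                if stat.S_ISLNK(st.st_mode):
                    continue  # file or directory symlink: skip, never descend
                if stat.S_ISDIR(st.st_mode):
                    stack.append(entry.path)
                elif stat.S_ISREG(st.st_mode):
                    key = (st.st_dev, st.st_ino)
                    by_inode.setdefault(key, []).append(entry.path)
    return by_inode


def build_demo_tree(root):
    """Create a tiny tree with a hard link, a real copy, and symlinks."""
    original = os.path.join(root, "a.txt")
    with open(original, "w") as fh:
        fh.write("same bytes\n")
    os.link(original, os.path.join(root, "hardlink_to_a.txt"))
    with open(os.path.join(root, "copy_of_a.txt"), "w") as fh:
        fh.write("same bytes\n")  # true duplicate: same content, new inode
    os.symlink(original, os.path.join(root, "symlink_to_a.txt"))
    os.mkdir(os.path.join(root, "sub"))
    # directory symlink pointing back up the tree (a potential loop)
    os.symlink(root, os.path.join(root, "sub", "loop"))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as demo_root:
        build_demo_tree(demo_root)
        for key, paths in scan_unique_files(demo_root).items():
            print(key, sorted(os.path.basename(p) for p in paths))
```

In the demo tree, a.txt and hardlink_to_a.txt end up in the same group,
copy_of_a.txt gets a group of its own (that one is the real duplicate),
and both symlinks are ignored.  Only one path per group would then need
its content hashed or compared.  Whether building this inode map is
worth the extra bookkeeping is exactly the tradeoff mentioned above.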

-- 
Tommy Butler, John Fields

