[DFW.pm] hard links are not dupes!
Tommy Butler
dfwpm at internetalias.net
Mon Dec 23 11:28:57 PST 2013
We've had a lot of list sign-ups lately in relation to the
hackathon, so this is just a reminder of things already discussed at our
last meeting, which you may have missed.
* You are deduping an ext4 filesystem and that's all we're saying
about it. You have server access so if you want to poke it, you can ;-)
* The test data you are working with is peppered randomly with hard
and soft links.
* Hard links to a file are not duplicates. They point to the same
underlying storage and should therefore be treated as files
already de-duped. We are aware that technically hard links _are_
files in and of themselves, but they are metadata and not storage.
You'll have to decide how to optimize for this. It's a very tricky
tradeoff. Don't base your decision on how frequently links occur in
the test dataset; the final dataset will not be identical and is not
even guaranteed to be similar.
* Symlinks are also NOT duplicates.
* Your code is indeed going to face /directory/ symlinks as well as
file links, so you'll need to take care not to get stuck in
directory recursion loops.
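
For anyone new to the list, here is a rough sketch of one way to honor the
link rules above. It is not the hackathon's reference approach, only an
illustration under a few assumptions: the function name scan() is made up,
Python is used purely for brevity, and the idea is that two paths with the
same (device, inode) pair are hard links to one piece of storage, while
symlinks are recorded but never followed, so a directory symlink cannot
cause a recursion loop.

```python
import os
import stat
from collections import defaultdict


def scan(root):
    """Walk root without following symlinks; group regular files by inode.

    Hypothetical helper for illustration only.
    """
    by_inode = defaultdict(list)  # (st_dev, st_ino) -> every path naming that inode
    symlinks = []                 # symlinks are noted here and never followed

    # followlinks=False means os.walk will not descend into directory symlinks.
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        # Directory symlinks show up in dirnames, file symlinks in filenames.
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)   # lstat describes the link itself, not its target
            if stat.S_ISLNK(st.st_mode):
                symlinks.append(path)
            elif stat.S_ISREG(st.st_mode):
                by_inode[(st.st_dev, st.st_ino)].append(path)
            # Real directories and special files are skipped here.

    return by_inode, symlinks
```

Each key in by_inode is one unit of storage, so content comparison only
needs to happen between different keys, never between paths under the same
key. Perl's built-in stat and lstat return the same device and inode
fields, so the idea carries over directly.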
--
Tommy Butler, John Fields