[DFW.pm] Deduplication Hackathon: Formal Output Specification

Tommy Butler dfwpm at internetalias.net
Mon Dec 30 18:40:52 PST 2013


For your deduplication hackathon code entry, the output of your Perl app
should be as follows:

 1. Each group of duplicates should be sorted by filename and printed
    all on one line, delimited by a tab character.
 2. The lines of output should be sorted.
 3. The sort you should use for both the lines of output and the file
    name groupings themselves is: sort { $a cmp $b }
 4. Any output leading up to a delimiter of 30 dashes on its own line
    will be ignored.  Any output coming after a second line of 30
    dashes is also ignored.  These delimiter lines are optional if
    your output consists solely of the sorted results and nothing
    else.  Otherwise, use the space before the first delimiter for
    status messages or a status indicator (progress bar, etc.), and
    optionally follow your results with a summary of what your code
    encountered.  See the example at the bottom of this message.
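
As a rough illustration of rules 1 through 3, here is a minimal sketch
in Perl.  The helper name format_dupe_groups and the sample data are
hypothetical (they are not taken from the reference code); the only
things taken from the spec are the tab delimiter and the
sort { $a cmp $b } comparison.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the reference code): takes a list of
# array refs, each holding the filenames of one group of duplicates,
# and returns the output lines required by rules 1-3.
sub format_dupe_groups {
    my @groups = @_;

    # Rules 1 and 3: sort the names inside each group, join with tabs.
    my @lines = map { join "\t", sort { $a cmp $b } @$_ } @groups;

    # Rules 2 and 3: sort the finished lines with the same comparison.
    my @sorted = sort { $a cmp $b } @lines;

    return @sorted;
}

# Illustrative input only; a real entry would build this from a scan.
my @sample_groups = (
    [ './foo', './baz', './bar' ],
    [ './.git/refs/remotes/origin/master', './.git/refs/heads/master' ],
);

print "$_\n" for format_dupe_groups(@sample_groups);
```

Sorting inside each group before sorting the lines matters: the line
order depends on the first filename of each group, so doing it the
other way around could give a different result.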

Your code can actually output whatever it wants, so long as there is a
way to invoke it that produces output according to the spec outlined
above.

An example is provided in the lines below, and in the screenshot that
follows.  This output is generated by the code as found on github at
https://github.com/tommybutler/dupfind

In just a few minutes I will put up on github (at the same url) the
correct output for the reference data that is currently on the contest
server under /dedup.  Please take time to compare your code output to
the output of the "reference design" code on github.  If your output is
not identical, then you will be disqualified for producing incorrect
results.  If you believe the reference design is incorrect, then
please submit a bug report and/or a patch!!
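
When comparing against the reference output, it may help to strip the
ignored status and summary sections first.  The following is only a
sketch of how the delimiter rule (item 4) could be applied; the helper
name extract_results is hypothetical, and treating a lone delimiter as
"results run to the end of the output" is an assumption, since the
spec does not say what happens in that case.

```perl
use strict;
use warnings;

# Hypothetical helper: returns only the result lines from a program's
# output, applying the 30-dash delimiter rule as read here.
sub extract_results {
    my @lines = @_;
    chomp @lines;

    my $delim = '-' x 30;

    # Indexes of every line that consists of exactly 30 dashes.
    my @marks = grep { $lines[$_] eq $delim } 0 .. $#lines;

    # No delimiter at all: the whole output is taken to be results.
    return @lines unless @marks;

    my $start = $marks[0] + 1;

    # Assumption: with only one delimiter, results run to the end.
    my $end = @marks > 1 ? $marks[1] - 1 : $#lines;

    my @results = @lines[ $start .. $end ];
    return @results;
}

# Illustrative output only, modelled on the example in this message.
my @sample_output = (
    "** SCANNING ALL FILES\n",
    ( '-' x 30 ) . "\n",
    "./bar\t./baz\t./foo\n",
    ( '-' x 30 ) . "\n",
    "** TOTAL SCANNED: 86\n",
);

print "$_\n" for extract_results(@sample_output);
```

Running the same filter over both your own output and the reference
output leaves two lists of result lines that can be compared directly.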

--Tommy Butler
------------------------------------------------------------------------

    $ ./dupfind --format robot --dir .
    ** SCANNING ALL FILES
    ** CHECKSUMMING SIZE DUPLICATES
    ** DISPLAYING OUTPUT
    ------------------------------
    ./.git/logs/HEAD    ./.git/logs/refs/heads/master
    ./.git/refs/heads/master    ./.git/refs/remotes/origin/master
    ./bar    ./baz    ./foo
    ------------------------------
    ** TOTAL SCANNED: 86
    ** TOTAL DUPES:   4
    ** SCAN TIME:     0.00824308 wallclock secs ( 0.00 usr +  0.01 sys =  0.01 CPU)
    ** DELETION TIME: 0

------------------------------------------------------------------------


-------------- next part --------------
A non-text attachment was scrubbed...
Name: edihfghd.png
Type: image/png
Size: 490429 bytes
Desc: not available
URL: <http://mail.pm.org/pipermail/dfw-pm/attachments/20131230/c40610c5/attachment-0001.png>

