From dfwpm at internetalias.net Tue Dec 10 15:29:29 2013
From: dfwpm at internetalias.net (Tommy Butler)
Date: Tue, 10 Dec 2013 17:29:29 -0600
Subject: [DFW.pm] Meeting Announcement - Hackathon - Bring Your A Game
Message-ID: <52A7A3D9.2090004@internetalias.net>
Fellow Perl Mongers,
Tomorrow night at _*7 PM*_ we meet for the month of December. We will
be holding a hackathon of sorts, a competitive educational event. The
topic and objective of the competition will be announced at the
beginning of the meeting, which will also be broadcast live in a Google
Hangout which you can join remotely (link and info for the hangout will
be sent to this mailing list about an hour or two before the meeting).
Instructions and assistance will be provided for beginners or anyone
else who needs help. Participation is not mandatory; you can just watch
or fly wingman for your favorite competitor if you want.
Bring with you:
* Laptop
* Friend, colleague, or someone you mentor
* Google hangouts browser plugin or mobile app (please install it
  before the meeting)
* github account (please set this up before the meeting)
* SSH client ready to connect to a Linux server (Windows users can use
  PuTTY for free)
* filezilla or your SFTP software of choice
* Your Mad Perl skills!
Location Info:
2995 Ladybird Lane, Dallas, TX
www.dallasmakerspace.org
(214) 699-6537
See you at 7 PM!
--Tommy Butler
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From dfwpm at internetalias.net Tue Dec 10 19:06:18 2013
From: dfwpm at internetalias.net (Tommy Butler)
Date: Tue, 10 Dec 2013 21:06:18 -0600
Subject: [DFW.pm] Meeting Announcement - Hackathon - Bring Your A Game
In-Reply-To: <52A7A3D9.2090004@internetalias.net>
References: <52A7A3D9.2090004@internetalias.net>
Message-ID: <52A7D6AA.6030806@internetalias.net>
If you're going to participate in tomorrow's contest and you will
require a perlbrew of your own, please let me know as soon as possible
and I'll arrange a brew with v5.18.1 x64.
If you aren't familiar with perlbrew and don't anticipate needing an
isolated Perl environment for yourself and/or are content to use the
default system Perl, you can safely disregard this message.
If anyone feels this puts contestants on uneven footing, rest assured
that we'll re-run code benchmarks against the same Perl if there are any
"close" results.
--Tommy Butler
On 12/10/2013 05:29 PM, Tommy Butler wrote:
> Fellow Perl Mongers,
>
> Tomorrow night at _*7 PM*_ we meet for the month of December. We will
> be holding a hackathon of sorts, a competitive educational event.
From noreply-f3605238 at plus.google.com Wed Dec 11 12:06:30 2013
From: noreply-f3605238 at plus.google.com (Tommy Butler (Google+))
Date: Wed, 11 Dec 2013 12:06:30 -0800 (PST)
Subject: [DFW.pm] Tommy Butler invited you to DFW Perl Mongers Hackathon
References:
Message-ID:
Tommy Butler invited you to DFW Perl Mongers Hackathon
Wed, December 11, 7:00 PM CST
Rob Hoelz, Mark Jason Dominus, Joel Bernstein and 120 more are invited
View Invitation:
https://plus.google.com/_/notifications/ngemlink?&emid=CLiRzJ78qLsCFUjjQAodcSoAAA&path=%2Fevents%2Fc84tv7hiosleru73tm3tm5el0k0%3Fgpinv%3DAMIXal9uIa-qOfX1wiKDU_jSPRMMDAX0UKZ_t_UHSY05ePoxcfqKJ1TDDqiXGc8wP2r5PSojHkNC0LhSNtWKcUp6syyLhZBT-CY7B6GjS45xRQ4sufyMblY%26gpsrc%3Dgpev0&dt=1386792391250&uob=14
Join us online for the launch of our DFW.pm "winter of code" hackathon
competition (in which you too are invited to participate). The meeting
proper will happen in person at 7 pm tonight in our usual meeting place:
2995 Ladybird Lane, Dallas, TX
www.dallasmakerspace.org
(214) 699-6537
A presentation will be given to reveal the objective of the competition and
assistance will be provided in setting up access to the git repository and
code contest server.
Bring a laptop, bring a friend, and bring your Perl skills! See you
tonight!
** Please have your google hangouts browser plugin or mobile app installed
and ready to go.
From dfwpm at internetalias.net Wed Dec 11 20:16:10 2013
From: dfwpm at internetalias.net (Tommy Butler)
Date: Wed, 11 Dec 2013 22:16:10 -0600
Subject: [DFW.pm] NY.pm joining our hackathon,
and slides from tonight's meeting
In-Reply-To: <52A7D6AA.6030806@internetalias.net>
References: <52A7A3D9.2090004@internetalias.net>
<52A7D6AA.6030806@internetalias.net>
Message-ID: <52A9388A.8050408@internetalias.net>
The slides from tonight's meeting can be viewed --> here
David Golden from NY.pm is extending our contest to his own Perl Mongers
group so they can compete!
To get your ssh login set up on the competition server and for other
setup assistance, send an email to dfwpm at internetalias dot net
Thank you all for your participation. This is going to be fun!
--Tommy Butler
On 12/10/2013 09:06 PM, Tommy Butler wrote:
> If you're going to participate in tomorrow's contest and you will
> require a perlbrew of your own, please let me know as soon as possible
> and I'll arrange a brew with v5.18.1 x64.
>
> If you aren't familiar with perlbrew and don't anticipate needing an
> isolated Perl environment for yourself and/or are content to use the
> default system Perl, you can safely disregard this message.
>
> If anyone feels this puts contestants on uneven footing, rest assured
> that we'll re-run code benchmarks against the same Perl if there are
> any "close" results.
>
> --Tommy Butler
>
> On 12/10/2013 05:29 PM, Tommy Butler wrote:
>> Fellow Perl Mongers,
>>
>> Tomorrow night at _*7 PM*_ we meet for the month of December. We
>> will be holding a hackathon of sorts, a competitive educational event.
From dfwpm at internetalias.net Thu Dec 12 13:53:59 2013
From: dfwpm at internetalias.net (Tommy Butler)
Date: Thu, 12 Dec 2013 15:53:59 -0600
Subject: [DFW.pm] Hackathon Rules and Participation
Message-ID: <52AA3077.9060800@internetalias.net>
/*Sorry for the length of the email*/, but since this is a formal contest
(and one in which interest is growing even outside DFW), I need to
clarify some things for people who didn't make it to the meeting, either
online or offline, last night.
Here goes...
The DFW Perl Mongers Winter of Code Deduplication Hackathon
Participation
Any Perl Monger anywhere may participate so long as he/she is vouched
for by their PM group leader, a prominent Perl community member, or a
CPAN author with a module released prior to this contest.
Hackathon Server Accounts
A testing/development contest server is being provided. Everyone who
wants to participate will get an SSH login and optionally a perlbrew if
they ask for it. They can also install their own brew. Anyone who
wants an account should send their public SSH key to
dfwpm_at_internetalias_dot_net. Password-based logins won't be
allowed. This server will give everyone the chance to develop their
code for 1 month in the same environment in which it will be benchmarked
during the formal head-to-head contest on January 8th (which will be
broadcast live in a Google Hangout as usual, so physical presence at the
Dallas Makerspace isn't necessary).
1GB disk space can be consumed per participant; space consumption will
be monitored as will bandwidth consumption. ("Don't be a jerk").
Contestants should rely on github for code storage because I will wipe
and recreate the server before the actual contest. Disk storage should
therefore be considered volatile and git should be leveraged as the
mechanism for data and code that contestants want to be persistent
across the server rebuild.
Environment
The server will be running Debian Linux 7, stable branch Wheezy. It
will be hosted on port 2222 at perl.atrixnet.com and a security lockout
mechanism is in place for four failed logins in a row (i.e., don't try
to log in before I set up your key).
As stated above, a full rebuild of the server will happen a couple days
or so before the live contest at next month's meeting on humpday January
8th in order to ensure fairness and prevent foul play. When the contest
server comes back online I will restore everyone's code via the cloning
of their repo, but their system logins will not be restored -- no one
will have access to the competition server at that time except the
judges (Tommy Butler and John Fields). David Golden of NY.pm and
Patrick Michaud of our local group are honorary judges.
Conduct
In the spirit of our community I only ask that no one do or try to do
anything unethical, malicious, unfair, or abusive on the server --
including being a resource hog. Basically any rules that apply at a
YAPC event apply to this hackathon, as do the dictates of common sense
and decency :)
Test Data
The test data will be on a volume mounted read-only at
/dedup. The deduplication code from each contestant is simply required
to accurately detect all duplicate files randomly generated in a 100
gigabyte mass of also randomly generated files and directories. The
individual files and directories will have random names as well. Detect
the duplicate files -- It's that simple. The volume will have both
symlinks and hardlinks and code will need to correctly handle that. If
code relies on heuristics of the data volume in order to achieve
performance improvements, the author of such code will be disappointed;
the random data will be randomly regenerated again before the contest
and will not have the same number of symlinks/hardlinks, files, same
filenames, file sizes, directories or directory depths. All contestants
will be running their code against the same data volume.
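As a starting point, the task described above can be sketched in Perl as a two-pass scan: bucket files by size, then compare content digests only within buckets that hold more than one file. This is a minimal sketch and not the contest's reference implementation. The sub name `find_duplicate_groups` is hypothetical, the choice of MD5 is an assumption, and the sketch deliberately ignores hard links (it would report two hard links as duplicates) and never follows symlinks.

```perl
use strict;
use warnings;
use File::Find ();
use Digest::MD5 ();

# Hypothetical helper: returns a list of array refs, each holding the
# sorted paths of regular files under $root that share identical content.
# Hard links are NOT collapsed here; that is left to the caller.
sub find_duplicate_groups {
    my ($root) = @_;

    # Pass 1: bucket regular files by size. A file with a unique size
    # cannot have a duplicate, so it never needs to be read.
    my %by_size;
    File::Find::find(
        {
            no_chdir => 1,
            wanted   => sub {
                my $path = $File::Find::name;
                my @st = lstat $path or return;   # lstat: do not follow symlinks
                return unless -f _;               # regular files only
                push @{ $by_size{ $st[7] } }, $path;
            },
        },
        $root,
    );

    # Pass 2: digest only the files that share a size with another file.
    my @groups;
    for my $paths ( grep { @$_ > 1 } values %by_size ) {
        my %by_digest;
        for my $path (@$paths) {
            open my $fh, '<:raw', $path or next;  # skip unreadable files
            my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
            push @{ $by_digest{$digest} }, $path;
        }
        push @groups, map { [ sort @$_ ] } grep { @$_ > 1 } values %by_digest;
    }
    return @groups;
}
```

Calling `find_duplicate_groups('/dedup')` would return one array ref per set of identical files. The sketch only reads from disk, which keeps it within the no-writes rule.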
Contest Rules
...Are founded on the information in the slide presentation.
All code will be tested against the same Perl (the latest available
stable version before the contest, likely 5.18.0) and the
best-out-of-two benchmark time will be used for each participant
(because of hardware-based CPU and disk caching that we can't prevent).
No code should write to disk or in-memory filesystems such as /run/shm.
If we catch the code doing writes, it is automatic grounds for
disqualification. This is to prevent results-caching between runs.
Participants should disclose ahead of time what top-level CPAN modules
they need installed for the correct operation of their code. The code
and contestants should conform to the rules set forth in the slides
from last night's Perl Mongers meeting. The code will be reviewed by
the judge panel prior to execution. Unintelligible or obfuscated code
won't be accepted. Read the slides for further information.
Rules are subject to amendment by a majority vote of the judging panel,
in the event that rules must be modified to ensure fairness or fix
problems that arise which prevent smooth, convenient execution of the
contest. Your suggestions are always welcome.
Please let us know your thoughts and remember to mail in your public SSH
key to dfwpm_at_internetalias_dot_net if you are going to participate.
--Tommy Butler, John Fields
From dfwpm at internetalias.net Wed Dec 18 14:50:01 2013
From: dfwpm at internetalias.net (Tommy Butler)
Date: Wed, 18 Dec 2013 16:50:01 -0600
Subject: [DFW.pm] Hackathon Rules and Participation
In-Reply-To: <52AA3077.9060800@internetalias.net>
References: <52AA3077.9060800@internetalias.net>
Message-ID: <52B22699.502@internetalias.net>
UPDATE: now participating in addition to DFW.pm are members of the
*Philadelphia*, *New York*, and *Atlanta* Perl Mongers groups. We're
nation-wide. I'd like to see some international participation too, so
please spread the word and post this link on your blog/social media
stream: http://perlmonks.org/?node_id=1067570
PS - Today someone requested that emacs be installed on the free-to-use
dev/contest server. Alas, I obliged. If anyone needs a particular
package and/or wants to discuss it off-list, you can just email dfwpm at
internetalias dot net.
PPS - Reminder that formal contest rules are available at http://dfw.pm.org
--Tommy Butler
On 12/12/2013 03:53 PM, Tommy Butler wrote:
> /*Sorry for the length of the email*/, but being a formal contest (and
> one in which increasing interest is growing even outside DFW), I need
> to clarify some things for people who didn't make it to the meeting
> either on or off line last night.
>
> Here goes...
>
>
> The DFW Perl Mongers Winter of Code Deduplication Hackathon
>
>
>
From dfwpm at internetalias.net Wed Dec 18 15:54:48 2013
From: dfwpm at internetalias.net (Tommy Butler)
Date: Wed, 18 Dec 2013 17:54:48 -0600
Subject: [DFW.pm] Dedup Contest - SSH Access Tip
In-Reply-To:
References:
Message-ID: <52B235C8.3030909@internetalias.net>
This tip comes from one of our contest participants:
"Just in case it would help others, here is the chunk I just added to
my ~/.ssh/config file, to make sure that I never err with the wrong port
number:"
# Dallas Fort Worth Perl Mongers - disk deduplication contest 2013-12-18
Host dfw
    Hostname perl.atrixnet.com
    Port 2222
    User PUT YOUR USERNAME HERE
    PreferredAuthentications publickey
    IdentityFile ~/.ssh/id_rsa
    IdentitiesOnly yes
--Tommy Butler
From dfwpm at internetalias.net Thu Dec 19 08:04:05 2013
From: dfwpm at internetalias.net (Tommy Butler)
Date: Thu, 19 Dec 2013 10:04:05 -0600
Subject: [DFW.pm] OT - Perl/Sysadmin opening at my company
Message-ID: <52B318F5.4040700@internetalias.net>
As is our policy, we don't accept "recruiter spam" on our list, but
share meaningful professional networking leads.
My company in Irving, TX is looking to fill a position in the near
future for someone with Unix experience who has at least basic Perl
scripting skills or better.
Interested parties follow up to dfwpm at internetalias dot net.
--Tommy Butler
From dfwpm at internetalias.net Thu Dec 19 14:55:00 2013
From: dfwpm at internetalias.net (Tommy Butler)
Date: Thu, 19 Dec 2013 16:55:00 -0600
Subject: [DFW.pm] Hackathon Just Got Real
Message-ID: <52B37944.7050009@internetalias.net>
It's kind of official now. Our hackathon competition just made the
front page news feed on perl.org
--Tommy Butler, John Fields
From dfwpm at internetalias.net Fri Dec 20 08:55:25 2013
From: dfwpm at internetalias.net (Tommy Butler)
Date: Fri, 20 Dec 2013 10:55:25 -0600
Subject: [DFW.pm] A Warm Welcome & I/O Niceness Tip
Message-ID: <52B4767D.1000705@internetalias.net>
A warm welcome to our newest list members, friends from other Perl
Monger groups around the globe. Recently joining us as part of the
hackathon are some of our neighbors from New York City, Brooklyn,
Philadelphia, Atlanta, Chicago, and Sydney Australia.
If you still need an ssh account set up on the competition server, check
out dfw.pm.org for details.
And finally, a tip for those running code that does rapid-fire reading
on the filesystem: please consider running it through ionice. It's the
nice thing to do.
Example (from the shell prompt):
    $ nice -n 19 ionice -c2 -n7 ./your_perl_program.pl
--Tommy Butler
From dfwpm at internetalias.net Mon Dec 23 11:28:57 2013
From: dfwpm at internetalias.net (Tommy Butler)
Date: Mon, 23 Dec 2013 13:28:57 -0600
Subject: [DFW.pm] hard links are not dupes!
Message-ID: <52B88EF9.5080401@internetalias.net>
We've had a lot of list sign-ups lately in relation to the
hackathon, so this is just a reminder of things already discussed in our
last meeting which you may have missed.
* You are deduping an ext4 filesystem and that's all we're saying
about it. You have server access so if you want to poke it, you can ;-)
* /*The test data you are working with is peppered randomly with hard
and soft links.*/
* Hard links to files are not duplicates. They point to the same
underlying storage and should therefore be considered as files
already de-duped. We are aware that technically hard links _are_
files in and of themselves, but they are metadata and not storage.
You'll have to decide how to optimize for this. It's a very tricky
tradeoff. Don't base your decision on how frequently links occur in
the test dataset; the final dataset will not be identical and is not
even guaranteed to be similar.
* Symlinks are also NOT duplicates.
* Your code is indeed going to face /directory/ symlinks as well as
file links, so you'll need to take care not to get stuck in
directory recursion loops.
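One way to avoid the directory recursion loops mentioned in the last bullet is to remember the (device, inode) pair of every directory already visited. The following is a minimal sketch under stated assumptions, not the organizers' method: the sub name `walk_without_loops` is hypothetical, and it assumes you choose to follow directory symlinks at all. If you never follow them, an lstat-based skip is enough. File symlinks are skipped because they are not duplicates.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the regular files reachable from $root,
# following directory symlinks but visiting each real directory only once.
sub walk_without_loops {
    my ($root) = @_;
    my ( %seen_dir, @files );
    my @queue = ($root);

    while ( defined( my $dir = shift @queue ) ) {
        my @st = stat $dir or next;              # stat follows symlinks
        next if $seen_dir{"$st[0]:$st[1]"}++;    # directory already visited

        opendir my $dh, $dir or next;
        for my $entry ( sort grep { $_ ne '.' && $_ ne '..' } readdir $dh ) {
            my $path = "$dir/$entry";
            if    ( -d $path )              { push @queue, $path }
            elsif ( -f $path && !-l $path ) { push @files, $path }
        }
        closedir $dh;
    }
    return @files;
}
```

A caveat of this design: a directory that is reachable both directly and through a symlink is reported under whichever path the walk reaches first, so the reported names may need normalizing afterwards.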
--
Tommy Butler, John Fields
From dfwpm at internetalias.net Mon Dec 23 17:18:21 2013
From: dfwpm at internetalias.net (Tommy Butler)
Date: Mon, 23 Dec 2013 19:18:21 -0600
Subject: [DFW.pm] contest optimization strategy clarifications - BRING IT
Message-ID: <52B8E0DD.6010109@internetalias.net>
Hi,
With all the recent list sign-ups we've had some questions raised
off-list. I'd like to address these one time, for the group, instead of
one-by-one. The questions:
1. I want to write to a *destroyed-on-exit* in-memory database (SQLite)
or *destroyed-on-exit* tied hash (BDB) = THIS IS OK
2. My code depends on modules that write temp files that persist
between executions = THIS IS BENT*
3. My code requires a C compiler on the system = THIS IS BENT**
If your code/design looks like item number 1 in the list above, we're
not so concerned about your tied hash writing to the filesystem or
/dev/shm because we've decided to completely roll back the server before
every code execution that happens. Yup. It will take mere seconds to
roll back the server to its state before any code ran. Because of this,
we still discourage other kinds of disk-writes; we'd rather not deviate
from rules around which existing code has already been designed. We're
just going to make darn sure that any sneaky disk writes are completely
non-existent between your test runs. Fairness must be assured.
**Now if your code falls into a "THIS IS BENT" category, you're still
welcome to compete, and even win, but _you'll be doing so in a separate
competition category_, simply called "rule benders".
Why allow rule benders? Because we still want to see how fast things
can go. We asked for a Pure Perl solution, with the only exception
being that your code could depend on XS-based (that means compiled C
extensions) modules from the CPAN that were released /prior/ to the
beginning of the hackathon. Strangely enough, those rules get blurry
when you start using CPAN modules that depend on Inline::C or that
otherwise need access to a C compiler on the system at the time when the
code runs.
To John and me, this is a type of code optimization that isn't based on
Perl, but instead based on C. You can argue how much of it is Perl and
how much of it is C in terms of lines of code in one language or the
other, but you can't easily prove how much of the performance
gain was Perl-based and how much of it was XS-based, and we aren't
going to NYTProf your code just to find out.
What it comes down to in terms of fairness is that Perl code which might
have lost the competition on its own will then have won by virtue of the
inclusion of low-level optimizations that just aren't in keeping with
the spirit of the contest as it was intended -- but which are still
awesome! We'd like to see what can be achieved through this kind of
go-baby-go, nitrous-oxide-injected, turbocharged Perl, but in your own
category of competition.
So go for it. If you want to be a bender, let's see what you've got.
Bring it, benders.
--Tommy Butler, John Fields
From dfwpm at internetalias.net Tue Dec 24 10:41:43 2013
From: dfwpm at internetalias.net (Tommy Butler)
Date: Tue, 24 Dec 2013 12:41:43 -0600
Subject: [DFW.pm] example deduplication code and full disclosure
Message-ID: <52B9D567.2030204@internetalias.net>
Full disclosure: I'm not competing in the contest as John and I are
hosting it and have written the code that generates the random dataset.
However I wrote some example code that does work and which I'd like to
share to help give others a gentle push if anyone is having trouble
getting started.
Feel free to steal/fork/laugh at the code as much as you like. The code
isn't extensively commented but it is very readable. It's also simple
and concise and makes use of CPAN modules, some of which use XS code to
get performance gains -- which is within the rules for the "traditional
Perl solution" competition category.
One provision is that my code purposely does not solve all the
problems. In particular it doesn't handle hard links. That's up to you
to solve.
https://github.com/tommybutler/dupfind
--Tommy Butler
From dfwpm at internetalias.net Thu Dec 26 19:24:35 2013
From: dfwpm at internetalias.net (Tommy Butler)
Date: Thu, 26 Dec 2013 21:24:35 -0600
Subject: [DFW.pm] But what about Perl 6?
Message-ID: <52BCF2F3.801@internetalias.net>
It was mentioned to me tonight that a Perl 6 competition category seemed
to be missing from the deduplication hackathon competition. Indeed.
If you want to submit an entry in Perl 6, it will eagerly be accepted
and pitted against any other Perl 6 submissions. If you are the only
one to submit something in Perl 6, you win the Perl 6 competition
category. *Is there anyone who would be interested in this?*
--Tommy Butler, John Fields
From robertbrucegray3 at gmail.com Thu Dec 26 20:18:39 2013
From: robertbrucegray3 at gmail.com (Bruce Gray)
Date: Thu, 26 Dec 2013 22:18:39 -0600
Subject: [DFW.pm] But what about Perl 6?
In-Reply-To: <52BCF2F3.801@internetalias.net>
References: <52BCF2F3.801@internetalias.net>
Message-ID:
I already have a Perl 6 solution coded.
On Dec 26, 2013 9:24 PM, "Tommy Butler" wrote:
> It was mentioned to me tonight that a Perl 6 competition category seemed
> to be missing from the deduplication hackathon competition. Indeed.
>
> If you want to submit an entry in Perl 6, it will eagerly be accepted and
> pitted against any other Perl 6 submissions. If you are the only one to
> submit something in Perl 6, you win the Perl 6 competition category. *Is
> there anyone who would be interested in this?*
>
> --Tommy Butler, John Fields
>
From dfwpm at internetalias.net Thu Dec 26 20:25:28 2013
From: dfwpm at internetalias.net (Tommy Butler)
Date: Thu, 26 Dec 2013 22:25:28 -0600
Subject: [DFW.pm] But what about Perl 6?
In-Reply-To:
References: <52BCF2F3.801@internetalias.net>
Message-ID: <52BD0138.7070403@internetalias.net>
AWESOME =)
Do you want a custom-compiled Rakudo, or is the current stable release
(according to the Debian Rakudo Maintainers) ok with you? It's version
0.1~2012.01-1
All opinions welcome.
--Tommy Butler
On 12/26/2013 10:18 PM, Bruce Gray wrote:
>
> I already have a Perl 6 solution coded.
>
> On Dec 26, 2013 9:24 PM, "Tommy Butler" wrote:
>
> It was mentioned to me tonight that a Perl 6 competition category
> seemed to be missing from the deduplication hackathon
> competition. Indeed.
>
> If you want to submit an entry in Perl 6, it will eagerly be
> accepted and pitted against any other Perl 6 submissions. If you
> are the only one to submit something in Perl 6, you win the Perl 6
> competition category. *Is there anyone who would be interested
> in this?*
>
> --Tommy Butler, John Fields
>
From dfwpm at internetalias.net Sat Dec 28 09:55:28 2013
From: dfwpm at internetalias.net (Tommy Butler)
Date: Sat, 28 Dec 2013 11:55:28 -0600
Subject: [DFW.pm] server advisory, and good manners
Message-ID: <52BF1090.1000601@internetalias.net>
I am going to try out a version of my code that is multi-threaded (and
no, I'm not competing with you; I'm just working on speeding up the code
that will be used to validate the results your own code produces). The
code is still on github in the aforementioned location. The reason I'm
sending this message is because the multi-threaded version may run the
server out of memory and/or lock it up. I have not been able to fix
that yet. It works fine on 30GB of data, but this will be the first
time I try it on 100GB. I will reboot the server if I lock it up, with
my apologies. I will send out another message right before I run the code.
If you have code that you know puts the stability of the server at risk,
it's not a problem. Just mail the list first before you run it so
others can be aware and so I can reboot if needed.
It may also be a good idea to advise the list before doing a big test
run so you don't throw other benchmarks off for logged-in users who are
also executing code.
You can check if other users are logged in by typing the `w` command.
Use the `top` or `htop` commands to see if other users are testing
code. Just look for "perl" or "rakudo" in the process table, or check
the processes based on username.
If you want to send a broadcast message to other users on the server
before you run your code, do so like this:
echo "hello all, I am about to run my code, --Jane Contestant" | wall
--Tommy Butler
From dfwpm at internetalias.net Sat Dec 28 12:35:18 2013
From: dfwpm at internetalias.net (Tommy Butler)
Date: Sat, 28 Dec 2013 14:35:18 -0600
Subject: [DFW.pm] server advisory, and good manners
In-Reply-To: <52BF1090.1000601@internetalias.net>
References: <52BF1090.1000601@internetalias.net>
Message-ID: <52BF3606.7090305@internetalias.net>
OK, I'm running the code now. No one is logged in at present.
--Tommy Butler
On 12/28/2013 11:55 AM, Tommy Butler wrote:
> I am going to try out a version of my code that is multi-threaded (and
> no, I'm not competing with you; I'm just working on speeding up the
> code that will be used to validate the results your own code
> produces). The code is still on github in the aforementioned
> location. The reason I'm sending this message is because the
> multi-threaded version may run the server out of memory and/or lock it
> up. I have not been able to fix that yet. It works fine on 30GB of
> data, but this will be the first time I try it on 100GB. I will
> reboot the server if I lock it up, with my apologies. I will send out
> another message right before I run the code.
>
> If you have code that you know puts the stability of the server at
> risk, it's not a problem. Just mail the list first before you run it
> so others can be aware and so I can reboot if needed.
>
> It may also be a good idea to advise the list before doing a big test
> run so you don't throw other benchmarks off for logged-in users who
> are also executing code.
>
> You can check if other users are logged in by typing the `w` command.
> Use the `top` or `htop` commands to see if other users are testing
> code. Just look for "perl" or "rakudo" in the process table, or check
> the processes based on username.
>
> If you want to send a broadcast message to other users on the server
> before you run your code, do so like this:
>
> echo "hello all, I am about to run my code, --Jane Contestant" | wall
>
> --Tommy Butler
From dfwpm at internetalias.net Sat Dec 28 19:37:09 2013
From: dfwpm at internetalias.net (Tommy Butler)
Date: Sat, 28 Dec 2013 21:37:09 -0600
Subject: [DFW.pm] what is a hard link,
and what should my deduper do with them?
Message-ID: <52BF98E5.8060505@internetalias.net>
What is a hard link? --> http://www.linfo.org/hard_link.html
Because of the nature of hard links, there's no way to know which hard
link existed first or which one is to be considered the "original" file,
because they point to the same underlying storage which has only one
lastmod/atime/mtime timestamp set.
As such, the official rule on the matter is that the
asciibetically-first fully-qualified file name will be considered the
original, while the other hard links should be considered, as already
stated in the rules, "files already deduped". The reason for this is
strictly for output and reporting consistencies. (We need this to
maintain a standard baseline output format, which I'll set forth in
another email coming soon).
SCENARIO:
The three files below have identical content:
/foo/bar/baz.txt -> ( inode 12345 )
/foo/car/daz.txt -> ( inode 12345 )
/foo/far/gaz.txt -> ( inode 67890 )
OUTCOME:
/foo/far/gaz.txt should be reported as a duplicate of /foo/bar/baz.txt
because /foo/bar/baz.txt comes before /foo/car/daz.txt in a sort and
because /foo/car/daz.txt is a hard link.
CODE:
I'm doing it like this. This is just an unoptimized example. TIMTOWTDI.
    # This automatically throws out all but one hard link; the only
    # surviving file name is the first asciibetically-sorted entry.
    # The device and inode are joined with ':' so that, for example,
    # (1, 23) and (12, 3) cannot produce the same key.
    my %dev_inodes;
    $dev_inodes{ join ':', ( stat $_ )[0,1] } = $_
        for reverse sort @group_of_same_size_files_by_name;

    # Don't keep working if there's nothing to compare.
    next if scalar( keys %dev_inodes ) == 1;

    for my $file ( values %dev_inodes )
    {
        # do stuff to figure out which of the same-size files are duplicates...
    }
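The hard-link rule and the scenario above can also be tried out in a standalone form. This is a minimal sketch assuming nothing beyond core Perl; the sub name `collapse_hard_links` is hypothetical and is not taken from the dupfind repository. It keeps the asciibetically-first name for each (device, inode) pair and drops the other hard links.

```perl
use strict;
use warnings;

# Hypothetical helper: given file names, keep one name per (device, inode)
# pair, namely the asciibetically-first one. Returns the survivors, sorted.
sub collapse_hard_links {
    my (@names) = @_;
    my %first_name_for;

    # Walk the names in reverse sorted order so that the last assignment
    # for each inode is the asciibetically-first name.
    for my $name ( reverse sort @names ) {
        my ( $dev, $ino ) = stat $name or next;   # skip names that vanished
        $first_name_for{"$dev:$ino"} = $name;
    }
    return sort values %first_name_for;
}
```

Applied to the scenario above, the three names collapse to /foo/bar/baz.txt and /foo/far/gaz.txt, and only those two then need a content comparison.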
--Tommy Butler
From joel.a.berger at gmail.com Sun Dec 29 11:32:38 2013
From: joel.a.berger at gmail.com (Joel Berger)
Date: Sun, 29 Dec 2013 13:32:38 -0600
Subject: [DFW.pm] disk-read buffering?
Message-ID:
Hi all,
As I am testing, the first time I run my script it takes significantly
longer than subsequent runs, soon afterwards. Is there some amount of
disk-buffering happening? Can we control this for a consistent outcome?
Thanks,
Joel Berger
From dfwpm at internetalias.net Mon Dec 30 06:57:55 2013
From: dfwpm at internetalias.net (Tommy Butler)
Date: Mon, 30 Dec 2013 08:57:55 -0600
Subject: [DFW.pm] disk-read buffering?
In-Reply-To:
References:
Message-ID: <52C189F3.3090605@internetalias.net>
I'm surprised you're seeing a large difference; I observe the effects of
the hardware disk-based caching myself but I don't see a big difference
between run 1 and 2. I can't really turn that off, and wouldn't want to
because doing so would alter the real-world scenario of running your
code in the wild. Nevertheless, to keep this from happening to you, run
this before you run your Perl app:
find /dedup >/dev/null
This is precisely the reason why we are taking the best time out of 2
runs when each contestant's code is benchmarked. I will run the above
command every time before benchmarking any code to assure fairness.
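For anyone who wants to reproduce the best-out-of-two timing at home, here is a minimal sketch. It is an assumption about how such a measurement could be taken, not the judges' actual benchmarking harness, and the sub name `best_time_of` is hypothetical.

```perl
use strict;
use warnings;
use Time::HiRes ();

# Hypothetical helper: runs $code $runs times and returns the fastest
# wall-clock time in seconds.
sub best_time_of {
    my ( $code, $runs ) = @_;
    my $best;
    for ( 1 .. $runs ) {
        my $start = Time::HiRes::time();
        $code->();
        my $elapsed = Time::HiRes::time() - $start;
        $best = $elapsed if !defined $best || $elapsed < $best;
    }
    return $best;
}

# Example call (the script name is a placeholder):
#   my $seconds = best_time_of( sub { system $^X, 'your_perl_program.pl' }, 2 );
```

Taking the minimum rather than the mean means a slow first run caused by a cold cache does not count against the result.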
--Tommy Butler
On 12/29/2013 01:32 PM, Joel Berger wrote:
> Hi all,
>
> As I am testing, the first time I run my script it takes significantly
> longer than subsequent runs, soon afterwards. Is there some amount of
> disk-buffering happening? Can we control this for a consistent outcome?
>
> Thanks,
>
> Joel Berger
From robert.eaglestone at gmail.com Mon Dec 30 12:38:39 2013
From: robert.eaglestone at gmail.com (Rob Eaglestone)
Date: Mon, 30 Dec 2013 14:38:39 -0600
Subject: [DFW.pm] example deduplication code and full disclosure
In-Reply-To: <52B9D567.2030204@internetalias.net>
References: <52B9D567.2030204@internetalias.net>
Message-ID:
That's beautiful Perl, Tommy. I see I will have to replace my old
"Programming Perl" book with one updated for the 2000s (mine is dated 1996,
and shows Perl at version 5.003).
On Tue, Dec 24, 2013 at 12:41 PM, Tommy Butler wrote:
> Full disclosure: I'm not competing in the contest as John and I are
> hosting it and have written the code that generates the random dataset.
>
> However I wrote some example code that does work and which I'd like to
> share to help give others a gentle push if anyone is having trouble getting
> started.
>
> Feel free to steal/fork/laugh at the code as much as you like. The code
> isn't extensively commented but it is very readable. It's also simple and
> concise and makes use of CPAN modules, some of which use XS code to get
> performance gains -- which is within the rules for the "traditional Perl
> solution" competition category.
>
> One provision is that my code purposely does not solve all the problems.
> In particular it doesn't handle hard links. That's up to you to solve.
>
> https://github.com/tommybutler/dupfind
>
> --Tommy Butler
From dfwpm at internetalias.net Mon Dec 30 13:22:49 2013
From: dfwpm at internetalias.net (Tommy Butler)
Date: Mon, 30 Dec 2013 15:22:49 -0600
Subject: [DFW.pm] example deduplication code and full disclosure
In-Reply-To:
References: <52B9D567.2030204@internetalias.net>
Message-ID: <52C1E429.1000406@internetalias.net>
Wow! I'll take that compliment, sir! "Greater love no man hath than he
that complimenteth the Perl of his friends" (or something like that, right?)
*One thing to note for anyone trying out that code: it just got some
important bug fixes, so you'll want to git pull*
--Tommy Butler
On 12/30/2013 02:38 PM, Rob Eaglestone wrote:
> That's beautiful Perl, Tommy. I see I will have to replace my old
> "Programming Perl" book with one updated for the 2000s (mine is dated
> 1996, and shows Perl at version 5.003).
>
>
> On Tue, Dec 24, 2013 at 12:41 PM, Tommy Butler wrote:
>
> Full disclosure: I'm not competing in the contest as John and I
> are hosting it and have written the code that generates the random
> dataset.
>
> However I wrote some example code that does work and which I'd
> like to share to help give others a gentle push if anyone is
> having trouble getting started.
>
> Feel free to steal/fork/laugh at the code as much as you like.
> The code isn't extensively commented but it is very readable.
> It's also simple and concise and makes use of CPAN modules, some
> of which use XS code to get performance gains -- which is within
> the rules for the "traditional Perl solution" competition category.
>
> One provision is that my code purposely does not solve all the
> problems. In particular it doesn't handle hard links. That's up
> to you to solve.
>
> https://github.com/tommybutler/dupfind
>
> --Tommy Butler
>
From tmetro+dfw-pm at gmail.com Mon Dec 30 14:23:58 2013
From: tmetro+dfw-pm at gmail.com (Tom Metro)
Date: Mon, 30 Dec 2013 17:23:58 -0500
Subject: [DFW.pm] disk-read buffering?
In-Reply-To:
References:
Message-ID: <52C1F27E.40104@gmail.com>
Joel Berger wrote:
> ...the first time I run my script it takes significantly
> longer than subsequent runs, soon afterwards.
Tommy Butler wrote:
> run this before you run your Perl app:
>
> find /dedup >/dev/null
That'll help cause the metadata to be cached, but not the file data.
> This is precisely the reason why we are taking the best time out of 2
> runs when each contestant's code is benchmarked. I will run the above
> command every time before benchmarking any code to assure fairness.
I thought there was going to be a server reboot/reset/rebuild between runs.
The closest thing to real-world is having a fully empty cache, but I
can't see any way that can be accomplished during development on a
shared server.
The next best thing for consistent results (so you can do relative
comparisons) would be seeding the cache, but using other tools will only
approximate the needs of your dedupe code. Probably your best bet when
testing is to just plan on running multiple times, and realize that the
first run will more closely approximate the competition run in terms of
actual time, and use subsequent runs for relative comparisons.
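If you do want to seed the cache with file data as well as metadata, one option is to read every file once before timing. The sketch below is only an approximation of what a dedupe run touches; warm_cache is a hypothetical helper name and the 1 MiB read size is an arbitrary choice.

```perl
use strict;
use warnings;
use File::Find;

# Read every regular file under $dir once so that its data, not just its
# metadata, passes through the OS cache. Returns the total bytes read.
# warm_cache() is a hypothetical helper for local testing only.
sub warm_cache {
    my ($dir) = @_;
    my $total = 0;
    find({
        no_chdir => 1,
        wanted   => sub {
            # Regular files only; do not follow symlinks.
            return unless -f $_ && !-l $_;
            open my $fh, '<:raw', $_ or return;   # skip unreadable files
            while (my $n = read($fh, my $buf, 1 << 20)) {
                $total += $n;
            }
            close $fh;
        },
    }, $dir);
    return $total;
}

# Example: my $bytes = warm_cache('/dedup');
```

Whether the data stays cached depends on available memory and on what other users of the machine are doing, so treat the timings that follow as relative numbers only.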
Ideally while testing you should be benchmarking small portions of your
code, so the cache will fill on the first run, and you have a good
chance they'll remain populated for several subsequent runs, despite
other users on the system hitting other files.
-Tom
--
Tom Metro
The Perl Shop, Newton, MA, USA
"Predictable On-demand Perl Consulting."
http://www.theperlshop.com/
From tmetro+dfw-pm at gmail.com Mon Dec 30 14:28:56 2013
From: tmetro+dfw-pm at gmail.com (Tom Metro)
Date: Mon, 30 Dec 2013 17:28:56 -0500
Subject: [DFW.pm] what is a hard link,
and what should my deduper do with them?
In-Reply-To: <52BF98E5.8060505@internetalias.net>
References: <52BF98E5.8060505@internetalias.net>
Message-ID: <52C1F3A8.3020007@gmail.com>
Tommy Butler wrote:
> ...other hard links should be considered, as already
> stated in the rules, "files already deduped".
>
> SCENARIO:
>
> The three files below have identical content:
> /foo/bar/baz.txt -> ( inode 12345 )
> /foo/car/daz.txt -> ( inode 12345 )
> /foo/far/gaz.txt -> ( inode 67890 )
>
>
> OUTCOME:
>
> /foo/far/gaz.txt should be reported as a duplicate of /foo/bar/baz.txt
> because /foo/bar/baz.txt comes before /foo/car/daz.txt in a sort and
> because /foo/car/daz.txt is a hard link.
So then the output might look like:
/foo/bar/baz.txt /foo/far/gaz.txt
while /foo/car/daz.txt is simply eliminated from consideration and not
output at all?
The problem with this approach, if you are striving for a useful tool
and not just a programming exercise, is that you don't know which of the
aliases is the name most familiar to the user who will be reviewing the
report.
Another possibility might be to report hardlinks in a way that visually
groups them together, then any place one member of a hardlink would
appear in the output, you replace it with the group:
(/foo/bar/baz.txt /foo/car/daz.txt) /foo/far/gaz.txt
(With members of the group being sub-sorted asciibetically, and the
first member of the group being used as the key when sorting the overall
list of duplicates.)
But this is still not quite ideal. This implies that you ignore
collections of hardlinks that don't also have a duplicate file. Chances
are good if the user is interested in duplicates, they're also
interested to know about what hardlinks (aliases) exist.
Plus, most characters you choose for grouping could potentially be part
of the file name, although the same could be said for the space delimiters.
So instead, you could simply produce a report of hardlinks at the end,
and any place a file appears in a duplicate report that has multiple
aliases, you always show the asciibetically first name:
Duplicates:
/foo/bar/baz.txt /foo/far/gaz.txt
...
Aliases:
/foo/bar/baz.txt /foo/car/daz.txt
...
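A minimal sketch of how the alias bookkeeping could work, assuming hard links are detected by comparing the device and inode numbers that stat returns. The sub names are hypothetical and this is not taken from the reference code.

```perl
use strict;
use warnings;

# Group paths by (device, inode). Paths that share a key are hard links
# to the same file. Returns a hash ref: "dev:ino" => [ sorted names ].
sub group_by_inode {
    my (@paths) = @_;
    my %names_for;
    for my $path (@paths) {
        my ($dev, $ino) = stat $path or next;   # skip files that vanished
        push @{ $names_for{"$dev:$ino"} }, $path;
    }
    $_ = [ sort { $a cmp $b } @$_ ] for values %names_for;
    return \%names_for;
}

# One representative per inode: the asciibetically first name.
sub representatives {
    my ($names_for) = @_;
    return sort { $a cmp $b } map { $_->[0] } values %$names_for;
}

# Alias groups: every inode that has more than one name, ordered by the
# first name in each group.
sub alias_groups {
    my ($names_for) = @_;
    return sort { $a->[0] cmp $b->[0] }
           grep { @$_ > 1 } values %$names_for;
}
```

Only the representatives would then go on to the duplicate comparison, and the alias groups would be printed in their own section at the end.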
-Tom
--
Tom Metro
The Perl Shop, Newton, MA, USA
"Predictable On-demand Perl Consulting."
http://www.theperlshop.com/
From dfwpm at internetalias.net Mon Dec 30 14:32:44 2013
From: dfwpm at internetalias.net (Tommy Butler)
Date: Mon, 30 Dec 2013 16:32:44 -0600
Subject: [DFW.pm] disk-read buffering?
In-Reply-To: <52C1F27E.40104@gmail.com>
References:
<52C1F27E.40104@gmail.com>
Message-ID: <52C1F48C.8080900@internetalias.net>
On 12/30/2013 04:23 PM, Tom Metro wrote:
> I thought there was going to be a server reboot/reset/rebuild between
> runs.
It will.
> The closest thing to real-world is having a fully empty cache, but I
> can't see any way that can be accomplished during development on a
> shared server.
...and such should be the case, or as close to it as possible.
> Ideally while testing you should be benchmarking small portions of
> your code, so the cache will fill on the first run, and you have a
> good chance they'll remain populated for several subsequent runs,
> despite other users on the system hitting other files.
No need to worry too much about this. The server won't be 'shared' at
the time it's running the formal competition code benchmarks. It will
have been completely reverted to its state before the contest began.
Code will be cloned from github when it's time to run. Before each run,
the server will be rolled back.
It's wholly contained in a virtual machine for this reason. The
encasing hardware/host will be totally idle other than its task of
running the VM. We needed the ability to "reset" the contest server to
a pristine state for each contestant, and having a virtual machine made
perfect sense.
As advised from the beginning: code that depends on caching will be
self-limited for the above reasons.
--Tommy Butler
From tmetro+dfw-pm at gmail.com Mon Dec 30 14:41:56 2013
From: tmetro+dfw-pm at gmail.com (Tom Metro)
Date: Mon, 30 Dec 2013 17:41:56 -0500
Subject: [DFW.pm] zero length files
Message-ID: <52C1F6B4.4010807@gmail.com>
What about zero length files?
By one view, they are all duplicates of each other, as their content is
identical. By another view, the concept of duplication is moot, as they
have no content.
Likely most algorithms will treat all zero byte files as identical,
unless the code has a special case.
Should they be handled specially? If so, how? Ignored? Grouped together
in their own section of the report?
-Tom
--
Tom Metro
The Perl Shop, Newton, MA, USA
"Predictable On-demand Perl Consulting."
http://www.theperlshop.com/
From dfwpm at internetalias.net Mon Dec 30 14:44:07 2013
From: dfwpm at internetalias.net (Tommy Butler)
Date: Mon, 30 Dec 2013 16:44:07 -0600
Subject: [DFW.pm] zero length files
In-Reply-To: <52C1F6B4.4010807@gmail.com>
References: <52C1F6B4.4010807@gmail.com>
Message-ID: <52C1F737.1060308@internetalias.net>
For our purposes, they're dupes. Please report them as such.
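One way to honor that without spending time on checksums is to treat the zero-byte bucket as a duplicate group straight away. This is a sketch under that assumption; zero_length_dupes is a hypothetical helper name.

```perl
use strict;
use warnings;

# Every empty file has identical (empty) content, so two or more of them
# already form a duplicate group; no checksum is needed.
# Returns an array ref of sorted names, or undef if fewer than two exist.
sub zero_length_dupes {
    my (@paths) = @_;
    # -z _ reuses the stat result from the preceding -f test.
    my @empty = sort { $a cmp $b } grep { -f $_ && -z _ } @paths;
    return @empty > 1 ? \@empty : undef;
}
```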
--Tommy Butler
On 12/30/2013 04:41 PM, Tom Metro wrote:
> What about zero length files?
>
> By one view, they are all duplicates of each other, as their content is
> identical. By another view, the concept of duplication is moot, as they
> have no content.
>
> Likely most algorithms will treat all zero byte files as identical,
> unless the code has a special case.
>
> Should they be handled specially? If so, how? Ignored? Grouped together
> in their own section of the report?
>
> -Tom
>
From dfwpm at internetalias.net Mon Dec 30 15:26:32 2013
From: dfwpm at internetalias.net (Tommy Butler)
Date: Mon, 30 Dec 2013 17:26:32 -0600
Subject: [DFW.pm] what is a hard link,
and what should my deduper do with them?
In-Reply-To: <52C1F3A8.3020007@gmail.com>
References: <52BF98E5.8060505@internetalias.net> <52C1F3A8.3020007@gmail.com>
Message-ID: <52C20128.8050600@internetalias.net>
On 12/30/2013 04:28 PM, Tom Metro wrote:
> Tommy Butler wrote:
>> ...other hard links should be considered, as already
>> stated in the rules, "files already deduped".
>>
>> SCENARIO:
>>
>> The three files below have identical content:
>> /foo/bar/baz.txt -> ( inode 12345 )
>> /foo/car/daz.txt -> ( inode 12345 )
>> /foo/far/gaz.txt -> ( inode 67890 )
>>
>>
>> OUTCOME:
>>
>> /foo/far/gaz.txt should be reported as a duplicate of /foo/bar/baz.txt
>> because /foo/bar/baz.txt comes before /foo/car/daz.txt in a sort and
>> because /foo/car/daz.txt is a hard link.
>
> So then the output might look like:
> /foo/bar/baz.txt /foo/far/gaz.txt
YES! :)
> while /foo/car/daz.txt is simply eliminated from consideration and not
> output at all?
Yep.
> The problem with this approach, if you are striving for a useful tool
> and not just a programming exercise, is that you don't know which of
> the aliases is the name most familiar to the user who will be
> reviewing the report.
And just when I've about finalized the output spec and created an output
file to put up on the git repo for diffing ... this. :-)
You are right, insofar as we are working to develop a useful tool and
not create throw-away code to use once for a competition. I won't pick
nits over the likelihood of a real-world scenario where hardlinks exist
in Joe User's music collection. However for the sake of simplicity
we're not going to require contestants to go this extra mile at this
time. Everyone is free to implement an output format that reports hard
link groupings and to do so for unredeemable "bonus" points. Should
anyone, like me, want a useful tool when they are done with their code,
they should strive to make it as robust and feature-ful as possible
without sacrificing too much performance. After all, there is a
winnings category for code that provides the most comprehensive feature set.
As promised recently, a finalized "expected" output format (against
which the product of each contestant's code will be diff'd) is
forthcoming. It seems like a glaring oversight that this wasn't part of
the original rules specification. I don't fault myself too much for
this given the fact that the output format is to be so simple and up to
this point everyone is expected to be coding against the problem and not
the output. Watch for that email later on this evening.
Thanks Tom, and thanks all!
--Tommy Butler
From dfwpm at internetalias.net Mon Dec 30 18:40:52 2013
From: dfwpm at internetalias.net (Tommy Butler)
Date: Mon, 30 Dec 2013 20:40:52 -0600
Subject: [DFW.pm] Deduplication Hackathon: Formal Output Specification
Message-ID: <52C22EB4.8030303@internetalias.net>
For your deduplication hackathon code entry, the output of your Perl app
should be as follows:
1. Each grouping of duplicates should be sorted by filename and printed
all on one line, delimited by a tab character.
2. The lines of output should be sorted.
3. The sort you should use for both the lines of output and the file
name groupings themselves is: sort { $a cmp $b }
4. Any output leading up to a delimiter of 30 dashes on its own line
will be ignored. Any output coming after a second line comprised of
30 dashes is also ignored. These delimiter lines are optional if
your output is solely comprised of the sorted results and nothing
else. Otherwise, use the space to prefix your results with status
messages or a status indicator (progress bar, etc), and optionally
follow up your results with a summary of what your code
encountered. See example at bottom of message.
Your code can actually output whatever it wants, so long as there is a
way to call it where it produces output according to the spec as
outlined above.
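The four rules above could be sketched roughly as follows. The input shape (an array ref of array refs of file names) and the sub name format_robot_output are assumptions made for the example; the reference code on github may be organized differently.

```perl
use strict;
use warnings;

# Render duplicate groups per the rules above: each group sorted and
# tab-joined on one line, the lines sorted, all with sort { $a cmp $b },
# and the results wrapped in two delimiter lines of 30 dashes.
# $groups is a hypothetical input: [ [ 'name', 'name', ... ], ... ].
sub format_robot_output {
    my ($groups) = @_;
    my $delimiter = '-' x 30;
    my @lines = sort { $a cmp $b }
                map  { join "\t", sort { $a cmp $b } @$_ }
                @$groups;
    return join "\n", $delimiter, @lines, $delimiter, '';
}

# Example:
# print format_robot_output([ [ './foo', './bar', './baz' ] ]);
```

Status messages can be printed before the first delimiter and a summary after the second, since both regions are ignored.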
An example is provided in the lines below, and in the screenshot that
follows. This output is generated by the code as found on github at
https://github.com/tommybutler/dupfind
In just a few minutes I will put up on github (at the same url) the
correct output for the reference data that is currently on the contest
server under /dedup. *Please take time to compare your code output to
the output of the "reference design" code on github. If your output is
not identical, then you will be disqualified for producing incorrect
results.* If you believe the reference design is incorrect, then
please submit a bug report and/or a patch!!
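For comparing your own output with the reference output, something like the following sketch can strip the ignored regions first. The sub name extract_results is hypothetical, and treating a single delimiter as "everything after it counts" is an assumption, since the spec only describes zero or two delimiter lines.

```perl
use strict;
use warnings;

# Return the result lines from contest-style output: the lines between the
# first and second 30-dash delimiter, or every line if no delimiter appears.
sub extract_results {
    my ($output) = @_;
    my @lines = split /\n/, $output;
    # Indexes of lines made of exactly 30 dashes.
    my @marks = grep { $lines[$_] =~ /^-{30}\s*$/ } 0 .. $#lines;
    return @lines unless @marks;
    # Assumption: with only one delimiter, keep everything after it.
    my $end = @marks > 1 ? $marks[1] - 1 : $#lines;
    return @lines[ $marks[0] + 1 .. $end ];
}

# Example comparison of two captured outputs:
# my $same = join("\n", extract_results($mine))
#         eq join("\n", extract_results($reference));
```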
--Tommy Butler
------------------------------------------------------------------------
$ ./dupfind --format robot --dir .
** SCANNING ALL FILES
** CHECKSUMMING SIZE DUPLICATES
** DISPLAYING OUTPUT
------------------------------
./.git/logs/HEAD ./.git/logs/refs/heads/master
./.git/refs/heads/master ./.git/refs/remotes/origin/master
./bar ./baz ./foo
------------------------------
** TOTAL SCANNED: 86
** TOTAL DUPES: 4
** SCAN TIME: 0.00824308 wallclock secs ( 0.00 usr + 0.01 sys = 0.01 CPU)
** DELETION TIME: 0
------------------------------------------------------------------------
-------------- next part --------------
A non-text attachment was scrubbed...
Name: edihfghd.png
Type: image/png
Size: 490429 bytes
Desc: not available
URL:
From joakim.lagerqvist at gmail.com Tue Dec 31 02:19:11 2013
From: joakim.lagerqvist at gmail.com (Joakim Lagerqvist)
Date: Tue, 31 Dec 2013 21:19:11 +1100
Subject: [DFW.pm] Deduplication Hackathon: Formal Output Specification
In-Reply-To: <52C22EB4.8030303@internetalias.net>
References: <52C22EB4.8030303@internetalias.net>
Message-ID:
Hello Tommy,
On Tue, Dec 31, 2013 at 1:40 PM, Tommy Butler wrote:
> *Please take time to compare your code output to the output of the
> "reference design" code on github. If your output is not identical, then
> you will be disqualified for producing incorrect results. *
>
In your reference design "human" output, you have included the xxHash
value, is this needed for the contest? If another digest/method has been
used to identify the duplicates, it will not match up.
Cheers and happy new year,
Joakim
From dfwpm at internetalias.net Tue Dec 31 12:37:12 2013
From: dfwpm at internetalias.net (Tommy Butler)
Date: Tue, 31 Dec 2013 14:37:12 -0600
Subject: [DFW.pm] Deduplication Hackathon: Formal Output Specification
In-Reply-To:
References: <52C22EB4.8030303@internetalias.net>
Message-ID: <52C32AF8.6070709@internetalias.net>
You may use any hashing/mapping algorithm you like; xxhash was just a
personal choice given its emphasis on speed.
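To illustrate that the digest is interchangeable, here is a minimal sketch that groups files by content digest using the core Digest::MD5 module as a stand-in. It is not the reference implementation, and group_by_digest is a hypothetical name.

```perl
use strict;
use warnings;
use Digest::MD5;

# Group files by content digest. Digest::MD5 is only a stand-in here; any
# digest module with a comparable interface could be swapped in.
# Returns an array ref of groups, each an array ref of two or more paths.
sub group_by_digest {
    my (@paths) = @_;
    my %paths_for;
    for my $path (@paths) {
        open my $fh, '<:raw', $path or next;   # skip unreadable files
        my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
        close $fh;
        push @{ $paths_for{$digest} }, $path;
    }
    # Keep only digests shared by two or more files.
    return [ grep { @$_ > 1 } values %paths_for ];
}
```

Only files that already share a size need to be passed in, which keeps the amount of data read to a minimum.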
The "robot" format of output is what must match up with your own
output. The "human" format is strictly optional, and if you choose to
create a human-readable output format option for your own code, it can
look however you want it to.
Remember to email your ssh key to me if you want server access and the
ability to test your code against the reference data.
--Tommy Butler
On 12/31/2013 04:19 AM, Joakim Lagerqvist wrote:
> Hello Tommy,
>
> On Tue, Dec 31, 2013 at 1:40 PM, Tommy Butler wrote:
>
> */Please take time to compare your code output to the output of
> the "reference design" code on github. If your output is not
> identical, then you will be disqualified for producing incorrect
> results. /*
>
>
> In your reference design "human" output, you have included the xxHash
> value, is this needed for the contest? If another digest/method has
> been used to identify the duplicates, it will not match up.
>
> Cheers and happy new year,
> Joakim