From dfwpm at internetalias.net Tue Dec 10 15:29:29 2013
From: dfwpm at internetalias.net (Tommy Butler)
Date: Tue, 10 Dec 2013 17:29:29 -0600
Subject: [DFW.pm] Meeting Announcement - Hackathon - Bring Your A Game
Message-ID: <52A7A3D9.2090004@internetalias.net>

Fellow Perl Mongers,

Tomorrow night at _*7 PM*_ we meet for the month of December. We will be holding a hackathon of sorts, a competitive educational event.

The topic and objective of the competition will be announced at the beginning of the meeting, which will also be broadcast live in a Google Hangout that you can join remotely (the link and info for the hangout will be sent to this mailing list an hour or two before the meeting).

Instructions and assistance will be provided for beginners or anyone else who needs help. Participation is not mandatory; you can just watch or fly wingman for your favorite competitor if you want.

Bring with you:

  * Laptop
  * Friend, colleague, or someone you mentor
  * Google Hangouts browser plugin or mobile app (please install it before the meeting)
  * GitHub account (please set this up before the meeting)
  * SSH client ready to connect to a Linux server (Windows users can use PuTTY for free)
  * FileZilla or your SFTP software of choice
  * Your Mad Perl skills!

Location Info:

2995 Ladybird Lane, Dallas, TX
www.dallasmakerspace.org
(214) 699-6537

See you at 7 PM!

--Tommy Butler

From dfwpm at internetalias.net Tue Dec 10 19:06:18 2013
From: dfwpm at internetalias.net (Tommy Butler)
Date: Tue, 10 Dec 2013 21:06:18 -0600
Subject: [DFW.pm] Meeting Announcement - Hackathon - Bring Your A Game
In-Reply-To: <52A7A3D9.2090004@internetalias.net>
References: <52A7A3D9.2090004@internetalias.net>
Message-ID: <52A7D6AA.6030806@internetalias.net>

If you're going to participate in tomorrow's contest and you will require a perlbrew of your own, please let me know as soon as possible and I'll arrange a brew with v5.18.1 x64.

If you aren't familiar with perlbrew and don't anticipate needing an isolated Perl environment for yourself and/or are content to use the default system Perl, you can safely disregard this message.

If anyone feels this puts contestants on uneven footing, rest assured that we'll re-run code benchmarks against the same Perl if there are any "close" results.

--Tommy Butler

From noreply-f3605238 at plus.google.com Wed Dec 11 12:06:30 2013
From: noreply-f3605238 at plus.google.com (Tommy Butler (Google+))
Date: Wed, 11 Dec 2013 12:06:30 -0800 (PST)
Subject: [DFW.pm] Tommy Butler invited you to DFW Perl Mongers Hackathon
References:
Message-ID:

Tommy Butler invited you to DFW Perl Mongers Hackathon
Wed, December 11, 7:00 PM CST
Rob Hoelz, Mark Jason Dominus, Joel Bernstein and 120 more are invited

View Invitation: https://plus.google.com/_/notifications/ngemlink?&emid=CLiRzJ78qLsCFUjjQAodcSoAAA&path=%2Fevents%2Fc84tv7hiosleru73tm3tm5el0k0%3Fgpinv%3DAMIXal9uIa-qOfX1wiKDU_jSPRMMDAX0UKZ_t_UHSY05ePoxcfqKJ1TDDqiXGc8wP2r5PSojHkNC0LhSNtWKcUp6syyLhZBT-CY7B6GjS45xRQ4sufyMblY%26gpsrc%3Dgpev0&dt=1386792391250&uob=14

Join us online for the launch of our DFW.pm "winter of code" hackathon competition (in which you too are invited to participate). The meeting proper will happen in person at 7 pm tonight in our usual meeting place:

2995 Ladybird Lane, Dallas, TX
www.dallasmakerspace.org
(214) 699-6537

A presentation will be given to reveal the objective of the competition, and assistance will be provided in setting up access to the git repository and code contest server. Bring a laptop, bring a friend, and bring your Perl skills! See you tonight!

** Please have your Google Hangouts browser plugin or mobile app installed and ready to go.

From dfwpm at internetalias.net Wed Dec 11 20:16:10 2013
From: dfwpm at internetalias.net (Tommy Butler)
Date: Wed, 11 Dec 2013 22:16:10 -0600
Subject: [DFW.pm] NY.pm joining our hackathon, and slides from tonight's meeting
In-Reply-To: <52A7D6AA.6030806@internetalias.net>
References: <52A7A3D9.2090004@internetalias.net> <52A7D6AA.6030806@internetalias.net>
Message-ID: <52A9388A.8050408@internetalias.net>

The slides from tonight's meeting can be viewed --> here

David Golden from NY.pm is extending our contest to his own Perl Mongers group so they can compete!

To get your ssh login set up on the competition server and for other setup assistance, send an email to dfwpm at internetalias dot net

Thank you all for your participation. This is going to be fun!

--Tommy Butler

From dfwpm at internetalias.net Thu Dec 12 13:53:59 2013
From: dfwpm at internetalias.net (Tommy Butler)
Date: Thu, 12 Dec 2013 15:53:59 -0600
Subject: [DFW.pm] Hackathon Rules and Participation
Message-ID: <52AA3077.9060800@internetalias.net>

/*Sorry for the length of the email*/, but this being a formal contest (and one in which interest is growing even outside DFW), I need to clarify some things for people who didn't make it to the meeting last night, either in person or online.

Here goes...

The DFW Perl Mongers Winter of Code Deduplication Hackathon

Participation

Any Perl Monger anywhere may participate so long as he/she is vouched for by their PM group leader, a prominent Perl community member, or a CPAN author with a module released prior to this contest.

Hackathon Server Accounts

A testing/development contest server is being provided. Everyone who wants to participate will get an SSH login and, optionally, a perlbrew if they ask for it. They can also install their own brew. Anyone who wants an account should send their public SSH key to dfwpm_at_internetalias_dot_net. Password-based logins won't be allowed.

This server will give everyone the chance to develop their code for 1 month in the same environment in which it will be benchmarked during the formal head-to-head contest on January 8th (which will be broadcast live in a Google Hangout as usual, so physical presence at the Dallas Makerspace isn't necessary).

Each participant may consume up to 1GB of disk space; space consumption will be monitored, as will bandwidth consumption ("Don't be a jerk"). Contestants should rely on github for code storage because I will wipe and recreate the server before the actual contest. Disk storage should therefore be considered volatile, and git should be used to keep any data and code that contestants want to persist across the server rebuild.

Environment

The server will be running Debian Linux 7, stable branch Wheezy.
It will be hosted on port 2222 at perl.atrixnet.com, and a security lockout mechanism is in place for four failed logins in a row (i.e., don't try to log in before I set up your key).

As stated above, a full rebuild of the server will happen a couple of days or so before the live contest at next month's meeting on humpday, January 8th, in order to ensure fairness and prevent foul play. When the contest server comes back online I will restore everyone's code by cloning their repo, but their system logins will not be restored -- no one will have access to the competition server at that time except the judges (Tommy Butler and John Fields). David Golden of NY.pm and Patrick Michaud of our local group are honorary judges.

Conduct

In the spirit of our community I only ask that no one do or try to do anything unethical, malicious, unfair, or abusive on the server -- including being a resource hog. Basically any rules that apply at a YAPC event apply to this hackathon, as do the dictates of common sense and decency :)

Test Data

The test data will be on a read-only volume mounted on /dedup. The deduplication code from each contestant is simply required to accurately detect all duplicate files in a 100 gigabyte mass of randomly generated files and directories. The individual files and directories will have random names as well. Detect the duplicate files -- it's that simple.

The volume will have both symlinks and hardlinks, and code will need to handle them correctly. If code relies on heuristics of the data volume in order to achieve performance improvements, the author of such code will be disappointed; the random data will be regenerated before the contest and will not have the same number of symlinks/hardlinks, files, file names, file sizes, directories or directory depths. All contestants will be running their code against the same data volume.

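For readers who want a starting point, the task described above can be sketched in a few lines of core Perl. This is only an illustration, not the organizers' reference code: the function name find_duplicates is hypothetical, and the choice of "group by size, then compare MD5 digests" is an assumption (the rules do not prescribe any method). It skips symlinks and never descends into symlinked directories, but it deliberately does not deal with hard links.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Find ();
use Digest::MD5 ();

# Hypothetical helper: returns a list of array refs, each holding the sorted
# paths of regular files with identical content (same size, same MD5 digest).
sub find_duplicates {
    my ($root) = @_;
    my %by_size;

    # File::Find does not follow symlinks unless asked to, so symlinked
    # directories cannot cause recursion loops here.
    File::Find::find(
        {
            no_chdir => 1,
            wanted   => sub {
                my $path = $File::Find::name;
                my @st   = lstat $path or return;    # lstat never follows a symlink
                return if -l _;                      # symlinks are not duplicates
                return unless -f _;                  # regular files only
                push @{ $by_size{ $st[7] } }, $path; # $st[7] is the size in bytes
            },
        },
        $root,
    );

    my @groups;
    for my $size ( keys %by_size ) {
        my @same_size = @{ $by_size{$size} };
        next if @same_size < 2;    # a unique size cannot have a duplicate

        my %by_digest;
        for my $path (@same_size) {
            open my $fh, '<:raw', $path or next;
            push @{ $by_digest{ Digest::MD5->new->addfile($fh)->hexdigest } }, $path;
            close $fh;
        }
        push @groups, map { [ sort @$_ ] } grep { @$_ > 1 } values %by_digest;
    }
    return sort { $a->[0] cmp $b->[0] } @groups;
}

# Usage (assumed invocation): perl dedup_sketch.pl /dedup
if (@ARGV) {
    print join( "\t", @$_ ), "\n" for find_duplicates( $ARGV[0] );
}
```

Grouping by size first means most files are never read at all; only files that share a size with at least one other file get hashed.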
Contest Rules

...Are founded on the information in the slide presentation.

All code will be tested against the same Perl (the latest available stable version before the contest, likely 5.18.0) and the best-out-of-two benchmark time will be used for each participant (because of hardware-based CPU and disk caching that we can't prevent).

No code should write to disk or to in-memory filesystems such as /run/shm. If we catch the code doing writes, it is automatic grounds for disqualification. This is to prevent results-caching between runs.

Participants should disclose ahead of time what top-level CPAN modules they need installed for the correct operation of their code.

The code and contestants should conform to the rules set forth in the slides from last night's Perl Mongers meeting. The code will be reviewed by the judging panel prior to execution. Unintelligible or obfuscated code won't be accepted. Read the slides for further information.

Rules are subject to amendment by a majority vote of the judging panel, in the event that rules must be modified to ensure fairness or fix problems that prevent smooth, convenient execution of the contest.

Your suggestions are always welcome. Please let us know your thoughts and remember to mail in your public SSH key to dfwpm_at_internetalias_dot_net if you are going to participate.

--Tommy Butler, John Fields

From dfwpm at internetalias.net Wed Dec 18 14:50:01 2013
From: dfwpm at internetalias.net (Tommy Butler)
Date: Wed, 18 Dec 2013 16:50:01 -0600
Subject: [DFW.pm] Hackathon Rules and Participation
In-Reply-To: <52AA3077.9060800@internetalias.net>
References: <52AA3077.9060800@internetalias.net>
Message-ID: <52B22699.502@internetalias.net>

UPDATE: now participating in addition to DFW.pm are members of the *Philadelphia*, *New York*, and *Atlanta* Perl Mongers groups. We're nation-wide.
I'd like to see some international participation too, so please spread the word and post this link on your blog/social media stream: http://perlmonks.org/?node_id=1067570

PS - Today someone requested that emacs be installed on the free-to-use dev/contest server. Alas, I obliged. If anyone needs a particular package and/or wants to discuss it off-list, you can just email dfwpm at internetalias dot net.

PPS - Reminder that formal contest rules are available at http://dfw.pm.org

--Tommy Butler

From dfwpm at internetalias.net Wed Dec 18 15:54:48 2013
From: dfwpm at internetalias.net (Tommy Butler)
Date: Wed, 18 Dec 2013 17:54:48 -0600
Subject: [DFW.pm] Dedup Contest - SSH Access Tip
In-Reply-To:
References:
Message-ID: <52B235C8.3030909@internetalias.net>

This tip comes from one of our contest participants:

/"Just in case it would help others, here is the chunk I just added to my ~/.ssh/config file, to make sure that I never err with the wrong port number:"/

    # Dallas Fort Worth Perl Mongers - disk deduplication contest 2013-12-18
    Host dfw
        Hostname perl.atrixnet.com
        Port 2222
        User PUT YOUR USERNAME HERE
        PreferredAuthentications publickey
        IdentityFile ~/.ssh/id_rsa
        IdentitiesOnly yes

--Tommy Butler

From dfwpm at internetalias.net Thu Dec 19 08:04:05 2013
From: dfwpm at internetalias.net (Tommy Butler)
Date: Thu, 19 Dec 2013 10:04:05 -0600
Subject: [DFW.pm] OT - Perl/Sysadmin opening at my company
Message-ID: <52B318F5.4040700@internetalias.net>

As is our policy, we don't accept "recruiter spam" on our list, but we do share meaningful professional networking leads.

My company in Irving, TX is looking to fill a position in the near future for someone with Unix experience who has at least basic Perl scripting skills. Interested parties can follow up at dfwpm at internetalias dot net.

--Tommy Butler

From dfwpm at internetalias.net Thu Dec 19 14:55:00 2013
From: dfwpm at internetalias.net (Tommy Butler)
Date: Thu, 19 Dec 2013 16:55:00 -0600
Subject: [DFW.pm] Hackathon Just Got Real
Message-ID: <52B37944.7050009@internetalias.net>

It's kind of official now. Our hackathon competition just made the front page news feed on perl.org

--Tommy Butler, John Fields

From dfwpm at internetalias.net Fri Dec 20 08:55:25 2013
From: dfwpm at internetalias.net (Tommy Butler)
Date: Fri, 20 Dec 2013 10:55:25 -0600
Subject: [DFW.pm] A Warm Welcome & I/O Niceness Tip
Message-ID: <52B4767D.1000705@internetalias.net>

A warm welcome to our newest list members, friends from other Perl Monger groups around the globe. Recently joining us as part of the hackathon are some of our neighbors from New York City, Brooklyn, Philadelphia, Atlanta, Chicago, and Sydney, Australia.

If you still need an ssh account set up on the competition server, check out dfw.pm.org for details.

And finally, a tip for those running code that does rapid-fire reading on the filesystem: please consider running it through ionice. It's the nice thing to do.
Example (from the shell prompt):

    $ nice -n 19 ionice -c2 -n7 ./your_perl_program.pl

--Tommy Butler

From dfwpm at internetalias.net Mon Dec 23 11:28:57 2013
From: dfwpm at internetalias.net (Tommy Butler)
Date: Mon, 23 Dec 2013 13:28:57 -0600
Subject: [DFW.pm] hard links are not dupes!
Message-ID: <52B88EF9.5080401@internetalias.net>

We've had a lot of list sign-ups lately in relation to the hackathon, so this is just a reminder of things already discussed in our last meeting which you may have missed.

  * You are deduping an ext4 filesystem and that's all we're saying about it. You have server access so if you want to poke it, you can ;-)
  * /*The test data you are working with is peppered randomly with hard and soft links.*/
  * Hard links to files are not duplicates. They point to the same underlying storage and should therefore be considered as files already de-duped. We are aware that technically hard links _are_ files in and of themselves, but they are metadata and not storage. You'll have to decide how to optimize for this. It's a very tricky tradeoff. Don't base your decision on how frequently links occur in the test dataset; the final dataset will not be identical and is not even guaranteed to be similar.
  * Symlinks are also NOT duplicates.
  * Your code is indeed going to face /directory/ symlinks as well as file links, so you'll need to take care not to get stuck in directory recursion loops.

-- Tommy Butler, John Fields

From dfwpm at internetalias.net Mon Dec 23 17:18:21 2013
From: dfwpm at internetalias.net (Tommy Butler)
Date: Mon, 23 Dec 2013 19:18:21 -0600
Subject: [DFW.pm] contest optimization strategy clarifications - BRING IT
Message-ID: <52B8E0DD.6010109@internetalias.net>

Hi,

With all the recent list sign-ups we've had some questions raised off-list.
I'd like to address these one time, for the group, instead of one-by-one.

The questions:

 1. I want to write to a *destroyed-on-exit* in-memory database (SQLite) or *destroyed-on-exit* tied hash (BDB) = THIS IS OK
 2. My code depends on modules that write temp files that persist between executions = THIS IS BENT*
 3. My code requires a C compiler on the system = THIS IS BENT**

If your code/design looks like item number 1 in the list above, we're not so concerned about your tied hash writing to the filesystem or /dev/shm because we've decided to completely roll back the server before every code execution. Yup. It will take mere seconds to roll back the server to its state before any code ran.

Even so, we still discourage other kinds of disk writes; we'd rather not deviate from rules around which existing code has already been designed. We're just going to make darn sure that any sneaky disk writes are completely non-existent between your test runs. Fairness must be assured.

**Now if your code falls into a "THIS IS BENT" category, you're still welcome to compete, and even win, but _you'll be doing so in a separate competition category_, simply called "rule benders". Why allow rule benders? Because we still want to see how fast things can go.

We asked for a Pure Perl solution, with the only exception being that your code could depend on XS-based (that means compiled C extensions) modules from the CPAN that were released /prior/ to the beginning of the hackathon. Strangely enough, those rules get blurry when you start using CPAN modules that depend on Inline::C or that otherwise need access to a C compiler on the system at the time when the code runs. To John and me, this is a type of code optimization that isn't based on Perl, but instead based on C.
You can argue how much of it is Perl and how much of it is C in terms of lines of code in one language or the other, but you can't easily prove how much of the performance gain was Perl-based and how much was XS-based, and we aren't going to NYTProf your code just to find out.

What it comes down to in terms of fairness is that Perl code which might have lost the competition on its own will then have won by virtue of low-level optimizations that just aren't in keeping with the spirit of the contest as it was intended -- but which are still awesome!

We'd like to see what can be achieved through this kind of go-baby-go, nitrous-oxide-injected, turbocharged Perl, but in your own category of competition. So go for it. If you want to be a bender, let's see what you've got.

Bring it, benders.

--Tommy Butler, John Fields

From dfwpm at internetalias.net Tue Dec 24 10:41:43 2013
From: dfwpm at internetalias.net (Tommy Butler)
Date: Tue, 24 Dec 2013 12:41:43 -0600
Subject: [DFW.pm] example deduplication code and full disclosure
Message-ID: <52B9D567.2030204@internetalias.net>

Full disclosure: I'm not competing in the contest, as John and I are hosting it and have written the code that generates the random dataset.

However, I wrote some example code that does work and which I'd like to share to help give others a gentle push if anyone is having trouble getting started.

Feel free to steal/fork/laugh at the code as much as you like. The code isn't extensively commented but it is very readable. It's also simple and concise and makes use of CPAN modules, some of which use XS code to get performance gains -- which is within the rules for the "traditional Perl solution" competition category.

One provision is that my code purposely does not solve all the problems. In particular it doesn't handle hard links. That's up to you to solve.

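As a hint for the hard link part, one possible approach is sketched below. It is not taken from the dupfind repository: the function name collapse_hard_links is hypothetical, and keeping the first name in plain string sort order is an assumption made here only so that the output is deterministic. The idea is that two names sharing the same device and inode numbers are one file on disk, so only one of them needs to take part in any content comparison.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: given a list of paths, keep one name per underlying
# file, identified by its (device, inode) pair. Symlinks and unreadable
# paths are dropped.
sub collapse_hard_links {
    my @paths = @_;
    my ( %seen, @unique );

    for my $path ( sort @paths ) {
        my ( $dev, $ino ) = lstat $path or next;   # lstat: do not follow symlinks
        next if -l _;                              # a symlink is not a duplicate
        next if $seen{"$dev:$ino"}++;              # another name for a file already kept
        push @unique, $path;
    }
    return @unique;
}

# Usage (assumed invocation): perl hardlinks_sketch.pl file1 file2 ...
if (@ARGV) {
    print "$_\n" for collapse_hard_links(@ARGV);
}
```

The ':' between the device and inode numbers keeps the two values from running together in the hash key.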
https://github.com/tommybutler/dupfind

--Tommy Butler

From dfwpm at internetalias.net Thu Dec 26 19:24:35 2013
From: dfwpm at internetalias.net (Tommy Butler)
Date: Thu, 26 Dec 2013 21:24:35 -0600
Subject: [DFW.pm] But what about Perl 6?
Message-ID: <52BCF2F3.801@internetalias.net>

It was mentioned to me tonight that a Perl 6 competition category seemed to be missing from the deduplication hackathon competition. Indeed.

If you want to submit an entry in Perl 6, it will eagerly be accepted and pitted against any other Perl 6 submissions. If you are the only one to submit something in Perl 6, you win the Perl 6 competition category. *Is there anyone who would be interested in this?*

--Tommy Butler, John Fields

From robertbrucegray3 at gmail.com Thu Dec 26 20:18:39 2013
From: robertbrucegray3 at gmail.com (Bruce Gray)
Date: Thu, 26 Dec 2013 22:18:39 -0600
Subject: [DFW.pm] But what about Perl 6?
In-Reply-To: <52BCF2F3.801@internetalias.net>
References: <52BCF2F3.801@internetalias.net>
Message-ID:

I already have a Perl 6 solution coded.

From dfwpm at internetalias.net Thu Dec 26 20:25:28 2013
From: dfwpm at internetalias.net (Tommy Butler)
Date: Thu, 26 Dec 2013 22:25:28 -0600
Subject: [DFW.pm] But what about Perl 6?
In-Reply-To:
References: <52BCF2F3.801@internetalias.net>
Message-ID: <52BD0138.7070403@internetalias.net>

AWESOME =)

Do you want a custom-compiled Rakudo, or is the current stable release (according to the Debian Rakudo Maintainers) ok with you? It's version 0.1~2012.01-1

All opinions welcome.

--Tommy Butler

From dfwpm at internetalias.net Sat Dec 28 09:55:28 2013
From: dfwpm at internetalias.net (Tommy Butler)
Date: Sat, 28 Dec 2013 11:55:28 -0600
Subject: [DFW.pm] server advisory, and good manners
Message-ID: <52BF1090.1000601@internetalias.net>

I am going to try out a version of my code that is multi-threaded (and no, I'm not competing with you; I'm just working on speeding up the code that will be used to validate the results your own code produces). The code is still on github in the aforementioned location. The reason I'm sending this message is that the multi-threaded version may run the server out of memory and/or lock it up. I have not been able to fix that yet. It works fine on 30GB of data, but this will be the first time I try it on 100GB.
I will reboot the server if I lock it up, with my apologies. I will send out another message right before I run the code.

If you have code that you know puts the stability of the server at risk, it's not a problem. Just mail the list first before you run it so others can be aware and so I can reboot if needed.

It may also be a good idea to advise the list before doing a big test run so you don't throw other benchmarks off for logged-in users who are also executing code.

You can check if other users are logged in by typing the `w` command. Use the `top` or `htop` commands to see if other users are testing code. Just look for "perl" or "rakudo" in the process table, or check the processes based on username.

If you want to send a broadcast message to other users on the server before you run your code, do so like this:

    echo "hello all, I am about to run my code, --Jane Contestant" | wall

--Tommy Butler

From dfwpm at internetalias.net Sat Dec 28 12:35:18 2013
From: dfwpm at internetalias.net (Tommy Butler)
Date: Sat, 28 Dec 2013 14:35:18 -0600
Subject: [DFW.pm] server advisory, and good manners
In-Reply-To: <52BF1090.1000601@internetalias.net>
References: <52BF1090.1000601@internetalias.net>
Message-ID: <52BF3606.7090305@internetalias.net>

OK, I'm running the code now. No one is logged in at present.

--Tommy Butler

From dfwpm at internetalias.net Sat Dec 28 19:37:09 2013
From: dfwpm at internetalias.net (Tommy Butler)
Date: Sat, 28 Dec 2013 21:37:09 -0600
Subject: [DFW.pm] what is a hard link, and what should my deduper do with them?
Message-ID: <52BF98E5.8060505@internetalias.net>

What is a hard link? --> http://www.linfo.org/hard_link.html

Because of the nature of hard links, there's no way to know which hard link existed first or which one is to be considered the "original" file, because they point to the same underlying storage, which has only one set of lastmod/atime/mtime timestamps. As such, the official rule on the matter is that the asciibetically-first fully-qualified file name will be considered the original, while the other hard links should be considered, as already stated in the rules, "files already deduped". This is strictly for consistency of output and reporting.
(We need this to maintain a standard baseline output format, which I'll set forth in another email coming soon).

SCENARIO: The three files below have identical content:

    /foo/bar/baz.txt -> ( inode 12345 )
    /foo/car/daz.txt -> ( inode 12345 )
    /foo/far/gaz.txt -> ( inode 67890 )

OUTCOME: /foo/far/gaz.txt should be reported as a duplicate of /foo/bar/baz.txt because /foo/bar/baz.txt comes before /foo/car/daz.txt in a sort and because /foo/car/daz.txt is a hard link.

CODE: I'm doing it like this. This is just an unoptimized example. TIMTOWTDI.

    # this will automatically throw out all but one hardlink, with the only surviving
    # file name being the first asciibetically-sorted entry
    # (the ':' keeps the device and inode numbers from running together in the key)
    $dev_inodes{ join ':', ( stat $_ )[0,1] } = $_
        for reverse sort @group_of_same_size_files_by_name;

    next if scalar keys %dev_inodes == 1; # don't keep working if there's nothing to compare

    for my $file ( values %dev_inodes )
    {
        # do stuff to figure out which of the same-size files are duplicates...
    }

--Tommy Butler

From joel.a.berger at gmail.com Sun Dec 29 11:32:38 2013
From: joel.a.berger at gmail.com (Joel Berger)
Date: Sun, 29 Dec 2013 13:32:38 -0600
Subject: [DFW.pm] disk-read buffering?
Message-ID:

Hi all,

As I am testing, the first time I run my script it takes significantly longer than subsequent runs made soon afterwards. Is there some amount of disk buffering happening? Can we control this for a consistent outcome?

Thanks,

Joel Berger

From dfwpm at internetalias.net Mon Dec 30 06:57:55 2013
From: dfwpm at internetalias.net (Tommy Butler)
Date: Mon, 30 Dec 2013 08:57:55 -0600
Subject: [DFW.pm] disk-read buffering?
In-Reply-To: References: Message-ID: <52C189F3.3090605@internetalias.net> I'm surprised you're seeing a large difference; I observe the effects of the hardware disk-based caching myself but I don't see a big difference between run 1 and 2. I can't really turn that off, and wouldn't want to because doing so would alter the real-world scenario of running your code in the wild. Nevertheless, to keep this from happening to you, run this before you run your Perl app: find /dedup >/dev/null This is precisely the reason why we are taking the best time out of 2 runs when each contestant's code is benchmarked. I will run the above command every time before benchmarking any code to assure fairness. --Tommy Butler On 12/29/2013 01:32 PM, Joel Berger wrote: > Hi all, > > As I am testing, the first time I run my script it takes significantly > longer than subsequent runs, soon afterwards. Is there some amount of > disk-buffering happening? Can we control this for a consistent outcome? > > Thanks, > > Joel Berger -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.eaglestone at gmail.com Mon Dec 30 12:38:39 2013 From: robert.eaglestone at gmail.com (Rob Eaglestone) Date: Mon, 30 Dec 2013 14:38:39 -0600 Subject: [DFW.pm] example deduplication code and full disclosure In-Reply-To: <52B9D567.2030204@internetalias.net> References: <52B9D567.2030204@internetalias.net> Message-ID: That's beautiful Perl, Tommy. I see I will have to replace my old "Programming Perl" book with one updated for the 2000s (mine is dated 1996, and shows Perl at version 5.003). On Tue, Dec 24, 2013 at 12:41 PM, Tommy Butler wrote: > Full disclosure: I'm not competing in the contest as John and I are > hosting it and have written the code that generates the random dataset. > > However I wrote some example code that does work and which I'd like to > share to help give others a gentle push if anyone is having trouble getting > started. 
>
> Feel free to steal/fork/laugh at the code as much as you like. The code
> isn't extensively commented but it is very readable. It's also simple and
> concise and makes use of CPAN modules, some of which use XS code to get
> performance gains -- which is within the rules for the "traditional Perl
> solution" competition category.
>
> One provision is that my code purposely does not solve all the problems.
> In particular it doesn't handle hard links. That's up to you to solve.
>
> https://github.com/tommybutler/dupfind
>
> --Tommy Butler
>
> _______________________________________________
> Dfw-pm mailing list
> Dfw-pm at pm.org
> http://mail.pm.org/mailman/listinfo/dfw-pm
>
>

From dfwpm at internetalias.net Mon Dec 30 13:22:49 2013
From: dfwpm at internetalias.net (Tommy Butler)
Date: Mon, 30 Dec 2013 15:22:49 -0600
Subject: [DFW.pm] example deduplication code and full disclosure
In-Reply-To:
References: <52B9D567.2030204@internetalias.net>
Message-ID: <52C1E429.1000406@internetalias.net>

Wow! I'll take that compliment, sir!

"Greater love no man hath than he that complimenteth the Perl of his friends" (or something like that, right?)

*One thing to note for anyone trying out that code: it just got some important bug fixes, so you'll want to git pull.*

--Tommy Butler

On 12/30/2013 02:38 PM, Rob Eaglestone wrote:
> That's beautiful Perl, Tommy. I see I will have to replace my old
> "Programming Perl" book with one updated for the 2000s (mine is dated
> 1996, and shows Perl at version 5.003).
>
>
> On Tue, Dec 24, 2013 at 12:41 PM, Tommy Butler wrote:
>
>     Full disclosure: I'm not competing in the contest as John and I
>     are hosting it and have written the code that generates the random
>     dataset.
>
>     However I wrote some example code that does work and which I'd
>     like to share to help give others a gentle push if anyone is
>     having trouble getting started.
> > Feel free to steal/fork/laugh at the code as much as you like. > The code isn't extensively commented but it is very readable. > It's also simple and concise and makes use of CPAN modules, some > of which use XS code to get performance gains -- which is within > the rules for the "traditional Perl solution" competition category. > > One provision is that my code purposely does not solve all the > problems. In particular it doesn't handle hard links. That's up > to you to solve. > > https://github.com/tommybutler/dupfind > > --Tommy Butler > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tmetro+dfw-pm at gmail.com Mon Dec 30 14:23:58 2013 From: tmetro+dfw-pm at gmail.com (Tom Metro) Date: Mon, 30 Dec 2013 17:23:58 -0500 Subject: [DFW.pm] disk-read buffering? In-Reply-To: References: Message-ID: <52C1F27E.40104@gmail.com> Joel Berger wrote: > ...the first time I run my script it takes significantly > longer than subsequent runs, soon afterwards. Tommy Butler wrote: > run this before you run your Perl app: > > find /dedup >/dev/null That'll help cause the metadata to be cached, but not the file data. > This is precisely the reason why we are taking the best time out of 2 > runs when each contestant's code is benchmarked. I will run the above > command every time before benchmarking any code to assure fairness. I thought there was going to be a server reboot/reset/rebuild between runs. The closest thing to real-world is having a fully empty cache, but I can't see any way that can be accomplished during development on a shared server. The next best thing for consistent results (so you can do relative comparisons) would be seeding the cache, but using other tools will only approximate the needs of your dedupe code. 
Probably your best bet when testing is to just plan on running multiple times, and realize that the first run will more closely approximate the competition run in terms of actual time, and use subsequent runs for relative comparisons.

Ideally while testing you should be benchmarking small portions of your code, so the cache will fill on the first run, and you have a good chance they'll remain populated for several subsequent runs, despite other users on the system hitting other files.

 -Tom

--
Tom Metro
The Perl Shop, Newton, MA, USA
"Predictable On-demand Perl Consulting."
http://www.theperlshop.com/

From tmetro+dfw-pm at gmail.com Mon Dec 30 14:28:56 2013
From: tmetro+dfw-pm at gmail.com (Tom Metro)
Date: Mon, 30 Dec 2013 17:28:56 -0500
Subject: [DFW.pm] what is a hard link, and what should my deduper do with them?
In-Reply-To: <52BF98E5.8060505@internetalias.net>
References: <52BF98E5.8060505@internetalias.net>
Message-ID: <52C1F3A8.3020007@gmail.com>

Tommy Butler wrote:
> ...other hard links should be considered, as already
> stated in the rules, "files already deduped".
>
> SCENARIO:
>
> The three files below have identical content:
> /foo/bar/baz.txt -> ( inode 12345 )
> /foo/car/daz.txt -> ( inode 12345 )
> /foo/far/gaz.txt -> ( inode 67890 )
>
>
> OUTCOME:
>
> /foo/far/gaz.txt should be reported as a duplicate of /foo/bar/baz.txt
> because /foo/bar/baz.txt comes before /foo/car/daz.txt in a sort and
> because /foo/car/daz.txt is a hard link.

So then the output might look like:

    /foo/bar/baz.txt /foo/far/gaz.txt

while /foo/car/daz.txt is simply eliminated from consideration and not output at all?

The problem with this approach, if you are striving for a useful tool and not just a programming exercise, is that you don't know which of the aliases is the name most familiar to the user who will be reviewing the report.
Another possibility might be to report hardlinks in a way that visually groups them together, then any place one member of a hardlink would appear in the output, you replace it with the group: (/foo/bar/baz.txt /foo/car/daz.txt) /foo/far/gaz.txt (With members of the group being sub-sorted asciibetically, and the first member of the group being used as the key when sorting the overall list of duplicates.) But this is still not quite ideal. This implies that you ignore collections of hardlinks that don't also have a duplicate file. Chances are good if the user is interested in duplicates, they're also interested to know about what hardlinks (aliases) exist. Plus, most characters you choose for grouping could potentially be part of the file name, although the same could be said for the space delimiters. So instead, you could simply produce a report of hardlinks at the end, and any place a file appears in a duplicate report that has multiple aliases, you always show the asciibetically first name: Duplicates: /foo/bar/baz.txt /foo/far/gaz.txt ... Aliases: /foo/bar/baz.txt /foo/car/daz.txt ... -Tom -- Tom Metro The Perl Shop, Newton, MA, USA "Predictable On-demand Perl Consulting." http://www.theperlshop.com/ From dfwpm at internetalias.net Mon Dec 30 14:32:44 2013 From: dfwpm at internetalias.net (Tommy Butler) Date: Mon, 30 Dec 2013 16:32:44 -0600 Subject: [DFW.pm] disk-read buffering? In-Reply-To: <52C1F27E.40104@gmail.com> References: <52C1F27E.40104@gmail.com> Message-ID: <52C1F48C.8080900@internetalias.net> On 12/30/2013 04:23 PM, Tom Metro wrote: > I thought there was going to be a server reboot/reset/rebuild between > runs. It will. > The closest thing to real-world is having a fully empty cache, but I > can't see any way that can be accomplished during development on a > shared server. ...and such should be the case, or as close to it as possible. 
> Ideally while testing you should be benchmarking small portions of > your code, so the cache will fill on the first run, and you have a > good chance they'll remain populated for several subsequent runs, > despite other users on the system hitting other files. No need to worry too much about this. The server won't be 'shared' at the time it's running the formal competition code benchmarks. It will have been completely reverted to its state before the contest began. Code will be cloned from github when it's time to run. Before each run, the server will be rolled back. It's wholly contained in a virtual machine for this reason. The encasing hardware/host will be totally idle other than its task of running the VM. We needed the ability to "reset" the contest server to a pristine state for each contestant, and having a virtual machine made perfect sense. As advised from the beginning: code that depends on caching will be self-limited for the above reasons. --Tommy Butler -------------- next part -------------- An HTML attachment was scrubbed... URL: From tmetro+dfw-pm at gmail.com Mon Dec 30 14:41:56 2013 From: tmetro+dfw-pm at gmail.com (Tom Metro) Date: Mon, 30 Dec 2013 17:41:56 -0500 Subject: [DFW.pm] zero length files Message-ID: <52C1F6B4.4010807@gmail.com> What about zero length files? By one view, they are all duplicates of each other, as their content is identical. By another view, the concept of duplication is moot, as they have no content. Likely most algorithms will treat all zero byte files as identical, unless the code has a special case. Should they be handled specially? If so, how? Ignored? Grouped together in their own section of the report? -Tom -- Tom Metro The Perl Shop, Newton, MA, USA "Predictable On-demand Perl Consulting." 
http://www.theperlshop.com/ From dfwpm at internetalias.net Mon Dec 30 14:44:07 2013 From: dfwpm at internetalias.net (Tommy Butler) Date: Mon, 30 Dec 2013 16:44:07 -0600 Subject: [DFW.pm] zero length files In-Reply-To: <52C1F6B4.4010807@gmail.com> References: <52C1F6B4.4010807@gmail.com> Message-ID: <52C1F737.1060308@internetalias.net> For our purposes, they're dupes. Please report them as such. --Tommy Butler On 12/30/2013 04:41 PM, Tom Metro wrote: > What about zero length files? > > By one view, they are all duplicates of each other, as their content is > identical. By another view, the concept of duplication is moot, as they > have no content. > > Likely most algorithms will treat all zero byte files as identical, > unless the code has a special case. > > Should they be handled specially? If so, how? Ignored? Grouped together > in their own section of the report? > > -Tom > -------------- next part -------------- An HTML attachment was scrubbed... URL: From dfwpm at internetalias.net Mon Dec 30 15:26:32 2013 From: dfwpm at internetalias.net (Tommy Butler) Date: Mon, 30 Dec 2013 17:26:32 -0600 Subject: [DFW.pm] what is a hard link, and what should my deduper do with them? In-Reply-To: <52C1F3A8.3020007@gmail.com> References: <52BF98E5.8060505@internetalias.net> <52C1F3A8.3020007@gmail.com> Message-ID: <52C20128.8050600@internetalias.net> On 12/30/2013 04:28 PM, Tom Metro wrote: > Tommy Butler wrote: >> ...other hard links should be considered, as already >> stated in the rules, "files already deduped". >> >> SCENARIO: >> >> The three files below have identical content: >> /foo/bar/baz.txt -> ( inode 12345 ) >> /foo/car/daz.txt -> ( inode 12345 ) >> /foo/far/gaz.txt -> ( inode 67890 ) >> >> >> OUTCOME: >> >> /foo/far/gaz.txt should be reported as a duplicate of /foo/bar/baz.txt >> because /foo/bar/baz.txt comes before /foo/car/daz.txt in a sort and >> because /foo/car/daz.txt is a hard link. 
>
> So then the output might look like:
> /foo/bar/baz.txt /foo/far/gaz.txt

YES! :)

> while /foo/car/daz.txt is simply eliminated from consideration and not
> output at all?

Yep.

> The problem with this approach, if you are striving for a useful tool
> and not just a programming exercise, is that you don't know which of
> the aliases is the name most familiar to the user who will be
> reviewing the report.

And just when I've about finalized the output spec and created an output file to put up on the git repo for diffing ... this. :-)

You are right, insofar as we are working to develop a useful tool and not create throw-away code to use once for a competition. I won't pick nits over the likelihood of a real-world scenario where hardlinks exist in Joe User's music collection. However for the sake of simplicity we're not going to require contestants to go this extra mile at this time.

Everyone is free to implement an output format that reports hard link groupings and to do so for unredeemable "bonus" points. Should anyone, like me, want a useful tool when they are done with their code, they should strive to make it as robust and feature-ful as possible without sacrificing too much performance. After all, there is a winnings category for code that provides the most comprehensive feature set.

As promised recently, a finalized "expected" output format (against which the product of each contestant's code will be diff'd) is forthcoming. It seems like a glaring oversight that this wasn't part of the original rules specification. I don't fault myself too much for this given the fact that the output format is to be so simple and up to this point everyone is expected to be coding against the problem and not the output. Watch for that email later on this evening.

Thanks Tom, and thanks all!

--Tommy Butler
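For anyone who wants to try the optional hard link grouping discussed in this thread, here is a minimal sketch. It assumes a POSIX filesystem where stat returns meaningful device and inode numbers. The name collapse_hardlinks() and its return shape are hypothetical, chosen for illustration, and are not part of dupfind. It keeps the asciibetically first name for each device/inode pair, as the contest rules ask, and remembers the other names so they could be listed in a separate aliases section.

```perl
use strict;
use warnings;

# Hypothetical helper (not from dupfind): collapse hard links so that each
# (device, inode) pair is represented by its asciibetically first path.
# Returns two references: the sorted surviving paths, and a hash mapping
# each surviving path to the list of its other names (aliases).
sub collapse_hardlinks {
    my @paths = @_;
    my ( %first_name, %aliases );

    for my $path ( sort { $a cmp $b } @paths ) {
        my ( $dev, $ino ) = ( stat $path )[ 0, 1 ];
        next unless defined $ino;    # stat failed; skip this entry

        my $key = "$dev:$ino";       # ':' avoids ambiguous concatenation
        if ( exists $first_name{$key} ) {
            # Same inode seen before: record this name as an alias only.
            push @{ $aliases{ $first_name{$key} } }, $path;
        }
        else {
            $first_name{$key} = $path;
        }
    }

    my @survivors = sort { $a cmp $b } values %first_name;
    return ( \@survivors, \%aliases );
}
```

Only the surviving paths would then go on to the size and checksum comparison; the aliases hash is extra information that the contest output does not require.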
From dfwpm at internetalias.net Mon Dec 30 18:40:52 2013
From: dfwpm at internetalias.net (Tommy Butler)
Date: Mon, 30 Dec 2013 20:40:52 -0600
Subject: [DFW.pm] Deduplication Hackathon: Formal Output Specification
Message-ID: <52C22EB4.8030303@internetalias.net>

For your deduplication hackathon code entry, the output of your Perl app should be as follows:

 1. Each grouping of duplicates should be sorted by filename and printed out all on one line, delimited by a tab character.
 2. The lines of output should be sorted.
 3. The sort you should use for both the lines of output and the file name groupings themselves is:
    sort { $a cmp $b }
 4. Any output leading up to a delimiter of 30 dashes on its own line will be ignored. Any output coming after a second line comprised of 30 dashes is also ignored. These delimiter lines are optional if your output is solely comprised of the sorted results and nothing else. Otherwise, use the space to prefix your results with status messages or a status indicator (progress bar, etc), and optionally follow up your results with a summary of what your code encountered. See example at bottom of message.

Your code can actually output whatever it wants, so long as there is a way to call it where it produces output according to the spec as outlined above. An example is provided in the lines below, and in the screenshot that follows. This output is generated by the code as found on github at https://github.com/tommybutler/dupfind

In just a few minutes I will put up on github (at the same url) the correct output for the reference data that is currently on the contest server under /dedup.

*Please take time to compare your code output to the output of the "reference design" code on github. If your output is not identical, then you will be disqualified for producing incorrect results.* If you believe the reference design is incorrect, then please submit a bug report and/or a patch!!
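The four rules above can be sketched in a few lines of Perl. This is only an illustration of the spec, not the reference implementation: format_report() is a hypothetical helper name, and the sample paths passed to it are made up for the demonstration.

```perl
use strict;
use warnings;

# Hypothetical helper: takes a list of array refs, each holding the paths
# of one group of duplicates, and returns the report as a single string.
sub format_report {
    my @groups = @_;
    my $rule   = '-' x 30;    # rule 4: a delimiter line of 30 dashes

    # Rules 1 and 3: sort the names within each group, join them with tabs.
    my @lines = map { join "\t", sort { $a cmp $b } @$_ } @groups;

    # Rules 2 and 3: sort the lines of output with the same comparison.
    @lines = sort { $a cmp $b } @lines;

    return join "\n", $rule, @lines, $rule, '';
}

# Anything printed before the first delimiter (or after the second) is
# ignored when the output is compared, so status messages can go here.
print "** DISPLAYING OUTPUT\n";
print format_report(
    [ './foo', './baz', './bar' ],    # sample data, invented for this demo
    [ './two/copy', './one/copy' ],
);
```

Because both sorts use plain `cmp`, uppercase names sort before lowercase ones, which matters when diffing against the expected output.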
--Tommy Butler

------------------------------------------------------------------------
$ ./dupfind --format robot --dir .
** SCANNING ALL FILES
** CHECKSUMMING SIZE DUPLICATES
** DISPLAYING OUTPUT
------------------------------
./.git/logs/HEAD	./.git/logs/refs/heads/master
./.git/refs/heads/master	./.git/refs/remotes/origin/master
./bar	./baz	./foo
------------------------------
** TOTAL SCANNED: 86
** TOTAL DUPES: 4
** SCAN TIME: 0.00824308 wallclock secs ( 0.00 usr + 0.01 sys = 0.01 CPU)
** DELETION TIME: 0
------------------------------------------------------------------------

-------------- next part --------------
A non-text attachment was scrubbed...
Name: edihfghd.png
Type: image/png
Size: 490429 bytes
Desc: not available
URL:

From joakim.lagerqvist at gmail.com Tue Dec 31 02:19:11 2013
From: joakim.lagerqvist at gmail.com (Joakim Lagerqvist)
Date: Tue, 31 Dec 2013 21:19:11 +1100
Subject: [DFW.pm] Deduplication Hackathon: Formal Output Specification
In-Reply-To: <52C22EB4.8030303@internetalias.net>
References: <52C22EB4.8030303@internetalias.net>
Message-ID:

Hello Tommy,

On Tue, Dec 31, 2013 at 1:40 PM, Tommy Butler wrote:

> *Please take time to compare your code output to the output of the
> "reference design" code on github. If your output is not identical, then
> you will be disqualified for producing incorrect results.*
>

In your reference design "human" output, you have included the xxHash value. Is this needed for the contest? If another digest/method has been used to identify the duplicates, it will not match up.

Cheers and happy new year,
Joakim
From dfwpm at internetalias.net Tue Dec 31 12:37:12 2013
From: dfwpm at internetalias.net (Tommy Butler)
Date: Tue, 31 Dec 2013 14:37:12 -0600
Subject: [DFW.pm] Deduplication Hackathon: Formal Output Specification
In-Reply-To:
References: <52C22EB4.8030303@internetalias.net>
Message-ID: <52C32AF8.6070709@internetalias.net>

You may use any hashing/mapping algorithm you like; xxhash was just a personal choice given its emphasis on speed.

The "robot" format of output is what must match up with your own output. The "human" format is strictly optional, and if you choose to create a human-readable output format option for your own code, it can look however you want it to.

Remember to email your ssh key to me if you want server access and the ability to test your code against the reference data.

--Tommy Butler

On 12/31/2013 04:19 AM, Joakim Lagerqvist wrote:
> Hello Tommy,
>
> On Tue, Dec 31, 2013 at 1:40 PM, Tommy Butler wrote:
>
>     *Please take time to compare your code output to the output of
>     the "reference design" code on github. If your output is not
>     identical, then you will be disqualified for producing incorrect
>     results.*
>
>
> In your reference design "human" output, you have included the xxHash
> value, is this needed for the contest? If another digest/method has
> been used to identify the duplicates, it will not match up.
>
> Cheers and happy new year,
> Joakim
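Since any digest is allowed, here is a rough sketch of the "checksum only the size duplicates" approach that the reference output's status lines hint at. It is not the dupfind code: find_duplicates() is a hypothetical name, and the core module Digest::MD5 is swapped in for xxHash purely so the example has no CPAN dependencies. Hard links are not handled here. Zero-length files end up grouped together, in line with the ruling earlier in this thread that they count as dupes.

```perl
use strict;
use warnings;
use Digest::MD5;

# Hypothetical helper: returns a list of array refs, one per group of
# files with identical content. Digest::MD5 stands in for xxHash.
sub find_duplicates {
    my @paths = @_;

    # Pass 1: bucket plain files by size. A file whose size is unique
    # cannot have a duplicate, so it never needs to be read.
    my %by_size;
    for my $path (@paths) {
        next unless -f $path;
        my $size = ( stat _ )[7];    # reuse the stat done by -f
        push @{ $by_size{$size} }, $path;
    }

    # Pass 2: checksum only the files that share a size with another file.
    my %by_digest;
    for my $size ( keys %by_size ) {
        my @same_size = @{ $by_size{$size} };
        next if @same_size < 2;

        for my $path (@same_size) {
            open my $fh, '<:raw', $path or next;    # skip unreadable files
            my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
            close $fh;
            push @{ $by_digest{"$size:$digest"} }, $path;
        }
    }

    # Keep only the groups that really contain more than one file.
    return grep { @$_ > 1 } values %by_digest;
}
```

The groups come back in no particular order, so they would still need to be sorted and tab-joined according to the output spec before printing.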