From reini.urban at gmail.com Fri Apr 1 00:16:08 2016
From: reini.urban at gmail.com (Reini Urban)
Date: Fri, 1 Apr 2016 09:16:08 +0200
Subject: [pm-h] Houston.pm April meeting at Hostgator...
In-Reply-To: 
References: <20160327193032.0e47dbf3@cygnus>
Message-ID: <66087191-8AC8-425D-A40E-81A1E8D30561@gmail.com>

> On Mar 29, 2016, at 2:40 PM, Julian Brown via Houston wrote:
> 
> If no one has anything, I was intrigued by a paper that came over the wires this week, url below.
> 
> It is an optimization scheme for the kernel/client interaction that shows some significant promise, especially for web-servers and databases.
> 
> I will not be able to get any code, but I am willing to put a keynote together on the highlights; it might lead to some discussions.
> 
> Here is the url to the paper: https://www.usenix.org/legacy/events/osdi10/tech/full_papers/Soares.pdf

I thought about supporting a :-die attribute on functions. Perl function calls are very slow due to excessive exception-handling code, which can be avoided if the compiler knows beforehand that the function, and every function it calls, will not call die. Calls to such a function can then be much faster. I haven't implemented it yet, though.

sub simple :-die { $a + 1 }

So far there is only a :-caller attribute to mimic proper tailcalls, like the SCOPE: NO attribute in XS functions. It doesn't record itself in the CALLER chain. I was not happy with that, so it's not merged into cperl master.

> I found a slideshare on it as well: http://www.slideshare.net/liviosoares/flexsc-exceptionless-system-calls-presented-osdi-2010

From julian at jlbprof.com Fri Apr 1 04:32:04 2016
From: julian at jlbprof.com (Julian Brown)
Date: Fri, 1 Apr 2016 06:32:04 -0500
Subject: [pm-h] Houston.pm April meeting at Hostgator...
In-Reply-To: <20160331215538.076804e8@cygnus>
References: <20160327193032.0e47dbf3@cygnus> <540EFF64-C225-4305-A7AE-38DB8F3D91AE@cpanel.net> <20160331215538.076804e8@cygnus>
Message-ID: 

Wade, yes, use the name of the slide set.

On Thu, Mar 31, 2016 at 9:55 PM, G. Wade Johnson via Houston wrote:
> On Wed, 30 Mar 2016 11:52:35 -0500
> Julian Brown via Houston wrote:
> 
> > OK I will present it.
> 
> Sounds like a winner.
> Do you want to use the same title as the slides?
> 
> G. Wade
> 
> > Julian
> >
> > On Wed, Mar 30, 2016 at 9:19 AM, Todd Rinaldo
> > wrote:
> >
> > > Sounds interesting to me
> > >
> > > On Mar 29, 2016, at 7:40 AM, Julian Brown via Houston
> > > wrote:
> > >
> > > If no one has anything, I was intrigued by a paper that came over
> > > the wires this week, url below.
> > >
> > > It is an optimization scheme for the kernel/client interaction that
> > > shows some significant promise, especially for web-servers and
> > > databases.
> > >
> > > I will not be able to get any code, but I am willing to put a keynote
> > > together on the highlights; it might lead to some discussions.
> > >
> > > Here is the url to the paper:
> > https://www.usenix.org/legacy/events/osdi10/tech/full_papers/Soares.pdf
> > >
> > > I found a slideshare on it as well:
> > http://www.slideshare.net/liviosoares/flexsc-exceptionless-system-calls-presented-osdi-2010
> > >
> > > Anyway
> > >
> > >
> > > On Sun, Mar 27, 2016 at 7:30 PM, G. Wade Johnson via Houston <
> > > houston at pm.org> wrote:
> > >
> > >> Our next meeting is April 14 at Hostgator.
> > >>
> > >> Do we have any volunteers to present? Or, any topics to cover?
> > >>
> > >> G. Wade
> > >> --
> > >> Be careful about reading health books. You may die of a misprint.
> > >> -- Mark
> > >> Twain
> > >> _______________________________________________
> > >> Houston mailing list
> > >> Houston at pm.org
> > >> http://mail.pm.org/mailman/listinfo/houston
> > >> Website: http://houston.pm.org/
> > >>
> > >
> > > _______________________________________________
> > > Houston mailing list
> > > Houston at pm.org
> > > http://mail.pm.org/mailman/listinfo/houston
> > > Website: http://houston.pm.org/
> >
> --
> "No Boom today. Boom tomorrow, There's always a boom tomorrow."
>       -- Ivanova, "Grail"
> _______________________________________________
> Houston mailing list
> Houston at pm.org
> http://mail.pm.org/mailman/listinfo/houston
> Website: http://houston.pm.org/
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From estrabd at gmail.com Fri Apr 1 16:02:08 2016
From: estrabd at gmail.com (B. Estrade)
Date: Fri, 1 Apr 2016 18:02:08 -0500
Subject: [pm-h] contractor for web site design
In-Reply-To: 
References: 
Message-ID: <4A87148D-7D3E-46DC-8FC9-766496E547A0@gmail.com>

Honestly, I'd check out Fiverr. I can also recommend someone.

Brett

> On Apr 1, 2016, at 12:59 AM, Russell Harris via Houston wrote:
> 
> I am looking for a freelance developer to build a simple web site. I
> prefer the site to be built directly with HTML5 and CSS, rather than with
> a content management system such as Drupal or WordPress. There is no
> deadline, but it would be nice to have a site running within a few months.
> 
> = The web site is purely static, with no dynamic features. The only
> changes to the site occur when new documents are added.
> 
> = Graphic design is not important.
> 
> = The function of the web site is to be an index to and a server for
> documents authored by myself. Each page of the site consists of little
> more than a series of links to documents. The documents are arranged in
> several categories, with one category per web page.
> 
> = The site is not a blog; thus, I have no need for features such as a
> calendar, sorting, and searching.
> 
> = I do not wish to make provision on the site for comments or discussion.
> 
> = I do not wish to make provision on the site for e-commerce.
> 
> = It is important that, whenever a new document is added, necessary
> changes to the code can be made using nothing more than a text editor.
> 
> I could do the coding myself, after brushing up on HTML and CSS. However,
> because of other obligations, I prefer to pay a contractor to provide a
> turn-key solution.
> 
> I can provide additional detail to anyone who is interested.
> 
> Russell Harris
> 31 March 2016
> 
> 713-461-0081
> rlharris at oplink.net
> 
> 
> _______________________________________________
> Houston mailing list
> Houston at pm.org
> http://mail.pm.org/mailman/listinfo/houston
> Website: http://houston.pm.org/

From mrallen1 at yahoo.com Fri Apr 1 19:22:34 2016
From: mrallen1 at yahoo.com (Mark Allen)
Date: Sat, 2 Apr 2016 02:22:34 +0000 (UTC)
Subject: [pm-h] contractor for web site design
In-Reply-To: 
References: 
Message-ID: <1478651518.1408818.1459563755002.JavaMail.yahoo@mail.yahoo.com>

You should spend an hour learning about / trying out Contenticious ("build web sites from markdown files"). It sounds like about 95% of what you want right out of the box. All you have to do is write the simple markdown files, execute one script, and then upload the output files to your web server.
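For anyone curious what the markdown-files-to-web-site idea looks like in practice, here is a minimal sketch in plain Perl. It is not Contenticious itself (Contenticious is Mojolicious-based and does much more); only Text::Markdown's markdown() function is a real CPAN interface here, and the pages/public directory layout is an assumption of this sketch:

    #!/usr/bin/perl
    # render every .md file under pages/ into an .html file under public/
    use strict;
    use warnings;
    use Text::Markdown 'markdown';   # CPAN module

    my ($src, $dst) = ('pages', 'public');   # hypothetical layout
    mkdir $dst unless -d $dst;

    opendir my $dh, $src or die "can't open $src: $!";
    for my $file (grep { /\.md\z/ } readdir $dh) {
        open my $in, '<', "$src/$file" or die "$src/$file: $!";
        my $html = markdown(do { local $/; <$in> });   # slurp and convert

        (my $out_name = $file) =~ s/\.md\z/.html/;
        open my $out, '>', "$dst/$out_name" or die "$dst/$out_name: $!";
        print {$out} "<!DOCTYPE html>\n<html><body>\n$html</body></html>\n";
    }

Run it once after editing any markdown file, then upload public/ to the web server -- which fits the edit-with-nothing-more-than-a-text-editor requirement in the original request.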
On Friday, April 1, 2016 5:39 PM, Russell Harris via Houston wrote:

I am looking for a freelance developer to build a simple web site. I prefer the site to be built directly with HTML5 and CSS, rather than with a content management system such as Drupal or WordPress. There is no deadline, but it would be nice to have a site running within a few months.

= The web site is purely static, with no dynamic features. The only changes to the site occur when new documents are added.

= Graphic design is not important.

= The function of the web site is to be an index to and a server for documents authored by myself. Each page of the site consists of little more than a series of links to documents. The documents are arranged in several categories, with one category per web page.

= The site is not a blog; thus, I have no need for features such as a calendar, sorting, and searching.

= I do not wish to make provision on the site for comments or discussion.

= I do not wish to make provision on the site for e-commerce.

= It is important that, whenever a new document is added, necessary changes to the code can be made using nothing more than a text editor.

I could do the coding myself, after brushing up on HTML and CSS. However, because of other obligations, I prefer to pay a contractor to provide a turn-key solution.

I can provide additional detail to anyone who is interested.

Russell Harris
31 March 2016

713-461-0081
rlharris at oplink.net

_______________________________________________
Houston mailing list
Houston at pm.org
http://mail.pm.org/mailman/listinfo/houston
Website: http://houston.pm.org/
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From tom.browder at gmail.com Tue Apr 5 11:02:09 2016
From: tom.browder at gmail.com (Tom Browder)
Date: Tue, 5 Apr 2016 13:02:09 -0500
Subject: [pm-h] contractor for web site design
In-Reply-To: <1478651518.1408818.1459563755002.JavaMail.yahoo@mail.yahoo.com>
References: <1478651518.1408818.1459563755002.JavaMail.yahoo@mail.yahoo.com>
Message-ID: 

On Fri, Apr 1, 2016 at 9:22 PM, Mark Allen via Houston wrote:
> You should spend an hour learning about / trying out Contenticious
>
> Contenticious

Excellent suggestion, Mark! It's a great module--thanks for sharing.

Best regards,

-Tom

From gwadej at anomaly.org Thu Apr 7 05:34:29 2016
From: gwadej at anomaly.org (G. Wade Johnson)
Date: Thu, 7 Apr 2016 07:34:29 -0500
Subject: [pm-h] Houston.pm April Meeting: FlexSC: Exception-Less System Calls
Message-ID: <20160407073429.5ed8b132@cygnus>

Our next meeting is Thursday, April 14 at HostGator
(https://maps.google.com/maps?q=HostGator,+Houston&fb=1&gl=us&hq=HostGator,&hnear=0x8640b8b4488d8501:0xca0d02def365053b,Houston,+TX&cid=2141572779937723859&t=h&z=16&iwloc=A).
The meeting starts at 7pm.

Julian Brown will present on a low-level technique for making system calls more efficient.

Hope to see you there.

G. Wade
-- 
Virtual is when it's not but it looks like it is and transparent is when it is but it looks like it isn't.
-- Rick Hoselton

From tom.browder at gmail.com Fri Apr 8 08:39:10 2016
From: tom.browder at gmail.com (Tom Browder)
Date: Fri, 8 Apr 2016 10:39:10 -0500
Subject: [pm-h] contractor for web site design
In-Reply-To: <1478651518.1408818.1459563755002.JavaMail.yahoo@mail.yahoo.com>
References: <1478651518.1408818.1459563755002.JavaMail.yahoo@mail.yahoo.com>
Message-ID: 

On Fri, Apr 1, 2016 at 9:22 PM, Mark Allen via Houston wrote:
> You should spend an hour learning about / trying out Contenticious
>
> Contenticious
>
> build web sites from markdown files
>
> It sounds like about 95% of what you want right out of the box. All you

Great comment, Mark, and a great find--thanks a heap!

Best regards,

-Tom

From tom.browder at gmail.com Fri Apr 8 08:54:41 2016
From: tom.browder at gmail.com (Tom Browder)
Date: Fri, 8 Apr 2016 10:54:41 -0500
Subject: [pm-h] contractor for web site design
In-Reply-To: 
References: <1478651518.1408818.1459563755002.JavaMail.yahoo@mail.yahoo.com>
Message-ID: 

On Fri, Apr 8, 2016 at 10:43 AM, Matt Dees wrote:
> If you are interested in that class of tools, I would also check out:
>
> http://blog.getpelican.com/
> https://ghost.org/

Interesting--thanks. I have been looking for an add-on for some of my sites to allow content submission by members and have looked at several wiki and photo album projects, but none so far are quite what I am looking for (Foswiki and Coppermine are the closest I've found, and should be usable). I really want a Perl project if at all possible, and it must integrate with my mostly static sites with no major rework.

Any ideas?

Thanks, again, Mark

-Tom

From mrallen1 at yahoo.com Fri Apr 8 10:28:01 2016
From: mrallen1 at yahoo.com (Mark Allen)
Date: Fri, 8 Apr 2016 17:28:01 +0000 (UTC)
Subject: [pm-h] contractor for web site design
In-Reply-To: 
References: 
Message-ID: <1330176057.1790880.1460136482126.JavaMail.yahoo@mail.yahoo.com>

There's Galileo CMS, which is also built on top of Mojolicious, Tom. It's on metacpan too if you want to see it.

On Friday, April 8, 2016 10:55 AM, Tom Browder wrote:

On Fri, Apr 8, 2016 at 10:43 AM, Matt Dees wrote:
> If you are interested in that class of tools, I would also check out:
>
> http://blog.getpelican.com/
> https://ghost.org/

Interesting--thanks. I have been looking for an add-on for some of my sites to allow content submission by members and have looked at several wiki and photo album projects, but none so far are quite what I am looking for (Foswiki and Coppermine are the closest I've found, and should be usable). I really want a Perl project if at all possible, and it must integrate with my mostly static sites with no major rework.

Any ideas?

Thanks, again, Mark

-Tom
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From tom.browder at gmail.com Fri Apr 8 11:17:34 2016
From: tom.browder at gmail.com (Tom Browder)
Date: Fri, 8 Apr 2016 13:17:34 -0500
Subject: [pm-h] contractor for web site design
In-Reply-To: <1330176057.1790880.1460136482126.JavaMail.yahoo@mail.yahoo.com>
References: <1330176057.1790880.1460136482126.JavaMail.yahoo@mail.yahoo.com>
Message-ID: 

On Fri, Apr 8, 2016 at 12:28 PM, Mark Allen wrote:
> Galileo CMS

Thanks, Mark!

-Tom

From mikeflan at att.net Sat Apr 9 17:54:32 2016
From: mikeflan at att.net (Mike Flannigan)
Date: Sat, 9 Apr 2016 19:54:32 -0500
Subject: [pm-h] Device::USB::Device
Message-ID: <5709A448.8070400@att.net>

Today I was trying to get a TemperNTC temperature sensor to work with Device::USB::PCSensor::HidTEMPer on Strawberry Perl on Windows.
That didn't work out in the little time I allotted to it, but I did see that Wade Johnson and Paul Archer of the Houston Perl Mongers group are authors of Device::USB::Device. I thought that was cool.

Mike

From gwadej at anomaly.org Sun Apr 10 06:45:19 2016
From: gwadej at anomaly.org (G. Wade Johnson)
Date: Sun, 10 Apr 2016 08:45:19 -0500
Subject: [pm-h] Device::USB::Device
In-Reply-To: <5709A448.8070400@att.net>
References: <5709A448.8070400@att.net>
Message-ID: <20160410084519.465e36f4@cygnus>

On Sat, 9 Apr 2016 19:54:32 -0500
Mike Flannigan via Houston wrote:

> Today I was trying to get a TemperNTC temperature
> sensor to work with Device::USB::PCSensor::HidTEMPer
> on Strawberry Perl on Windows. That didn't work out
> in the little time I allotted to it, but I did see
> that Wade Johnson and Paul Archer of the Houston Perl
> Mongers group are authors of Device::USB::Device.
> I thought that was cool.

Paul Archer did the code for that project for three Houston.pm meetings at the beginning of 2006.

G. Wade

> Mike
>
> _______________________________________________
> Houston mailing list
> Houston at pm.org
> http://mail.pm.org/mailman/listinfo/houston
> Website: http://houston.pm.org/

-- 
If you like laws and sausages, you should never watch either one being made.
      -- Otto von Bismarck

From drzigman at drzigman.com Fri Apr 15 09:33:09 2016
From: drzigman at drzigman.com (Robert Stone)
Date: Fri, 15 Apr 2016 11:33:09 -0500
Subject: [pm-h] Paper Relevant to Our Last Meeting - The Linux Scheduler: a Decade of Wasted Cores
Message-ID: 

Greetings,

Given our discussion last night (great talk btw, thank you!) I found it interesting that the following paper appeared on Hacker News this morning.

The paper - http://www.ece.ubc.ca/~sasha/papers/eurosys16-final29.pdf
The Hacker News thread - https://news.ycombinator.com/item?id=11501493

An interesting read if you are a fan of schedulers, and I mean, who doesn't love a good Fair Scheduler? Besides, it's bursting with fun quotes from Linus like:

"I suspect that making the scheduler use per-CPU queues together with some inter-CPU load balancing logic is probably trivial. Patches already exist, and I don't feel that people can screw up the few hundred lines too badly."

Challenge accepted!

Best Regards,
Robert Stone
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From julian at jlbprof.com Tue Apr 26 09:09:13 2016
From: julian at jlbprof.com (Julian Brown)
Date: Tue, 26 Apr 2016 11:09:13 -0500
Subject: [pm-h] Linux Scheduler
Message-ID: 

https://blog.acolyer.org/2016/04/26/the-linux-scheduler-a-decade-of-wasted-cores/
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From gwadej at anomaly.org Tue Apr 26 19:50:19 2016
From: gwadej at anomaly.org (G. Wade Johnson)
Date: Tue, 26 Apr 2016 21:50:19 -0500
Subject: [pm-h] April meeting notes are on-line
Message-ID: <20160426215019.21205c6e@cygnus>

The notes from the last meeting are finally on-line at
http://houston.pm.org/talks/2016talks/1604Talk/

G. Wade
-- 
To vacillate or not to vacillate, that is the question ... or is it?

From anaremore at hostgator.com Tue Apr 26 20:07:30 2016
From: anaremore at hostgator.com (Austin Naremore)
Date: Tue, 26 Apr 2016 22:07:30 -0500
Subject: [pm-h] April meeting notes are on-line
In-Reply-To: <20160426215019.21205c6e@cygnus>
References: <20160426215019.21205c6e@cygnus>
Message-ID: 

Minor typo... 'ooverhead'.

Unless you meant *oo*verhead like M*oo*se jk jk <3

On Tue, Apr 26, 2016 at 9:50 PM, G.
Wade Johnson via Houston wrote:

> The notes from the last meeting are finally on-line at
> http://houston.pm.org/talks/2016talks/1604Talk/
>
> G. Wade
> --
> To vacillate or not to vacillate, that is the question ... or is it?
> _______________________________________________
> Houston mailing list
> Houston at pm.org
> http://mail.pm.org/mailman/listinfo/houston
> Website: http://houston.pm.org/

-- 
*Austin Naremore*
HTX Development Manager
HostGator.com
anaremore at hostgator.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From gwadej at anomaly.org Tue Apr 26 20:51:43 2016
From: gwadej at anomaly.org (G. Wade Johnson)
Date: Tue, 26 Apr 2016 22:51:43 -0500
Subject: [pm-h] April meeting notes are on-line
In-Reply-To: 
References: <20160426215019.21205c6e@cygnus>
Message-ID: <20160426225143.088edc6c@cygnus>

On Tue, 26 Apr 2016 22:07:30 -0500
Austin Naremore via Houston wrote:

> Minor typo... 'ooverhead'.
>
> Unless you meant *oo*verhead like M*oo*se jk jk <3

Thanks. I oobviously goofed oon that oone.

G. Wade

> On Tue, Apr 26, 2016 at 9:50 PM, G. Wade Johnson via Houston
> wrote:
>
> > The notes from the last meeting are finally on-line at
> > http://houston.pm.org/talks/2016talks/1604Talk/
> >
> > G. Wade

-- 
The computer should be doing the hard work. That's what it's paid to do, after all.
      -- Larry Wall

From gwadej at anomaly.org Wed Apr 27 16:24:57 2016
From: gwadej at anomaly.org (G. Wade Johnson)
Date: Wed, 27 Apr 2016 18:24:57 -0500
Subject: [pm-h] Topic for May Houston.pm meeting
Message-ID: <20160427182457.2d776a46@cygnus>

Our next meeting is on May 12 at cPanel. So, once again, it's time to ask for volunteers for that presentation.

Any ideas, opinions, presentations, etc.?

G. Wade
-- 
You write code as if the person who will maintain your code is a violent psychopath who knows where you live.
      -- John F. Woods

From mev412 at gmail.com Fri Apr 29 20:31:16 2016
From: mev412 at gmail.com (Mev412)
Date: Fri, 29 Apr 2016 22:31:16 -0500
Subject: [pm-h] Binary Search Tree File
Message-ID: 

The conversation at the last meeting sparked my interest, so I implemented a file-based binary search tree.

https://github.com/despertargz/tree-binary-search-file/blob/master/lib/Tree/Binary/Search/File.pm

Build the file:
my $file = Tree::Binary::Search::File->new("/tmp/test-bst");
$file->write_file({ test => "blah" });

Reading values:
my $file = Tree::Binary::Search::File->new("/tmp/test-bst");
my $value = $file->get("test");

It performed well against a file-based linear search. You can see how the linear search time doubles as the record count doubles. I haven't measured how close to O(log n) it is, but it appears to do well. It barely flinches when going from 1 to 2 million records.
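A note on reading the numbers below: they can be reproduced with a tiny Time::HiRes harness along these lines. The new/write_file/get calls are the module's interface as shown above; the key scheme and the choice of the last key as the "worst case" are this sketch's assumptions, not necessarily what was actually measured:

    use strict;
    use warnings;
    use Time::HiRes qw(time);
    use Tree::Binary::Search::File;

    for my $exp (10 .. 21) {                  # 1024 .. 2097152 records
        my $n    = 2**$exp;
        my %data = map { sprintf('key%09d', $_) => "value$_" } 1 .. $n;

        my $bst = Tree::Binary::Search::File->new('/tmp/bench-bst');
        $bst->write_file(\%data);

        my $t0 = time;                        # Time::HiRes float time()
        $bst->get(sprintf 'key%09d', $n);     # look up the last key
        printf "%d, %.15g\n", $n, time - $t0;
    }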
Time is seconds to locate single record (worst-case scenario)

# of records, binary-search-tree, linear
1024, 4.48226928710938e-05, 0.000963926315307617
2048, 4.31537628173828e-05, 0.00278782844543457
4096, 3.38554382324219e-05, 0.00162196159362793
8192, 5.07831573486328e-05, 0.0121698379516602
16384, 4.60147857666016e-05, 0.0115268230438232
32768, 6.58035278320312e-05, 0.0142660140991211
65536, 0.000729084014892578, 0.0285739898681641
131072, 0.00218009948730469, 0.0539009571075439
262144, 0.00141692161560059, 0.1079261302948
524288, 0.0019831657409668, 0.214764833450317
1048576, 0.00240302085876465, 0.434930086135864
2097152, 0.00240802764892578, 0.875269889831543

The header format is
[left-node position][right-node position][value length][key][value]

It currently uses a static key size, so it can read in the key along with the rest of the header. This takes up more disk space but should be faster than an extra read. If there's any natural buffering of the file, though, then this may not incur a performance penalty, so I'll have to benchmark.

This is the main search logic:

my $header;
read $fh, $header, $HEADER_SIZE;

# unpack the fixed-size node header: two child offsets and the value length
my $file_key = substr($header, 12, $KEY_SIZE);
my $val_len = unpack("V", substr($header, 8, 4));
my $right = unpack("V", substr($header, 4, 4));
my $left = unpack("V", substr($header, 0, 4));

my $comp = $key cmp $file_key;

if ($comp == 0) {
    # exact match: the value bytes follow the header
    my $val;
    read $fh, $val, $val_len;
    return $val;
}
elsif ($comp == -1) {
    if ($left == 0) {
        return undef;    # offset 0 marks a missing child
    }
    $self->find_key($key, $left);
}
else {
    if ($right == 0) {
        return undef;
    }
    $self->find_key($key, $right);
}

The writing of the file sorts the key/value pairs and builds a BST; while building the BST, a 'flat' list of the nodes is built along with the positions of their left and right nodes. Recording the position of the node itself made writing the file easier. This list is then fed to a method which writes each node to the file.

The writing of the file is not memory-efficient, as it builds the BST in memory for simplicity, though this cost is only incurred once, when the file is written. If it could both insert into and balance the file-based tree, that would be ideal, so I'll have to look into some ways to do that.

Another consideration would be storing all the values at the end of the file so the headers run sequentially. Especially if the values are longer, this could improve cache hits / buffering.

It's a work in progress, as I need to make some methods private, make the key size configurable, and add documentation and tests; then I might see if I can upload it to CPAN.

Anyways, just wanted to share. Let me know what you think. Always enjoy the talks and the technical discussions that ensue :)

Best Regards,
Christopher Mevissen

_______________________________________________
Houston mailing list
Houston at pm.org
http://mail.pm.org/mailman/listinfo/houston
Website: http://houston.pm.org/

From reini.urban at gmail.com Fri Apr 29 23:51:42 2016
From: reini.urban at gmail.com (Reini Urban)
Date: Sat, 30 Apr 2016 08:51:42 +0200
Subject: [pm-h] Binary Search Tree File
In-Reply-To: 
References: 
Message-ID: <58F725B6-480B-43A2-B656-0D96C425AA6B@gmail.com>

Ordinary BSTs are not really state of the art any more on modern CPUs. The overhead of the two absolute pointers trashes the L1 cache, the very same problem as with our perl5 op tree overhead. One of the reasons why perl5 is so slow. Ditto linked lists.

I also saw you trying it with Config, and there you can easily see how my gperf (a static perfect hash) outperforms the BST.
https://github.com/toddr/ConfigDat vs https://github.com/perl11/p5-Config

And the gperf hash is not always the best method; I just haven't had enough time to finish my Perfect::Hash module, which comes up with better characteristics than gperf in some cases. bulk88 optimized the hell out of it lately.
https://github.com/rurban/Perfect-Hash#benchmarks

State of the art, besides properly implemented hash tables (i.e. NOT perl5 hash tables), are Van Emde Boas binary search tries, which perform much better than ordinary binary search tries. Note: trie != tree. No right-left pointers needed. But even with the advantage of a trie, the traditional binary search layout is not optimal anymore.

https://en.wikipedia.org/wiki/Van_Emde_Boas_tree

Radix trees with optimizations on word sizes (Patricia trie) also perform much better, e.g. judy or HAT-trie. A good HAT-trie is as fast as a proper hash table, esp. for smaller sizes.

Some links:
search for Cache Oblivious Search Tree
nice maps:
https://www.cs.utexas.edu/~pingali/CS395T/2013fa/lectures/MemoryOptimizations_2013.pdf

Reini Urban
rurban at cpan.org

> On Apr 30, 2016, at 5:31 AM, Mev412 via Houston wrote:
>
> The conversation at the last meeting sparked my interest to implement the file-based binary search tree.
>
> https://github.com/despertargz/tree-binary-search-file/blob/master/lib/Tree/Binary/Search/File.pm
>
> Build the file:
> my $file = Tree::Binary::Search::File->new("/tmp/test-bst");
> $file->write_file({ test => "blah" });
>
> Reading values:
> my $file = Tree::Binary::Search::File->new("/tmp/test-bst");
> my $value = $file->get("test");
>
> It performed well against a file-based, linear search. You can see how the linear search doubles as the records doubles. Haven't measured to see how close to O(log n) it is, but it appears to do well. It barely flinches when going from 1 to 2 million records.
>
> Time is seconds to locate single record (worst-case-scenario)
>
> # of records, binary-search-tree, linear
> 1024, 4.48226928710938e-05, 0.000963926315307617
> 2048, 4.31537628173828e-05, 0.00278782844543457
> 4096, 3.38554382324219e-05, 0.00162196159362793
> 8192, 5.07831573486328e-05, 0.0121698379516602
> 16384, 4.60147857666016e-05, 0.0115268230438232
> 32768, 6.58035278320312e-05, 0.0142660140991211
> 65536, 0.000729084014892578, 0.0285739898681641
> 131072, 0.00218009948730469, 0.0539009571075439
> 262144, 0.00141692161560059, 0.1079261302948
> 524288, 0.0019831657409668, 0.214764833450317
> 1048576, 0.00240302085876465, 0.434930086135864
> 2097152, 0.00240802764892578, 0.875269889831543
>
> The header format is
> [left-node position][right-node position][value length][key][value]
>
> It currently uses a static key size, so it can read in the key along with the rest of the header. This takes up more disk space but should be faster than an extra read. If there's any natural buffering of the file though then this may not incur a performance penalty so I'll have to benchmark.
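To make the record layout in the quoted paragraphs concrete: a node under this scheme can be built and taken apart with pack/unpack. The 'V' (little-endian uint32) fields mirror the unpack() calls in the post; $KEY_SIZE, the NUL padding, and the helper names are assumptions of this sketch rather than the module's actual internals:

    use strict;
    use warnings;

    my $KEY_SIZE    = 32;                 # fixed key width (assumed)
    my $HEADER_SIZE = 12 + $KEY_SIZE;     # 3 x uint32 + padded key

    sub pack_node {
        my ($left, $right, $key, $value) = @_;
        return pack('VVV', $left, $right, length $value)
             . pack("a$KEY_SIZE", $key)   # NUL-padded to fixed width
             . $value;
    }

    sub unpack_header {
        my ($bytes) = @_;
        my ($left, $right, $val_len) = unpack 'VVV', $bytes;
        my $key = unpack "a$KEY_SIZE", substr($bytes, 12);
        $key =~ s/\0+\z//;                # strip the padding
        return ($left, $right, $val_len, $key);
    }

A fixed-size header like this is what makes a single seek-plus-read per node possible: the reader always grabs $HEADER_SIZE bytes, then reads exactly $val_len more bytes only on a hit.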
> > This is the main search logic > > my $header; > read $fh, $header, $HEADER_SIZE; > > my $file_key = substr($header, 12, $KEY_SIZE); > my $val_len = unpack("V", substr($header, 8, 4)); > my $right = unpack("V", substr($header, 4, 4)); > my $left = unpack("V", substr($header, 0, 4)); > > my $comp = $key cmp $file_key; > > if ($comp == 0) { > my $val; > read $fh, $val, $val_len; > return $val; > } > elsif ($comp == -1) { > if ($left == 0) { > return undef; > } > > $self->find_key($key, $left); > } > else { > if ($right == 0) { > return undef; > } > > $self->find_key($key, $right); > } > > The writing of the file sorts the key/value pairs, builds a BST, while building the BST a 'flat' list of the nodes is built along with the positions of their left and right node. Recording the position of the node itself made writing the file easier, then this is fed to a method which writes each node to the file. > > The writing of the file is not memory-efficient as it builds the BST in memory for simplicity, though this cost is only incurred once when the file is written. If it could both insert and balance the file-based tree then this would be ideal so I'll have to look into some ways to do that. > > Another consideration would be storing all the values at the end of the file so the headers run sequentially. Especially if the values are longer, this could improve cache hits / buffering. > > It's a work-in-progress as I need to make some methods private, make key-size configurable, add documentation and tests, then I might see if I can upload to cpan. > > Anyways, just wanted to share. Let me know what you think. Always enjoy the talks and the technical discussions that ensue :) > > > Best Regards, > Christopher Mevissen > > _______________________________________________ > Houston mailing list > Houston at pm.org > http://mail.pm.org/mailman/listinfo/houston > Website: http://houston.pm.org/ From mev412 at gmail.com Sat Apr 30 10:09:05 2016 From: mev412 at gmail.com (Mev412) Date: Sat, 30 Apr 2016 12:09:05 -0500 Subject: [pm-h] Binary Search Tree File In-Reply-To: <58F725B6-480B-43A2-B656-0D96C425AA6B@gmail.com> References: <58F725B6-480B-43A2-B656-0D96C425AA6B@gmail.com> Message-ID: Can't say this was meant to be "state of the art". Todd mentioned the perfect hashes, but the original problem was how to lower the memory footprint as much as possible. So any data structure stored completely in memory wouldn't be ideal. This was the motivation to use something where searching could happen by seeking through the file rather than in-memory. As far as L1 cache, the keys are stored sequentially on disk so this should utilize cache a lot better than random in-memory pointers. Best Regards, Christopher Mevissen On Sat, Apr 30, 2016 at 1:51 AM, Reini Urban wrote: > Ordinary BST?s are not really state of the art any more on modern CPU?s. > The overhead of the two absolute pointers trash the L1 cache, the very > same problem > as with our perl5 op tree overhead. One of the reasons why perl5 is so > slow. Ditto linked lists. > > I also saw you trying it with Config, and there you can easily see how my > gperf (a static perfect hash) outperforms the BST. > > https://github.com/toddr/ConfigDat vs https://github.com/perl11/p5-Config > > And the gperf hash is not always the best method, I just haven?t had > enough time to finish > my Perfect::Hash module which comes up with better characteristics then > gperf in some cases. > bulk88 optimized the hell out of it lately. 
> https://github.com/rurban/Perfect-Hash#benchmarks > > State of the art besides properly implemented hash tables (i.e. NOT perl5 > hash tables) > are Van Emde Boas binary search tries, which perform much better than > ordinary binary search tries, > Note: trie != tree. No right-left pointers needed. But even with the > advantage of a trie, the traditional > binary search layout is not optimal anymore. > > https://en.wikipedia.org/wiki/Van_Emde_Boas_tree > > radix trees with optimizations on word-sizes (Patricia trie) also perform > much better, e.g. judy or HAT-trie. > A good HAT-trie is as fast as a proper hash table, esp. for smaller sizes. > > Some links: > search for Cache Oblivious Search Tree > nice maps: > > https://www.cs.utexas.edu/~pingali/CS395T/2013fa/lectures/MemoryOptimizations_2013.pdf > > > Reini Urban > rurban at cpan.org > > > > > On Apr 30, 2016, at 5:31 AM, Mev412 via Houston wrote: > > > > The conversation at the last meeting sparked my interest to implement > the file-based binary search tree. > > > > > https://github.com/despertargz/tree-binary-search-file/blob/master/lib/Tree/Binary/Search/File.pm > > > > Build the file: > > my $file = Tree::Binary::Search::File->new("/tmp/test-bst"); > > $file->write_file({ test => "blah" }); > > > > Reading values: > > my $file = Tree::Binary::Search::File->new("/tmp/test-bst"); > > my $value = $file->get("test"); > > > > It performed well against a file-based, linear search. You can see how > the linear search doubles as the records doubles. Haven't measured to see > how close to O(log n) it is, but it appears to do well. It barely flinches > when going from 1 to 2 million records. > > > > Time is seconds to locate single record (worst-case-scenario) > > > > # of records, binary-search-tree, linear > > 1024, 4.48226928710938e-05, 0.000963926315307617 > > 2048, 4.31537628173828e-05, 0.00278782844543457 > > 4096, 3.38554382324219e-05, 0.00162196159362793 > > 8192, 5.07831573486328e-05, 0.0121698379516602 > > 16384, 4.60147857666016e-05, 0.0115268230438232 > > 32768, 6.58035278320312e-05, 0.0142660140991211 > > 65536, 0.000729084014892578, 0.0285739898681641 > > 131072, 0.00218009948730469, 0.0539009571075439 > > 262144, 0.00141692161560059, 0.1079261302948 > > 524288, 0.0019831657409668, 0.214764833450317 > > 1048576, 0.00240302085876465, 0.434930086135864 > > 2097152, 0.00240802764892578, 0.875269889831543 > > > > The header format is > > [left-node position][right-node position][value length][key][value] > > > > It currently uses a static key size, so it can read in the key along > with the rest of the header. This takes up more disk space but should be > faster than an extra read. If there's any natural buffering of the file > though then this may not incur a performance penalty so I'll have to > benchmark. 
> > > > This is the main search logic > > > > my $header; > > read $fh, $header, $HEADER_SIZE; > > > > my $file_key = substr($header, 12, $KEY_SIZE); > > my $val_len = unpack("V", substr($header, 8, 4)); > > my $right = unpack("V", substr($header, 4, 4)); > > my $left = unpack("V", substr($header, 0, 4)); > > > > my $comp = $key cmp $file_key; > > > > if ($comp == 0) { > > my $val; > > read $fh, $val, $val_len; > > return $val; > > } > > elsif ($comp == -1) { > > if ($left == 0) { > > return undef; > > } > > > > $self->find_key($key, $left); > > } > > else { > > if ($right == 0) { > > return undef; > > } > > > > $self->find_key($key, $right); > > } > > > > The writing of the file sorts the key/value pairs, builds a BST, while > building the BST a 'flat' list of the nodes is built along with the > positions of their left and right node. Recording the position of the node > itself made writing the file easier, then this is fed to a method which > writes each node to the file. > > > > The writing of the file is not memory-efficient as it builds the BST in > memory for simplicity, though this cost is only incurred once when the file > is written. If it could both insert and balance the file-based tree then > this would be ideal so I'll have to look into some ways to do that. > > > > Another consideration would be storing all the values at the end of the > file so the headers run sequentially. Especially if the values are longer, > this could improve cache hits / buffering. > > > > It's a work-in-progress as I need to make some methods private, make > key-size configurable, add documentation and tests, then I might see if I > can upload to cpan. > > > > Anyways, just wanted to share. Let me know what you think. Always enjoy > the talks and the technical discussions that ensue :) > > > > > > Best Regards, > > Christopher Mevissen > > > > _______________________________________________ > > Houston mailing list > > Houston at pm.org > > http://mail.pm.org/mailman/listinfo/houston > > Website: http://houston.pm.org/ > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From reini.urban at gmail.com Sat Apr 30 10:14:24 2016 From: reini.urban at gmail.com (Reini Urban) Date: Sat, 30 Apr 2016 19:14:24 +0200 Subject: [pm-h] Binary Search Tree File In-Reply-To: References: <58F725B6-480B-43A2-B656-0D96C425AA6B@gmail.com> Message-ID: <35DDCCA7-ED83-4330-8B7A-9B9E4A3A3230@gmail.com> No, the 1st cache argument is to skip the 2 unnecessary left and right pointers and use a trie instead. we do that even perl5. the 2nd cache argument is to use the Van Emde Boas layout and not the simple binary search trie layout. the 3rd cache argument, is to use a good radix trie, and not a BST at all. if memory is a problem a bloom filter in front of it will beat everything. my bloom filter deduper for 1G files used 100x less memory than all the others. Reini Urban rurban at cpan.org > On Apr 30, 2016, at 7:09 PM, Mev412 wrote: > > Can't say this was meant to be "state of the art". Todd mentioned the perfect hashes, but the original problem was how to lower the memory footprint as much as possible. So any data structure stored completely in memory wouldn't be ideal. This was the motivation to use something where searching could happen by seeking through the file rather than in-memory. As far as L1 cache, the keys are stored sequentially on disk so this should utilize cache a lot better than random in-memory pointers. 
> > Best Regards, > Christopher Mevissen > > On Sat, Apr 30, 2016 at 1:51 AM, Reini Urban wrote: > Ordinary BST?s are not really state of the art any more on modern CPU?s. > The overhead of the two absolute pointers trash the L1 cache, the very same problem > as with our perl5 op tree overhead. One of the reasons why perl5 is so slow. Ditto linked lists. > > I also saw you trying it with Config, and there you can easily see how my > gperf (a static perfect hash) outperforms the BST. > > https://github.com/toddr/ConfigDat vs https://github.com/perl11/p5-Config > > And the gperf hash is not always the best method, I just haven?t had enough time to finish > my Perfect::Hash module which comes up with better characteristics then gperf in some cases. > bulk88 optimized the hell out of it lately. > https://github.com/rurban/Perfect-Hash#benchmarks > > State of the art besides properly implemented hash tables (i.e. NOT perl5 hash tables) > are Van Emde Boas binary search tries, which perform much better than ordinary binary search tries, > Note: trie != tree. No right-left pointers needed. But even with the advantage of a trie, the traditional > binary search layout is not optimal anymore. > > https://en.wikipedia.org/wiki/Van_Emde_Boas_tree > > radix trees with optimizations on word-sizes (Patricia trie) also perform much better, e.g. judy or HAT-trie. > A good HAT-trie is as fast as a proper hash table, esp. for smaller sizes. > > Some links: > search for Cache Oblivious Search Tree > nice maps: > https://www.cs.utexas.edu/~pingali/CS395T/2013fa/lectures/MemoryOptimizations_2013.pdf > > > Reini Urban > rurban at cpan.org > > > > > On Apr 30, 2016, at 5:31 AM, Mev412 via Houston wrote: > > > > The conversation at the last meeting sparked my interest to implement the file-based binary search tree. > > > > https://github.com/despertargz/tree-binary-search-file/blob/master/lib/Tree/Binary/Search/File.pm > > > > Build the file: > > my $file = Tree::Binary::Search::File->new("/tmp/test-bst"); > > $file->write_file({ test => "blah" }); > > > > Reading values: > > my $file = Tree::Binary::Search::File->new("/tmp/test-bst"); > > my $value = $file->get("test"); > > > > It performed well against a file-based, linear search. You can see how the linear search doubles as the records doubles. Haven't measured to see how close to O(log n) it is, but it appears to do well. It barely flinches when going from 1 to 2 million records. > > > > Time is seconds to locate single record (worst-case-scenario) > > > > # of records, binary-search-tree, linear > > 1024, 4.48226928710938e-05, 0.000963926315307617 > > 2048, 4.31537628173828e-05, 0.00278782844543457 > > 4096, 3.38554382324219e-05, 0.00162196159362793 > > 8192, 5.07831573486328e-05, 0.0121698379516602 > > 16384, 4.60147857666016e-05, 0.0115268230438232 > > 32768, 6.58035278320312e-05, 0.0142660140991211 > > 65536, 0.000729084014892578, 0.0285739898681641 > > 131072, 0.00218009948730469, 0.0539009571075439 > > 262144, 0.00141692161560059, 0.1079261302948 > > 524288, 0.0019831657409668, 0.214764833450317 > > 1048576, 0.00240302085876465, 0.434930086135864 > > 2097152, 0.00240802764892578, 0.875269889831543 > > > > The header format is > > [left-node position][right-node position][value length][key][value] > > > > It currently uses a static key size, so it can read in the key along with the rest of the header. This takes up more disk space but should be faster than an extra read. 
If there's any natural buffering of the file though then this may not incur a performance penalty so I'll have to benchmark. > > > > This is the main search logic > > > > my $header; > > read $fh, $header, $HEADER_SIZE; > > > > my $file_key = substr($header, 12, $KEY_SIZE); > > my $val_len = unpack("V", substr($header, 8, 4)); > > my $right = unpack("V", substr($header, 4, 4)); > > my $left = unpack("V", substr($header, 0, 4)); > > > > my $comp = $key cmp $file_key; > > > > if ($comp == 0) { > > my $val; > > read $fh, $val, $val_len; > > return $val; > > } > > elsif ($comp == -1) { > > if ($left == 0) { > > return undef; > > } > > > > $self->find_key($key, $left); > > } > > else { > > if ($right == 0) { > > return undef; > > } > > > > $self->find_key($key, $right); > > } > > > > The writing of the file sorts the key/value pairs, builds a BST, while building the BST a 'flat' list of the nodes is built along with the positions of their left and right node. Recording the position of the node itself made writing the file easier, then this is fed to a method which writes each node to the file. > > > > The writing of the file is not memory-efficient as it builds the BST in memory for simplicity, though this cost is only incurred once when the file is written. If it could both insert and balance the file-based tree then this would be ideal so I'll have to look into some ways to do that. > > > > Another consideration would be storing all the values at the end of the file so the headers run sequentially. Especially if the values are longer, this could improve cache hits / buffering. > > > > It's a work-in-progress as I need to make some methods private, make key-size configurable, add documentation and tests, then I might see if I can upload to cpan. > > > > Anyways, just wanted to share. Let me know what you think. Always enjoy the talks and the technical discussions that ensue :) > > > > > > Best Regards, > > Christopher Mevissen > > > > _______________________________________________ > > Houston mailing list > > Houston at pm.org > > http://mail.pm.org/mailman/listinfo/houston > > Website: http://houston.pm.org/ > > From gwadej at anomaly.org Sat Apr 30 14:40:57 2016 From: gwadej at anomaly.org (G. Wade Johnson) Date: Sat, 30 Apr 2016 16:40:57 -0500 Subject: [pm-h] Binary Search Tree File In-Reply-To: <35DDCCA7-ED83-4330-8B7A-9B9E4A3A3230@gmail.com> References: <58F725B6-480B-43A2-B656-0D96C425AA6B@gmail.com> <35DDCCA7-ED83-4330-8B7A-9B9E4A3A3230@gmail.com> Message-ID: <20160430164057.1d16785f@cygnus> On Sat, 30 Apr 2016 19:14:24 +0200 Reini Urban via Houston wrote: > No, the 1st cache argument is to skip the 2 unnecessary left and > right pointers and use a trie instead. we do that even perl5. the 2nd > cache argument is to use the Van Emde Boas layout and not the simple > binary search trie layout. the 3rd cache argument, is to use a good > radix trie, and not a BST at all. > > if memory is a problem a bloom filter in front of it will beat > everything. my bloom filter deduper for 1G files used 100x less > memory than all the others. While doing this on disk for a real system would benefit from a more powerful data structure (I've seen B+ trees used effectively.) But, for understanding the problem, a binary tree is easy to grasp and serves as a good springboard to explore. In trying to learn the trade-offs in storing and retrieving this kind of data, there is a lot of benefit in working through the problem from this level. 
Chris, you should keep the group posted as you explore the problem and potential solutions. G. Wade > Reini Urban > rurban at cpan.org > > > > > On Apr 30, 2016, at 7:09 PM, Mev412 wrote: > > > > Can't say this was meant to be "state of the art". Todd mentioned > > the perfect hashes, but the original problem was how to lower the > > memory footprint as much as possible. So any data structure stored > > completely in memory wouldn't be ideal. This was the motivation to > > use something where searching could happen by seeking through the > > file rather than in-memory. As far as L1 cache, the keys are stored > > sequentially on disk so this should utilize cache a lot better than > > random in-memory pointers. > > > > Best Regards, > > Christopher Mevissen > > > > On Sat, Apr 30, 2016 at 1:51 AM, Reini Urban > > wrote: Ordinary BST?s are not really state > > of the art any more on modern CPU?s. The overhead of the two > > absolute pointers trash the L1 cache, the very same problem as with > > our perl5 op tree overhead. One of the reasons why perl5 is so > > slow. Ditto linked lists. > > > > I also saw you trying it with Config, and there you can easily see > > how my gperf (a static perfect hash) outperforms the BST. > > > > https://github.com/toddr/ConfigDat vs > > https://github.com/perl11/p5-Config > > > > And the gperf hash is not always the best method, I just haven?t > > had enough time to finish my Perfect::Hash module which comes up > > with better characteristics then gperf in some cases. bulk88 > > optimized the hell out of it lately. > > https://github.com/rurban/Perfect-Hash#benchmarks > > > > State of the art besides properly implemented hash tables (i.e. NOT > > perl5 hash tables) are Van Emde Boas binary search tries, which > > perform much better than ordinary binary search tries, Note: > > trie != tree. No right-left pointers needed. But even with the > > advantage of a trie, the traditional binary search layout is not > > optimal anymore. > > > > https://en.wikipedia.org/wiki/Van_Emde_Boas_tree > > > > radix trees with optimizations on word-sizes (Patricia trie) also > > perform much better, e.g. judy or HAT-trie. A good HAT-trie is as > > fast as a proper hash table, esp. for smaller sizes. > > > > Some links: > > search for Cache Oblivious Search Tree > > nice maps: > > https://www.cs.utexas.edu/~pingali/CS395T/2013fa/lectures/MemoryOptimizations_2013.pdf > > > > > > Reini Urban > > rurban at cpan.org > > > > > > > > > On Apr 30, 2016, at 5:31 AM, Mev412 via Houston > > > wrote: > > > > > > The conversation at the last meeting sparked my interest to > > > implement the file-based binary search tree. > > > > > > https://github.com/despertargz/tree-binary-search-file/blob/master/lib/Tree/Binary/Search/File.pm > > > > > > Build the file: > > > my $file = Tree::Binary::Search::File->new("/tmp/test-bst"); > > > $file->write_file({ test => "blah" }); > > > > > > Reading values: > > > my $file = Tree::Binary::Search::File->new("/tmp/test-bst"); > > > my $value = $file->get("test"); > > > > > > It performed well against a file-based, linear search. You can > > > see how the linear search doubles as the records doubles. Haven't > > > measured to see how close to O(log n) it is, but it appears to do > > > well. It barely flinches when going from 1 to 2 million records. 
> > > > > > Time is seconds to locate single record (worst-case-scenario) > > > > > > # of records, binary-search-tree, linear > > > 1024, 4.48226928710938e-05, 0.000963926315307617 > > > 2048, 4.31537628173828e-05, 0.00278782844543457 > > > 4096, 3.38554382324219e-05, 0.00162196159362793 > > > 8192, 5.07831573486328e-05, 0.0121698379516602 > > > 16384, 4.60147857666016e-05, 0.0115268230438232 > > > 32768, 6.58035278320312e-05, 0.0142660140991211 > > > 65536, 0.000729084014892578, 0.0285739898681641 > > > 131072, 0.00218009948730469, 0.0539009571075439 > > > 262144, 0.00141692161560059, 0.1079261302948 > > > 524288, 0.0019831657409668, 0.214764833450317 > > > 1048576, 0.00240302085876465, 0.434930086135864 > > > 2097152, 0.00240802764892578, 0.875269889831543 > > > > > > The header format is > > > [left-node position][right-node position][value > > > length][key][value] > > > > > > It currently uses a static key size, so it can read in the key > > > along with the rest of the header. This takes up more disk space > > > but should be faster than an extra read. If there's any natural > > > buffering of the file though then this may not incur a > > > performance penalty so I'll have to benchmark. > > > > > > This is the main search logic > > > > > > my $header; > > > read $fh, $header, $HEADER_SIZE; > > > > > > my $file_key = substr($header, 12, $KEY_SIZE); > > > my $val_len = unpack("V", substr($header, 8, 4)); > > > my $right = unpack("V", substr($header, 4, 4)); > > > my $left = unpack("V", substr($header, 0, 4)); > > > > > > my $comp = $key cmp $file_key; > > > > > > if ($comp == 0) { > > > my $val; > > > read $fh, $val, $val_len; > > > return $val; > > > } > > > elsif ($comp == -1) { > > > if ($left == 0) { > > > return undef; > > > } > > > > > > $self->find_key($key, $left); > > > } > > > else { > > > if ($right == 0) { > > > return undef; > > > } > > > > > > $self->find_key($key, $right); > > > } > > > > > > The writing of the file sorts the key/value pairs, builds a BST, > > > while building the BST a 'flat' list of the nodes is built along > > > with the positions of their left and right node. Recording the > > > position of the node itself made writing the file easier, then > > > this is fed to a method which writes each node to the file. > > > > > > The writing of the file is not memory-efficient as it builds the > > > BST in memory for simplicity, though this cost is only incurred > > > once when the file is written. If it could both insert and > > > balance the file-based tree then this would be ideal so I'll have > > > to look into some ways to do that. > > > > > > Another consideration would be storing all the values at the end > > > of the file so the headers run sequentially. Especially if the > > > values are longer, this could improve cache hits / buffering. > > > > > > It's a work-in-progress as I need to make some methods private, > > > make key-size configurable, add documentation and tests, then I > > > might see if I can upload to cpan. > > > > > > Anyways, just wanted to share. Let me know what you think. 
Always > > > enjoy the talks and the technical discussions that ensue :) > > > > > > > > > Best Regards, > > > Christopher Mevissen > > > > > > _______________________________________________ > > > Houston mailing list > > > Houston at pm.org > > > http://mail.pm.org/mailman/listinfo/houston > > > Website: http://houston.pm.org/ > > > > > > _______________________________________________ > Houston mailing list > Houston at pm.org > http://mail.pm.org/mailman/listinfo/houston > Website: http://houston.pm.org/ -- Fortune knocks but once, but misfortune has much more patience. -- Laurence J. Peter From estrabd at gmail.com Sat Apr 30 15:38:49 2016 From: estrabd at gmail.com (B. Estrade) Date: Sat, 30 Apr 2016 17:38:49 -0500 Subject: [pm-h] Binary Search Tree File In-Reply-To: <20160430164057.1d16785f@cygnus> References: <58F725B6-480B-43A2-B656-0D96C425AA6B@gmail.com> <35DDCCA7-ED83-4330-8B7A-9B9E4A3A3230@gmail.com> <20160430164057.1d16785f@cygnus> Message-ID: http://www.cc.gatech.edu/~bader/COURSES/UNM/ece637-Fall2003/papers/LFN02.pdf On Sat, Apr 30, 2016 at 4:40 PM, G. Wade Johnson via Houston wrote: > On Sat, 30 Apr 2016 19:14:24 +0200 > Reini Urban via Houston wrote: > > > No, the 1st cache argument is to skip the 2 unnecessary left and > > right pointers and use a trie instead. we do that even perl5. the 2nd > > cache argument is to use the Van Emde Boas layout and not the simple > > binary search trie layout. the 3rd cache argument, is to use a good > > radix trie, and not a BST at all. > > > > if memory is a problem a bloom filter in front of it will beat > > everything. my bloom filter deduper for 1G files used 100x less > > memory than all the others. > > While doing this on disk for a real system would benefit from a more > powerful data structure (I've seen B+ trees used effectively.) But, for > understanding the problem, a binary tree is easy to grasp and serves as > a good springboard to explore. > > In trying to learn the trade-offs in storing and retrieving this kind of > data, there is a lot of benefit in working through the problem from > this level. > > Chris, you should keep the group posted as you explore the problem and > potential solutions. > > G. Wade > > > Reini Urban > > rurban at cpan.org > > > > > > > > > On Apr 30, 2016, at 7:09 PM, Mev412 wrote: > > > > > > Can't say this was meant to be "state of the art". Todd mentioned > > > the perfect hashes, but the original problem was how to lower the > > > memory footprint as much as possible. So any data structure stored > > > completely in memory wouldn't be ideal. This was the motivation to > > > use something where searching could happen by seeking through the > > > file rather than in-memory. As far as L1 cache, the keys are stored > > > sequentially on disk so this should utilize cache a lot better than > > > random in-memory pointers. > > > > > > Best Regards, > > > Christopher Mevissen > > > > > > On Sat, Apr 30, 2016 at 1:51 AM, Reini Urban > > > wrote: Ordinary BST?s are not really state > > > of the art any more on modern CPU?s. The overhead of the two > > > absolute pointers trash the L1 cache, the very same problem as with > > > our perl5 op tree overhead. One of the reasons why perl5 is so > > > slow. Ditto linked lists. > > > > > > I also saw you trying it with Config, and there you can easily see > > > how my gperf (a static perfect hash) outperforms the BST. 
> > > > > > https://github.com/toddr/ConfigDat vs > > > https://github.com/perl11/p5-Config > > > > > > And the gperf hash is not always the best method, I just haven?t > > > had enough time to finish my Perfect::Hash module which comes up > > > with better characteristics then gperf in some cases. bulk88 > > > optimized the hell out of it lately. > > > https://github.com/rurban/Perfect-Hash#benchmarks > > > > > > State of the art besides properly implemented hash tables (i.e. NOT > > > perl5 hash tables) are Van Emde Boas binary search tries, which > > > perform much better than ordinary binary search tries, Note: > > > trie != tree. No right-left pointers needed. But even with the > > > advantage of a trie, the traditional binary search layout is not > > > optimal anymore. > > > > > > https://en.wikipedia.org/wiki/Van_Emde_Boas_tree > > > > > > radix trees with optimizations on word-sizes (Patricia trie) also > > > perform much better, e.g. judy or HAT-trie. A good HAT-trie is as > > > fast as a proper hash table, esp. for smaller sizes. > > > > > > Some links: > > > search for Cache Oblivious Search Tree > > > nice maps: > > > > https://www.cs.utexas.edu/~pingali/CS395T/2013fa/lectures/MemoryOptimizations_2013.pdf > > > > > > > > > Reini Urban > > > rurban at cpan.org > > > > > > > > > > > > > On Apr 30, 2016, at 5:31 AM, Mev412 via Houston > > > > wrote: > > > > > > > > The conversation at the last meeting sparked my interest to > > > > implement the file-based binary search tree. > > > > > > > > > https://github.com/despertargz/tree-binary-search-file/blob/master/lib/Tree/Binary/Search/File.pm > > > > > > > > Build the file: > > > > my $file = Tree::Binary::Search::File->new("/tmp/test-bst"); > > > > $file->write_file({ test => "blah" }); > > > > > > > > Reading values: > > > > my $file = Tree::Binary::Search::File->new("/tmp/test-bst"); > > > > my $value = $file->get("test"); > > > > > > > > It performed well against a file-based, linear search. You can > > > > see how the linear search doubles as the records doubles. Haven't > > > > measured to see how close to O(log n) it is, but it appears to do > > > > well. It barely flinches when going from 1 to 2 million records. > > > > > > > > Time is seconds to locate single record (worst-case-scenario) > > > > > > > > # of records, binary-search-tree, linear > > > > 1024, 4.48226928710938e-05, 0.000963926315307617 > > > > 2048, 4.31537628173828e-05, 0.00278782844543457 > > > > 4096, 3.38554382324219e-05, 0.00162196159362793 > > > > 8192, 5.07831573486328e-05, 0.0121698379516602 > > > > 16384, 4.60147857666016e-05, 0.0115268230438232 > > > > 32768, 6.58035278320312e-05, 0.0142660140991211 > > > > 65536, 0.000729084014892578, 0.0285739898681641 > > > > 131072, 0.00218009948730469, 0.0539009571075439 > > > > 262144, 0.00141692161560059, 0.1079261302948 > > > > 524288, 0.0019831657409668, 0.214764833450317 > > > > 1048576, 0.00240302085876465, 0.434930086135864 > > > > 2097152, 0.00240802764892578, 0.875269889831543 > > > > > > > > The header format is > > > > [left-node position][right-node position][value > > > > length][key][value] > > > > > > > > It currently uses a static key size, so it can read in the key > > > > along with the rest of the header. This takes up more disk space > > > > but should be faster than an extra read. If there's any natural > > > > buffering of the file though then this may not incur a > > > > performance penalty so I'll have to benchmark. 
> > > > > > > > This is the main search logic > > > > > > > > my $header; > > > > read $fh, $header, $HEADER_SIZE; > > > > > > > > my $file_key = substr($header, 12, $KEY_SIZE); > > > > my $val_len = unpack("V", substr($header, 8, 4)); > > > > my $right = unpack("V", substr($header, 4, 4)); > > > > my $left = unpack("V", substr($header, 0, 4)); > > > > > > > > my $comp = $key cmp $file_key; > > > > > > > > if ($comp == 0) { > > > > my $val; > > > > read $fh, $val, $val_len; > > > > return $val; > > > > } > > > > elsif ($comp == -1) { > > > > if ($left == 0) { > > > > return undef; > > > > } > > > > > > > > $self->find_key($key, $left); > > > > } > > > > else { > > > > if ($right == 0) { > > > > return undef; > > > > } > > > > > > > > $self->find_key($key, $right); > > > > } > > > > > > > > The writing of the file sorts the key/value pairs, builds a BST, > > > > while building the BST a 'flat' list of the nodes is built along > > > > with the positions of their left and right node. Recording the > > > > position of the node itself made writing the file easier, then > > > > this is fed to a method which writes each node to the file. > > > > > > > > The writing of the file is not memory-efficient as it builds the > > > > BST in memory for simplicity, though this cost is only incurred > > > > once when the file is written. If it could both insert and > > > > balance the file-based tree then this would be ideal so I'll have > > > > to look into some ways to do that. > > > > > > > > Another consideration would be storing all the values at the end > > > > of the file so the headers run sequentially. Especially if the > > > > values are longer, this could improve cache hits / buffering. > > > > > > > > It's a work-in-progress as I need to make some methods private, > > > > make key-size configurable, add documentation and tests, then I > > > > might see if I can upload to cpan. > > > > > > > > Anyways, just wanted to share. Let me know what you think. Always > > > > enjoy the talks and the technical discussions that ensue :) > > > > > > > > > > > > Best Regards, > > > > Christopher Mevissen > > > > > > > > _______________________________________________ > > > > Houston mailing list > > > > Houston at pm.org > > > > http://mail.pm.org/mailman/listinfo/houston > > > > Website: http://houston.pm.org/ > > > > > > > > > > _______________________________________________ > > Houston mailing list > > Houston at pm.org > > http://mail.pm.org/mailman/listinfo/houston > > Website: http://houston.pm.org/ > > -- > Fortune knocks but once, but misfortune has much more patience. > -- Laurence J. Peter > _______________________________________________ > Houston mailing list > Houston at pm.org > http://mail.pm.org/mailman/listinfo/houston > Website: http://houston.pm.org/ -------------- next part -------------- An HTML attachment was scrubbed... URL: