[pm-h] Binary Search Tree File

Mev412 mev412 at gmail.com
Fri Apr 29 20:31:16 PDT 2016


The conversation at the last meeting sparked my interest to implement the
file-based binary search tree.

https://github.com/despertargz/tree-binary-search-file/blob/master/lib/Tree/Binary/Search/File.pm

Build the file:
my $file = Tree::Binary::Search::File->new("/tmp/test-bst");
$file->write_file({ test => "blah" });

Reading values:
my $file = Tree::Binary::Search::File->new("/tmp/test-bst");
my $value = $file->get("test");

It performed well against a file-based, linear search. You can see how the
linear search doubles as the records doubles. Haven't measured to see how
close to O(log n) it is, but it appears to do well. It barely flinches when
going from 1 to 2 million records.

Time is seconds to locate single record (worst-case-scenario)

# of records, binary-search-tree, linear
1024, 4.48226928710938e-05, 0.000963926315307617
2048, 4.31537628173828e-05, 0.00278782844543457
4096, 3.38554382324219e-05, 0.00162196159362793
8192, 5.07831573486328e-05, 0.0121698379516602
16384, 4.60147857666016e-05, 0.0115268230438232
32768, 6.58035278320312e-05, 0.0142660140991211
65536, 0.000729084014892578, 0.0285739898681641
131072, 0.00218009948730469, 0.0539009571075439
262144, 0.00141692161560059, 0.1079261302948
524288, 0.0019831657409668, 0.214764833450317
1048576, 0.00240302085876465, 0.434930086135864
2097152, 0.00240802764892578, 0.875269889831543

The header format is
[left-node position][right-node position][value length][key][value]

It currently uses a static key size, so it can read in the key along with
the rest of the header. This takes up more disk space but should be faster
than an extra read. If there's any natural buffering of the file though
then this may not incur a performance penalty so I'll have to benchmark.

This is the main search logic

    my $header;
    read $fh, $header, $HEADER_SIZE;

    my $file_key = substr($header, 12, $KEY_SIZE);
    my $val_len  = unpack("V", substr($header, 8, 4));
    my $right    = unpack("V", substr($header, 4, 4));
    my $left     = unpack("V", substr($header, 0, 4));

    my $comp = $key cmp $file_key;

    if ($comp == 0) {
        my $val;
        read $fh, $val, $val_len;
        return $val;
    }
    elsif ($comp == -1) {
        if ($left == 0) {
            return undef;
        }

        $self->find_key($key, $left);
    }
    else {
        if ($right == 0) {
            return undef;
        }

        $self->find_key($key, $right);
    }

The writing of the file sorts the key/value pairs, builds a BST, while
building the BST a 'flat' list of the nodes is built along with the
positions of their left and right node. Recording the position of the node
itself made writing the file easier, then this is fed to a method which
writes each node to the file.

The writing of the file is not memory-efficient as it builds the BST in
memory for simplicity, though this cost is only incurred once when the file
is written. If it could both insert and balance the file-based tree then
this would be ideal so I'll have to look into some ways to do that.

Another consideration would be storing all the values at the end of the
file so the headers run sequentially. Especially if the values are longer,
this could improve cache hits / buffering.

It's a work-in-progress as I need to make some methods private, make
key-size configurable, add documentation and tests, then I might see if I
can upload to cpan.

Anyways, just wanted to share. Let me know what you think. Always enjoy the
talks and the technical discussions that ensue :)


Best Regards,
Christopher Mevissen
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.pm.org/pipermail/houston/attachments/20160429/ec3c5f0d/attachment.html>


More information about the Houston mailing list