[Phoenix-pm] meeting tonight!

Scott Walters scott at illogics.org
Thu Mar 16 22:35:33 PST 2006


Hi,

To everyone who made it out, good to see you.  Thanks for coming.

Here's the handout, attached.

Now, who else has an interesting algorithm, trick, application of Perl,
module, technique, language, or project they'd be willing to talk about?

Regards,
-scott

On  0, Brock <awwaiid at thelackthereof.org> wrote:
> In case anyone gets this, I forgot to mention that it is probably best
> to park on 3rd st, either in a metered parking (free after 6) or in the
> garage (free 2hr validation).
> 
> The only trick is that the link between 3rd St. and Mill is closed, so
> you must approach the intersection from Ash St. You can see the road of
> which I speak on the map.
> 
> Here's the meeting info, in case you forgot:
> 
>      Time: Thursday 16 March 2006 @7:00pm
>  Location: Mill's End
>            http://maps.google.com/maps?q=310+S+Mill+Ave+tempe,+az
>            310 S Mill Ave # A101
>            Tempe, AZ 85281
>            Thats the North-West corner of Mill and 3rd (north of
>            University) in down-town Tempe.
>     Topic: Search Engine (scott)
>            Catalyst (Michael Garfias, unconfirmed)
>            As always, random discussion
>     Other: Wireless internet available. Bring your laptops.
>            Mill's End sells drinks and food
>            Presentations will be given and recorded over VNC
> _______________________________________________
> Phoenix-pm mailing list
> Phoenix-pm at pm.org
> http://mail.pm.org/mailman/listinfo/phoenix-pm
-------------- next part --------------

=for comment

* Patricia trees translate trees into positions in stored documents and 
  document names (file names, URLs, whatever)
* I hope you can read code better than diagrams

=cut

use strict; use warnings;

my $tree = { };

sub search {
    my $string = shift;
    my $node = $tree;
    for my $chr (split //, $string) {
        exists $node->{$chr} or return undef; # not found condition
        $node = $node->{$chr};
    }
    return $node->{hits};
}

=for comment

* The data structure is a tree, where each node has edges corresponding to 
  letters and other characters, and all nodes except the root node also have
  a list of "hit" documents and positions in those documents

* This implementation lacks a feature of searching up multiple words,
  but that's easy enough to add

* The decision of the implementer is how deep to make the tree -- to support
  searching for 40 character phrases, the tree must go 40 characters deep

=cut

sub index_document {
    my $doc = shift;
    my $doc_name = shift;
    my @chars = split //, $doc;
    my $position = 0;
    while(1) {
        return if @chars == 6;
        my $node = $tree;
        my $pos2 = $position;
        for my $chr ( @chars[0 .. (@chars > 6 ? 6 : scalar @chars)] ) {
            $node->{$chr} ||= { hits => [], };
            $node = $node->{$chr};
            push @{ $node->{hits} }, [ $doc_name, $position, ];
            $pos2++;
        }
    } continue {
        shift @chars;
        $position++;
    }
    
}

=for comment

* The index isn't based on merely words or substrings, so entire phrases
  can be searched for

* Google is able to quickly track down excerpts for each hit word in each
  document because its pattrees point to the exact byte in the exact document
  for each word

* Google is able to find ranges, such as when you Google for 1000..2000, 
  by following the pattree so far and then grabbing all of the nodes 
  under that point 

* Postgres defaults to using binary tables for lookups --
  this is an adequate simulation of a pattree if you ignore the overhead
  of storing repeated string fragments

=cut

sub pretty_results {
    my @hits = @{ shift() };
    my $buf = shift;
    for my $hit (@hits) {
        print $hit->[0], ' ', $hit->[1], ' ', substr($buf, $hit->[1], 20), "\n";
    } 
}

do {
    # the pudding
    open my $fh, '<', __FILE__;
    read $fh, my $buf, -s $fh;
    index_document($buf, __FILE__);
    my $hits = search('Google');
    pretty_results($hits, $buf);
};


More information about the Phoenix-pm mailing list