[pgh-pm] Algorithms for finding words that are bases of others

Casey West casey at geeknest.com
Thu Apr 10 11:18:53 CDT 2003


So David Hand was asking about ways to find all the words ending in
'ow' that, when the 'ow' is removed forms another word.  His initial
approach, which I have labeled 'retroactive', goes through the
dictionary storing each word it finds.  When a word ends in 'ow', he
examines the past list of words for one matching $word-'ow'.

In our conversation, we concluded that after an 'ow' word is found and
a match is looked for, the words we have stored can be thrown away
because we assumed that these sequences would be 'in order'.  For
instance, 'ball' and 'ballow'.  We were wrong, because wrapping that
is 'b', and 'bow'.  So all the words in the dictionary must be stored,
by the time we get to the end.

An approach that I came up with was a dictionary algorithm, basically
storing every entry in a hash (where the value is meaningless, but
lookups are faster) and then looking for matches.  This being a two
step process was dismissed out-of-hand in favor of the singular
retroactive method.

A third, mostly silly approach was a single regular expression that
returned a list of matches.

This was bothering me a bit last night and I decided to figure out
which was best.  I tried to profile my code as best I could, and take
out anything that might increase overhead.  You'll note that I
couldn't run the regex.  See, it works, I tested it on small data and
I find it rather neat, but it never finished one iteration, I gave up
around twelve minutes into the runtime.

Benchmark: timing 5 iterations of directory, retroactive...
 directory: 36 wallclock secs (31.01 usr +  0.00 sys = 31.01 CPU) @ 0.16/s (n=5)
retroactive: 313 wallclock secs (266.55 usr +  0.00 sys = 266.55 CPU) @  0.02/s (n=5)
            s/iter retroactive   directory
retroactive   53.3          --        -88%
directory     6.20        760%          --

And for your pleasure, what follows is the code:

#!/usr/bin/env perl
use strict;
use warnings;

=pod

Find words ending in C<$SUFFIX> that
are also words without C<$SUFFIX>.

=cut

my $SUFFIX  =  q[ow];
my $DICT    =  q[/usr/share/dict/words];

use Benchmark qw[cmpthese];

cmpthese 5, {
             retroactive => \&retroactive,
             directory   => \&directory,
#            regex       => \&regex,
};

regex();

sub retroactive {
  my @words = ();
  my @found = ();

  open  DICT, $DICT or die $!;
  while ( <DICT> ) {
    chomp;
    $_ = lc;
    if ( substr( $_, -2 ) eq $SUFFIX ) {
      my $root
        = substr $_, 0, length( $SUFFIX ) * -1;
      if ( grep { $_ eq $root } reverse @words ) {
        push @found, $_;
      }
    }
    push @words, $_;
  }
  close DICT;
# print "$_\n" foreach @found;
}

sub directory {
  my %dir   = ();
  my @found = ();
  open  DICT, $DICT or die $!;
  %dir = map { chomp; lc $_ => undef } <DICT>;
  close DICT;

  foreach ( keys %dir ) {
    push @found, "$_$SUFFIX"
      if exists $dir{"$_$SUFFIX"};
  }
# print "$_\n" foreach @found;
}

sub regex {
  open  DICT, $DICT or die $!;
  my $string = join "|", map { chomp; lc $_ } <DICT>;
  close DICT;

  study $string;
  my (@found)
    = ( $string =~ /\b(\w+)\b(?=.+\b\1$SUFFIX\b)/gso );

# print "$_\n" foreach @found;
}

__END__

  Casey West

-- 
Shooting yourself in the foot with dBase 
You buy a gun. Bullets are only available from another company and are
promised to work so you buy them. Then you find out that the next
version of the gun is the one scheduled to actually shoot bullets. 




More information about the pgh-pm mailing list