[pgh-pm] Algorithms for finding words that are bases of others

Mon Apr 14 17:28:48 CDT 2003

On Thu, 2003-04-10 at 12:18, Casey West wrote:
> So David Hand was asking about ways to find all the words ending in
> 'ow' that, when the 'ow' is removed forms another word. [...]

This is a fun problem.

> In our conversation, we concluded that after an 'ow' word is found
> and a match is looked for, the words we have stored can be thrown
> away [...]  We were wrong, [...]  So all the words in the dictionary
> must be stored, by the time we get to the end.

You don't need to store all of the words.  Assuming that the input
dictionary is sorted, when you are examining words that start with "a"
and then you encounter the first word that starts with "b", you can
throw away the "a"-words you have remembered.  It is impossible for a
"b"-word that ends in an "ow" suffix to become an "a"-word when its
suffix is removed.  Going further, you can throw away all of your
"ab"-words when you encounter the first "ac"-word, and so on for
longer prefixes.

Thus, an efficient method for solving this kind of problem is to
construct a digital search tree that shares common prefixes.  For
example, we can imagine the following input:

    a
    ab
    abe
    at
    attempt

as a set of pathnames:

    a/@
    a/b/@
    a/b/e/@
    a/t/@
    a/t/t/e/m/p/t/@

where "@" represents the end of a word.  The pathnames correspond to
the following directory structure, which is how we will store
dictionaries in memory:

    a-+-@
      |
      |-b-+-@
      |   |
      |   `-e---@
      |
      `-t-+-@
          |
           `t---e---m---p---t---@

Notice how this representation lets us share common prefixes.  Even
though all five words start with "a", we only use one "a" directory
node at the root to represent this common prefix.  Likewise, both of
the "ab"-words share a common "b" node under the shared root node.
This is a big improvement over storing all of the words in a
dictionary.  Further, it allows for some cool optimizations,
which I'll talk about later.

Using search trees like this, it is fast and easy to solve our
original problem.  What we do is create search trees for our input
dictionary.  Then we search the trees for "@"-nodes, which mark the
ends of legal words.  When we find the end of a word, we check the
last few letters of its pathname to see if it matches our target
suffix (e.g., "ow").  If it matches, we climb back up the tree to the
ancestral directory node *before* the start of the suffix.  This
effectively chops off the suffix.  Then we check to see if the
ancestral node contains an "@"-node.  If it does, we know that
removing the suffix from the original word results in another legal
word, and we can output the original word as a "hit."  We continue
like this until we have exhausted our search trees.

> This, being a two step process, was dismissed out-of-hand in favor
> of the singular retroactive method.

At first, the search-tree method also appears to be a two-step
process:

    (1) Build search trees from the dictionary
    (2) Search the trees for hits

But we can be clever here.

Observing that the order in which the trees are built in step (1) is
the same as they are consumed in (2), we can build *and* consume the
trees in one pass.  Not only can we throw out the "a" search tree when
we start processing "b" words, but also we can throw out each of the
subtrees within the "a" tree when we advance to the next.  The same
goes for subsubtrees, subsubsubtrees, and so on.  In other words, we
need to keep in memory only that small depth-first search path that we
are actively considering!  This smashes our total memory consumption
down to a small constant proportional to the depth of the tree.
Further, we can reduce the storage even more by collapsing
non-terminal (non-@) nodes, since we never care about them during our
search.

It all boils down to this:

    #!/usr/bin/perl -l

    my ($sfx, @stack) = (shift().'$', "");
    while (defined($_ = <>)) {
        chomp;
        -- $#stack until $stack[-1] eq substr($_, 0, length $stack[-1]);
        if (((my $word_less_sfx = $_) =~ s/$sfx//o)) {
            print if grep({$_ eq $word_less_sfx}, at stack) && $_ ne $sfx;
        }
        push @stack, $_;
    }

Example:

    $ ./suffix-words.pl ow < words-lc
    allow
    barrow
    bellow
    bestow
    billow
    burrow
    endow
    fallow
    fellow
    hallow
    meadow
    pillow
    shallow
    tallow
    wallow
    willow
    window
    yellow

Cheers,
Tom