[Kc] Solutions and Discussion for Perl Quiz of the Week #22 (Expert Ed ition)

Garrett Goebel garrett at scriptpro.com
Thu Sep 9 08:42:06 CDT 2004


Sample solutions and discussion
Perl Quiz of The Week #22 (20040825)

        Write a program, 'wordladder', which gets two arguments, which are
        words of the same length, and which constructs and prints a "word
        ladder" from the first word to the second word.
        
        A word ladder from word AAA to word BBB is a sequence of dictionary
        words such that:
        
        1. the first word in the sequence is word AAA
        2. each word in the sequence after the first differs from the
previous
           word in exactly one letter position
        3. the last word in the sequence is word BBB
        
        For example, given the two words "love" and "hate", the program
might
        print the word ladder:
        
                love
                hove
                have
                hate
        
        Or it might print:
        
                love
                lave
                have
                hate
        
        It might also print a longer word ladder, such as
        
                love
                lore
                lobe
                robe
                role
                rose
                lose
                lost
                most
                mosh
                moth
                math
                hath
                hate
        
        If the program is unable to find a word ladder, it should print an
        appropriate error message to the standard error, and exit with a
        failure status.
        
        The program should also accept an optional third argument, which, if
        specified, is the name of a dictionary file which contains the
        permissible words.  If the third argument is omitted, the program
        should use a default dictionary.
        
        Sample word lists are available from
                http://perl.plover.com/qotw/words/
        
---------------------------------------------------

[ MJD: This week's report is a collaboration between three people.
       Ron Isaacson tested and timed the many solutions and wrote up a
       timing report.  Daniel Martin read over the solutions and
       analyzed the algorithms used.  I selected one sample solution
       and wrote up an explanation of how it works. My very grateful
       thanks to Prs. Isaacson and Martin for bailing me out of this
       one when I bit of far more than I could chew.  I now turn you
       over to Pr. Martin. ]

[Daniel Martin]

This was a popular quiz, judging by the number of solutions.  23 solutions
were submitted from 21 different people.  Three solutions weren't tested for
one reason or another, leaving 21 solutions to be tested and timed.  Many
thanks to Ron Isaacson for doing the time-testing.

The overall algorithm used by each solution can be classified as
follows:

- "Expanding circles":

  1) Start with just the word AAA as your working set.  

  2) Find every word which is only one letter different from something
     in the working set and that hasn't been seen before and make that
     your new working set.

  3) Repeat (2) until BBB is in the working set.

  This solution can be viewed as Dijkstra's shortest-path algorithm
  applied to the specifics of this problem.
  (see http://en.wikipedia.org/wiki/Dijkstra's_algorithm)

  A variant of this has the circles expanding from both AAA and BBB,
  and stopping when the two sets intersect.

  One solution, from Rod Adams, computed each subsequent expanding
  circle by gathering the words that might be in the next working set
  and then eliminating those that were not adjacent to something in
  the previous working set.  (It then switched to the more standard
  version based on an estimate as to which method would be faster)

- "workqueue":
  1) Place AAA in a queue.
  2) Pull a word from the queue and push onto the queue all the
     adjacent words we haven't seen before.
  3) Repeat (2) until BBB is pulled from the queue. (Variation:
     until BBB is about to be pushed onto the queue)

  This solution can be viewed as the A* algorithm applied to
  the specifics of this problem with no estimate of future
  path length.  (see
  http://en.wikipedia.org/wiki/A-star_search_algorithm)

  A few people extended this by using a full-blown priority
  queue, with the priority of a word being
  (distance from w to BBB) + (length of path to w from AAA)

  This is the A* algorithm with a distance estimate.

  As before, there were also bi-directional variants that attacked
  the path from both ends, stopping when the words seen intersected.

  Yes, sometimes the classification between "workqueue" and "expanding
  circles" was a bit of a judgement call about how the code was
  organized overall.  In general if 

- "depth first":
  1) Start at AAA.
  2) Pick an adjacent word that hasn't been visited yet.
  3) Repeat (2) until you reach BBB or visit everything.
     If there are no available adjacent words, backtrack.

  Only two solutions attempted this.  Both also sorted the
  adjacent words so as to first try words closer to BBB.

After overall algorithm choice, there were still several other design
decisions that could be made.  A common source of variation was in how the
dictionary was stored.  Most solutions used a variation on this
design:  (code adapted from Colin Meyer's solution)

      open DICT, '<', $DICT or die "can't open $DICT: $!";
      @words = grep length( $_ ) == $SIZE,
               map { chomp; lc } <DICT>;
      close DICT;

      for my $w ( @words ) {
        my @rungs = map { my $c = $w; 
                          substr( $c, $_, 1 ) = '_';
                          $c } 0..($SIZE-1);;
        push @{ $rung2word{ $_ } }, $w for @rungs;
      }

That is, stored the dictionary as a hash of pattern => wordlist, where
"pattern" is the word in question with one character replaced by an '_' (or
'?' or '.' or '\0') and "wordlist" is a reference to all words with that
pattern.  Occasionally, the map word => patternlist would also be recorded.

Storing the words in a hash that mapped each word to 1 (or some other
value) was also popular.  (So that keys %words was used whenever the full
list of words was needed)

A few stored the dictionary in a single giant array, searching through it
with regexps or with Inline::C functions when finding adjacent words.

There were also some non-perl solutions submitted: 1 in PIR (Parrot
Intermediate Representation - assembly language for Perl 6, sort of),
2 in python, and 1 in scheme.  Unfortunately, I have no PIR interpreter and
neither of my scheme interpreters was able to handle the submitted scheme
solution (guile chokes on the variable name with asterisks in it and
mzscheme doesn't know about 'sort').  The python solutions were tested along
with the others.

Several of the solutions failed to check for certain edge cases.  For
example, if AAA == BBB, 4 solutions found no path.  3 more solutions found a
path only if AAA was adjacent to something (so testing on
"transubstantiationalist transubstantiationalist" showed no path for those,
but testing "love love" did find a path)  Also, some solutions had trouble
with very large dictionaries or with long paths.

At least one solution made the erroneous assumption that if AAA and BBB
differ only in positions 1, 3, and 5, then all the words along the path from
AAA to BBB will differ from each other only in positions 1, 3, and 5.  This
is shown false by the path from "axal" to "utah" in the Web2 dictionary,
which among other words passes through "itch".

Because of the large number of solutions, I won't go through each one and
list the classification.  Instead, this chart might be useful:

            dictionary ->|sig>wordlist|word>1|@DICT
        search method    |            |      |
        ==================================================
        Expanding circles|     0      |  1   |   2
          2-dir version  |     5      |  0   |   1
        --------------------------------------------------
        workqueue        |     3      |  2   |   0
          + distance ver |     2      |  0   |   0
          2-dir version  |     0      |  1   |   0
        --------------------------------------------------
        depth first      |     0      |  2   |   1

I didn't classify the scheme solution from Greg Bacon, nor the PIR solution
from Ingo Blechschmidt.  I also did not classify the perl solution sent in
by David B., since I was unable to puzzle it out.  (I was also unable to
test it - it's doing something that eats memory like crazy.)

I will note that in general, the depth first solutions all performed very
poorly on this quiz, despite two of them coding crucial parts in C.  The
overall fastest solution was mine, the 2-direction workqueue, though not by
much.

Ron Isaacson's beautifully-formatted timing report is available at:

        http://perl.plover.com/qotw/misc/e022/report.html

The raw timing data, and the submitted programs, are available at

        http://perl.plover.com/qotw/misc/e022/

Although the problem merely specified _a_ path from AAA to BBB, many
solutions found the shortest path.  Therefore, some people speculated on
what the "longest shortest path" might be for different word lengths in the
different dictionaries.  The results can be found in the thread starting
here:

  http://perl.plover.com/~alias/list.cgi?1:mss:2054

----------------------------------------------------------------

[Mark Dominus]

I've decided to use Zed Lopez's program as this week's sample solution.
According to Ron Isaacson's report, it's consistently one of the very
fastest sumbissions, and it's much shorter than the other comparably fast
solutions.

Part of that may be due to Zed's use of the Tree::Simple module, which
provides functions that manage tree structures:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Tree::Simple;

And part of it may be due to a somewhat excessively terse coding style, as
seen here:

        my ($start_word, $destination_word, $dictionary_file) = @ARGV;
        die "You must specify two words of equal length" unless defined
        $start_word and defined $destination_word and length $start_word ==
        length $destination_word;

But on the whole I found the program straightforward and easy to understand.

In Daniel Martin's classification, this was an "expanding circles"
solution, of the "two directions at once variety".  Pr. Lopez's program
maintains two working sets of words, and operates on them alternately.  On
each pass, it finds all the words that are one step removed from the words
in the current working set, and replaces the current working set with this
new set of words.  When it finds a word that is already in the *other*
working set, it knows it has found a complete path.

        $dictionary_file ||= "/usr/share/dict/web2";
        open (FH, $dictionary_file) or die "Couldn't open $dictionary_file:
$!";

        $start_word = lc $start_word;
        $destination_word = lc $destination_word;

        my %dict;
        while (<FH>) {
            chomp;
            $dict{lc $_} = 1 if length $_ == length $start_word and $_ !~
/[^a-zA-Z]/;
        }
        close(FH);
        die "$start_word is not in dictionary" unless exists
$dict{$start_word};
        die "$destination_word is not in dictionary" unless exists
        $dict{$destination_word};

The dictionary is the usual Perl implementation of a set, with words as
keys, and the values always 1.  

This code handles the special case if the start word and the end word being
the same.  Had I thought about it in advance, I might have ruled this out
when I posed the problem:

        print "$start_word\n" and exit if $start_word eq $destination_word;

$list and $next_nodes are the main data structures of the program:

        my ($list, $next_nodes);
        {
            my $top = Tree::Simple->new($start_word);
            my $bottom = Tree::Simple->new($destination_word);

            $list = [{$start_word => $top}, {$destination_word => $bottom}];
            $next_nodes = [[$top],[$bottom]];
        }

The Tree::Simple objects keep track of tree structures whose roots are
labeled with the start and end words; each tree node is labeled with another
word.  If BBB is in the start-word tree, then a word ladder from the start
word to BBB can be found by tracing the path from BBB back to the root.  The
program's wordchain() function does this:

        sub word_chain {
            my $node = shift;
            my @words;
            while (1) {
                push @words, $node->getNodeValue();
                last if $node->isRoot;
                $node = $node->getParent;
            }
            return @words;
        }

(->getNodeValue returns the word with which the node is labeled.)

$list contains exactly two hashes, one working 'down' from the source word
and the other working 'up' from the destination word.  Each hash maps words
to the Tree::Simple objects that represent them.

$next_nodes contains the two 'working sets'.

The main loop of the program is:

        for (my $x = 0; ; $x = !$x) {
            $next_nodes->[$x] = find_next_nodes($next_nodes->[$x], $x);
        }

The '$x = !$x' is a little obscure; he really means '$x = 1 - $x'
here.  The main loop alternates between the two working sets, replacing the
old working set with the set of words adjacent to it.
Calculating this new set of words is the job of find_next_nodes():

        sub find_next_nodes {
            my ($nodes, $x) = @_;
            my @next_nodes;
            for my $node (@$nodes) {
                my $orig_word = $node->getNodeValue();
                for (my $i = 0; $i < length $orig_word; $i++) {
                    my $word = $orig_word;
                    for my $char ('a'..'z') {
                        next if $char eq substr $orig_word, $i, 1;
                        substr $word, $i, 1, $char;

This last line uses the four-argument form of substr(), which is newish.  It
was introduced by analogy with the four-argument form of splice().
substr($word, $i, 1, $char) locates the length-1 substring of $word at
position $i, just like substr($word, $i, 1), but then replaces this
substring with the contents of $char.  It is just like

>MJD>   substr($word, $i, 1) = $char;

only faster.

The program has just calculated a new word, $word, that is one step removed
from some word $orig_word that was in the working set.  If this new word is
already in the *other* working set, then a complete path has been found:


        # if word is in other list, we're done

                        success($x ? ($list->[!$x]->{$word}, $node) :
($node,
        $list->[!$x]->{$word})) if exists $list->[!$x]->{$word};

I would have formatted this differently.  I think it's clearer like
this:

>MJD>                   success($x ? ($list->[1-$x]->{$word}, $node) 
>MJD>                              : ($node, $list->[1-$x]->{$word}))
>MJD>                       if exists $list->[1-$x]->{$word};

The success() function, which we'll see shortly, is responsible for tracing
the paths from the common word back to the roots of the two trees, and
generating the resulting word ladder.  Its arguments are the nodes of the
two trees that have the same label, with the node from the top-down tree
first.  If $x is true (that is, 1) then the current working set is the
bottom-up one, so needs to be put second in the argument list of success().

If the program hasn't been succesful, it extends the current tree structure
and the new working set:

                        if (exists $dict{$word} and !exists
$list->[$x]->{$word}) {
                            my $child = Tree::Simple->new($word);
                            $node->addChild($child); 
                            push @next_nodes, $child;
                            $list->[$x]->{$word} = $child;
                        }
                    }
                }

If there were no new words that could be reached from the current node, it
is a dead end and is deleted from the tree structure to save
memory:

                prune($node) if $node->isLeaf;
            }
            return \@next_nodes;
        }


Here's prune(), which deletes the dead end.  After the dead-end node is
deleted, prune() checks to see if its parent is now a dead end also:

        sub prune {
            my $node = shift;
            while ($node->isLeaf) {
                die "No path exists" if $node->isRoot;
                my $parent = $node->getParent;
                $parent->removeChild($node);
                $node = $parent;
            }
        }


The success() function was called when two nodes were found with the same
label, one in each of thee trees.  It uses word_chain() to find the word
ladders from the common word up to the source word (at $node1) and down to
the destination word (at $node2) and appends the two ladders together:

        sub success {
            my ($node1, $node2) = @_;
            print join "\n", (reverse word_chain($node1)),
word_chain($node2), '';
            exit;
        }

The function avoids listing the common word twice because the common word is
never added to the second tree.  One of the arguments to
success() is the node containing the common word; the other is the node in
the other tree that would have been adjacent to the common-word node.

Once again, thanks to everyone who participated by sending a solution or
adding to the discusion, and thanks also to everyone who participated
*without* sending a solution or adding to the discussion.
Thanks especially to Ron Isaacson and Daniel Martin.

John Trammell is working hard on the writeup of last week's
parenthesis-generating quiz; I will send that along when it is ready.
I hope to send out the new quiz tonight.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.pm.org/pipermail/kc/attachments/20040909/e8dc3477/attachment-0001.htm


More information about the kc mailing list