[Neworleans-pm] Fwd: Solutions and Discussion for Perl Quiz of the Week #22 (Expert Edition)

E. Strade, B.D. estrabd at yahoo.com
Wed Sep 8 21:26:14 CDT 2004





----- Original message -----
From: "Daniel Martin" <martin at snowplow.org>
To: perl-qotw at plover.com
Date: Wed, 08 Sep 2004 22:06:47 -0400
Subject: Solutions and Discussion for Perl Quiz of the Week #22 (Expert Edition)


Sample solutions and discussion
Perl Quiz of The Week #22 (20040825)

        Write a program, 'wordladder', which gets two arguments, which are
        words of the same length, and which constructs and prints a "word
        ladder" from the first word to the second word.
        
        A word ladder from word AAA to word BBB is a sequence of dictionary
        words such that:
        
        1. the first word in the sequence is word AAA
        2. each word in the sequence after the first differs from the
           previous word in exactly one letter position
        3. the last word in the sequence is word BBB
        
        For example, given the two words "love" and "hate", the program
        might print the word ladder:
        
                love
                hove
                have
                hate
        
        Or it might print:
        
                love
                lave
                have
                hate
        
        It might also print a longer word ladder, such as
        
                love
                lore
                lobe
                robe
                role
                rose
                lose
                lost
                most
                mosh
                moth
                math
                hath
                hate
        
        If the program is unable to find a word ladder, it should print an
        appropriate error message to the standard error, and exit with a
        failure status.
        
        The program should also accept an optional third argument, which, if
        specified, is the name of a dictionary file which contains the
        permissible words.  If the third argument is omitted, the program
        should use a default dictionary.
        
        Sample word lists are available from
                http://perl.plover.com/qotw/words/
        
---------------------------------------------------

[ MJD: This week's report is a collaboration between three people.
       Ron Isaacson tested and timed the many solutions and wrote up a
       timing report.  Daniel Martin read over the solutions and
       analyzed the algorithms used.  I selected one sample solution
       and wrote up an explanation of how it works. My very grateful
       thanks to Prs. Isaacson and Martin for bailing me out of this
       one when I bit off far more than I could chew.  I now turn you
       over to Pr. Martin. ]

[Daniel Martin]

This was a popular quiz, judging by the number of solutions.  23
solutions were submitted from 21 different people.  Three solutions
weren't tested for one reason or another, leaving 21 solutions to be
tested and timed.  Many thanks to Ron Isaacson for doing the
time-testing.

The overall algorithm used by each solution can be classified as
follows:

- "Expanding circles":

  1) Start with just the word AAA as your working set.  

  2) Find every word which is only one letter different from something
     in the working set and that hasn't been seen before and make that
     your new working set.

  3) Repeat (2) until BBB is in the working set.

  This solution can be viewed as Dijkstra's shortest-path algorithm
  applied to the specifics of this problem (see
  http://en.wikipedia.org/wiki/Dijkstra's_algorithm).  A minimal
  sketch of the idea appears after this list.

  A variant of this has the circles expanding from both AAA and BBB,
  and stopping when the two sets intersect.

  One solution, from Rod Adams, computed each subsequent expanding
  circle by gathering the words that might be in the next working set
  and then eliminating those that were not adjacent to something in
  the previous working set.  (It then switched to the more standard
  version based on an estimate as to which method would be faster.)

- "workqueue":
  1) Place AAA in a queue.
  2) Pull a word from the queue and push onto the queue all the
     adjacent words we haven't seen before.
  3) Repeat (2) until BBB is pulled from the queue. (Variation:
     until BBB is about to be pushed onto the queue)

  This solution can be viewed as the A* algorithm applied to
  the specifics of this problem with no estimate of future
  path length.  (see 
  http://en.wikipedia.org/wiki/A-star_search_algorithm)

  A few people extended this by using a full-blown priority
  queue, with the priority of a word w being
  (distance from w to BBB) + (length of path to w from AAA).

  This is the A* algorithm with a distance estimate.

  As before, there were also bi-directional variants that attacked
  the path from both ends, stopping when the words seen intersected.

  Yes, sometimes the classification between "workqueue" and "expanding
  circles" was a bit of a judgement call about how the code was
  organized overall.  In general, if the code pulled one word at a time
  off a queue it was counted as "workqueue", and if it processed a
  whole set of words at each step it was counted as "expanding circles".

- "depth first":
  1) Start at AAA.
  2) Pick an adjacent word that hasn't been visited yet.
  3) Repeat (2) until you reach BBB or visit everything.
     If there are no available adjacent words, backtrack.

  Only two solutions attempted this.  Both also sorted the
  adjacent words so as to first try words closer to BBB.
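To make the "expanding circles" idea concrete, here is a minimal
stand-alone sketch of the one-directional version.  It is not any
contestant's code; the hard-coded word list and the neighbors()
helper are assumptions made purely for illustration, and a real
program would read a dictionary file instead.

        #!/usr/bin/perl
        # Illustrative sketch of a one-directional "expanding circles"
        # search.  The tiny hard-coded %dict stands in for a real dictionary.
        use strict;
        use warnings;

        my ($aaa, $bbb) = ('love', 'hate');
        my %dict = map { $_ => 1 } qw(love lave have hove hate);

        # All dictionary words exactly one letter away from $word.
        sub neighbors {
            my ($word, $dict) = @_;
            my @out;
            for my $i (0 .. length($word) - 1) {
                for my $c ('a' .. 'z') {
                    my $w = $word;
                    substr($w, $i, 1) = $c;
                    push @out, $w if $w ne $word and $dict->{$w};
                }
            }
            return @out;
        }

        my %parent = ($aaa => undef);   # doubles as the "seen" set
        my @circle = ($aaa);            # the current working set
        until (exists $parent{$bbb}) {
            my @next;
            for my $word (@circle) {
                for my $n (neighbors($word, \%dict)) {
                    next if exists $parent{$n};
                    $parent{$n} = $word;
                    push @next, $n;
                }
            }
            die "No path exists\n" unless @next;  # circle stopped growing
            @circle = @next;
        }

        # Trace the path from BBB back to AAA through the recorded parents.
        my @ladder;
        for (my $w = $bbb; defined $w; $w = $parent{$w}) {
            unshift @ladder, $w;
        }
        print "$_\n" for @ladder;

The bi-directional variants grow a circle from each end and stop when
the two "seen" sets intersect; the A*-style workqueue variants replace
@circle with a priority queue whose priority is the distance to BBB
plus the length of the path so far.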

After overall algorithm choice, there were still several other design
decisions that could be made.  A common source of variation was in how
the dictionary was stored.  Most solutions used a variation on this
design:  (code adapted from Colin Meyer's solution)

      open DICT, '<', $DICT or die "can't open $DICT: $!";
      @words = grep length( $_ ) == $SIZE,
               map { chomp; lc } <DICT>;
      close DICT;

      for my $w ( @words ) {
        my @rungs = map { my $c = $w;
                          substr( $c, $_, 1 ) = '_';
                          $c } 0..($SIZE-1);
        push @{ $rung2word{ $_ } }, $w for @rungs;
      }

That is, they stored the dictionary as a hash of pattern => wordlist, where
"pattern" is the word in question with one character replaced by an
'_' (or '?' or '.' or '\0') and "wordlist" is a reference to all words
with that pattern.  Occasionally, the map word => patternlist would
also be recorded.
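
With the dictionary stored that way, the words adjacent to a given
word can be looked up by regenerating its patterns and merging the
corresponding lists.  A hedged sketch (the adjacent() helper below is
mine, not from any submission), assuming the %rung2word hash and $SIZE
built above:

      # Sketch: collect the dictionary words one letter away from $w,
      # using the %rung2word hash and $SIZE built in the previous snippet.
      sub adjacent {
          my ( $w ) = @_;
          my %seen;
          for my $i ( 0 .. $SIZE - 1 ) {
              my $pattern = $w;
              substr( $pattern, $i, 1 ) = '_';
              $seen{$_}++ for @{ $rung2word{$pattern} || [] };
          }
          delete $seen{$w};   # a word is not adjacent to itself
          return keys %seen;
      }

Each lookup is then a handful of hash fetches rather than a scan of
the whole word list, which is the main attraction of this layout.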

Storing the words in a hash that mapped each word to 1 (or some other
value) was also popular.  (So that keys %words was used whenever the
full list of words was needed.)

A few stored the dictionary in a single giant array, searching through
it with regexps or with Inline::C functions when finding adjacent
words.
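
For example (a sketch of the idea, not code from any particular
submission), the neighbors of a word can be pulled out of such an
array with a single alternation regexp; the short @DICT here is just a
stand-in for a real word list:

      my @DICT = qw(love lave have hove hate);   # stand-in dictionary
      my $word = 'love';
      my @patterns;
      for my $i ( 0 .. length($word) - 1 ) {
          my $p = $word;
          substr( $p, $i, 1 ) = '.';             # wildcard at position $i
          push @patterns, $p;
      }
      my $re = join '|', @patterns;              # ".ove|l.ve|lo.e|lov."
      my @adjacent = grep { $_ ne $word and /^(?:$re)$/ } @DICT;
      # @adjacent is now ("lave", "hove")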

There were also some non-perl solutions submitted: 1 in PIR (Parrot
Intermediate Representation - assembly language for Perl 6, sort of),
2 in python, and 1 in scheme.  Unfortunately, I have no PIR
interpreter and neither of my scheme interpreters was able to handle
the submitted scheme solution (guile chokes on the variable name with
asterisks in it and mzscheme doesn't know about 'sort').  The python
solutions were tested along with the others.

Several of the solutions failed to check for certain edge cases.  For
example, if AAA == BBB, 4 solutions found no path.  3 more solutions
found a path only if AAA was adjacent to something (so testing on
"transubstantiationalist transubstantiationalist" showed no path for
those, but testing "love love" did find a path).  Also, some solutions
had trouble with very large dictionaries or with long paths.

At least one solution made the erroneous assumption that if AAA and
BBB differ only in positions 1, 3, and 5, then all the words along the
path from AAA to BBB will differ from each other only in positions 1,
3, and 5.  This is shown false by the path from "axal" to "utah" in
the Web2 dictionary, which among other words passes through "itch".

Because of the large number of solutions, I won't go through each
one and list the classification.  Instead, this chart might be useful:

            dictionary ->|sig>wordlist|word>1|@DICT
        search method    |            |      |
        ==================================================
        Expanding circles|     0      |  1   |   2
          2-dir version  |     5      |  0   |   1
        --------------------------------------------------
        workqueue        |     3      |  2   |   0
          + distance ver |     2      |  0   |   0
          2-dir version  |     0      |  1   |   0
        --------------------------------------------------
        depth first      |     0      |  2   |   1

I didn't classify the scheme solution from Greg Bacon, nor the PIR
solution from Ingo Blechschmidt.  I also did not classify the perl
solution sent in by David B., since I was unable to puzzle it out.  (I
was also unable to test it - it's doing something that eats memory
like crazy.)

I will note that in general, the depth first solutions all performed
very poorly on this quiz, despite two of them coding crucial parts in
C.  The overall fastest solution was mine, the 2-direction workqueue,
though not by much.

Ron Isaacson's beautifully-formatted timing report is available at:

        http://perl.plover.com/qotw/misc/e022/report.html

The raw timing data, and the submitted programs, are available at

        http://perl.plover.com/qotw/misc/e022/

Although the problem merely specified _a_ path from AAA to BBB, many
solutions found the shortest path.  Therefore, some people speculated
on what the "longest shortest path" might be for different word lengths
in the different dictionaries.  The results can be found in the thread
starting here:

  http://perl.plover.com/~alias/list.cgi?1:mss:2054

----------------------------------------------------------------

[Mark Dominus]

I've decided to use Zed Lopez's program as this week's sample
solution.  According to Ron Isaacson's report, it's consistently one
of the very fastest submissions, and it's much shorter than the other
comparably fast solutions.

Part of that may be due to Zed's use of the Tree::Simple module, which
provides functions that manage tree structures:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Tree::Simple;

And part of it may be due to a somewhat excessively terse coding
style, as seen here:

        my ($start_word, $destination_word, $dictionary_file) = @ARGV;
        die "You must specify two words of equal length"
            unless defined $start_word and defined $destination_word
            and length $start_word == length $destination_word;

But on the whole I found the program straightforward and easy to
understand.

In Daniel Martin's classification, this was an "expanding circles"
solution, of the "two directions at once" variety.  Pr. Lopez's
program maintains two working sets of words, and operates on them
alternately.  On each pass, it finds all the words that are one step
removed from the words in the current working set, and replaces the
current working set with this new set of words.  When it finds a word
that is already in the *other* working set, it knows it has found a
complete path.

        $dictionary_file ||= "/usr/share/dict/web2";
        open (FH, $dictionary_file)
            or die "Couldn't open $dictionary_file: $!";

        $start_word = lc $start_word;
        $destination_word = lc $destination_word;

        my %dict;
        while (<FH>) {
            chomp;
            $dict{lc $_} = 1
                if length $_ == length $start_word and $_ !~ /[^a-zA-Z]/;
        }
        close(FH);
        die "$start_word is not in dictionary" unless exists
        $dict{$start_word};
        die "$destination_word is not in dictionary" unless exists
        $dict{$destination_word};

The dictionary is the usual Perl implementation of a set, with words
as keys, and the values always 1.  

This code handles the special case of the start word and the end word
being the same.  Had I thought about it in advance, I might have ruled
this out when I posed the problem:

        print "$start_word\n" and exit if $start_word eq
        $destination_word;

$list and $next_nodes are the main data structures of the program:

        my ($list, $next_nodes);
        {
            my $top = Tree::Simple->new($start_word);
            my $bottom = Tree::Simple->new($destination_word);

            $list = [{$start_word => $top}, {$destination_word => $bottom}];
            $next_nodes = [[$top],[$bottom]];
        }

The Tree::Simple objects keep track of tree structures whose roots are
labeled with the start and end words; each tree node is labeled with
another word.  If BBB is in the start-word tree, then a word ladder
from the start word to BBB can be found by tracing the path from BBB
back to the root.  The program's word_chain() function does this:

        sub word_chain {
            my $node = shift;
            my @words;
            while (1) {
                push @words, $node->getNodeValue();
                last if $node->isRoot;
                $node = $node->getParent;
            }
            return @words;
        }

(->getNodeValue returns the word with which the node is labeled.)

$list contains exactly two hashes, one working 'down' from the source
word and the other working 'up' from the destination word.  Each hash
maps words to the Tree::Simple objects that represent them.

$next_nodes contains the two 'working sets'.

The main loop of the program is:

        for (my $x = 0; ; $x = !$x) {
            $next_nodes->[$x] = find_next_nodes($next_nodes->[$x], $x);
        }

The '$x = !$x' is a little obscure; he really means '$x = 1 - $x'
here.  (It works because !1 is the empty string, which counts as 0
when used as an array index, but '1 - $x' says what is meant.)  The
main loop alternates between the two working sets, replacing the old
working set with the set of words adjacent to it.
Calculating this new set of words is the job of find_next_nodes():

        sub find_next_nodes {
            my ($nodes, $x) = @_;
            my @next_nodes;
            for my $node (@$nodes) {
                my $orig_word = $node->getNodeValue();
                for (my $i = 0; $i < length $orig_word; $i++) {
                    my $word = $orig_word;
                    for my $char ('a'..'z') {
                        next if $char eq substr $orig_word, $i, 1;
                        substr $word, $i, 1, $char;

This last line uses the four-argument form of substr(), which is
newish.  It was introduced by analogy with the four-argument form of
splice().  substr($word, $i, 1, $char) locates the length-1 substring
of $word at position $i, just like substr($word, $i, 1), but then
replaces this substring with the contents of $char.  It is just like

>MJD>   substr($word, $i, 1) = $char;

only faster.
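
For the curious, a tiny stand-alone example (not from Zed's program)
showing both the replacement and the return value, which is the
substring that was replaced:

        my $word = "cold";
        my $old  = substr($word, 0, 1, "g");   # $word is now "gold"
        print "$old -> $word\n";               # prints "c -> gold"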

The program has just calculated a new word, $word, that is one step
removed from some word $orig_word that was in the working set.  If
this new word is already in the *other* working set, then a complete
path has been found:


                        # if word is in other list, we're done
                        success($x ? ($list->[!$x]->{$word}, $node) : ($node, $list->[!$x]->{$word})) if exists $list->[!$x]->{$word};

I would have formatted this differently.  I think it's clearer like
this:

>MJD>                   success($x ? ($list->[1-$x]->{$word}, $node) 
>MJD>                              : ($node, $list->[1-$x]->{$word}))
>MJD>                       if exists $list->[1-$x]->{$word};

The success() function, which we'll see shortly, is responsible for
tracing the paths from the common word back to the roots of the two
trees, and generating the resulting word ladder.  Its arguments are
the nodes of the two trees that have the same label, with the node
from the top-down tree first.  If $x is true (that is, 1) then the
current working set is the bottom-up one, so needs to be put second in
the argument list of success().

If the program hasn't been successful, it extends the current tree
structure and the new working set:

                        if (exists $dict{$word} and !exists $list->[$x]->{$word}) {
                            my $child = Tree::Simple->new($word);
                            $node->addChild($child); 
                            push @next_nodes, $child;
                            $list->[$x]->{$word} = $child;
                        }
                    }
                }

If there were no new words that could be reached from the current
node, it is a dead end and is deleted from the tree structure to save
memory:

                prune($node) if $node->isLeaf;
            }
            return \@next_nodes;
        }


Here's prune(), which deletes the dead end.  After the dead-end node
is deleted, prune() checks to see if its parent is now a dead end also:

        sub prune {
            my $node = shift;
            while ($node->isLeaf) {
                die "No path exists" if $node->isRoot;
                my $parent = $node->getParent;
                $parent->removeChild($node);
                $node = $parent;
            }
        }


The success() function was called when two nodes were found with the
same label, one in each of the trees.  It uses word_chain() to find
the word ladders from the common word up to the source word (at $node1)
and down to the destination word (at $node2) and appends the two
ladders together:

        sub success {
            my ($node1, $node2) = @_;
            print join "\n", (reverse word_chain($node1)), word_chain($node2), '';
            exit;
        }

The function avoids listing the common word twice because the common
word is never added to the second tree.  One of the arguments to
success() is the node containing the common word; the other is the
node in the other tree that would have been adjacent to the
common-word node.

Once again, thanks to everyone who participated by sending a solution
or adding to the discussion, and thanks also to everyone who
participated *without* sending a solution or adding to the discussion.
Thanks especially to Ron Isaacson and Daniel Martin.

John Trammell is working hard on the writeup of last week's
parenthesis-generating quiz; I will send that along when it is ready.
I hope to send out the new quiz tonight.


