[Purdue-pm] Perl Challenge 96.2: Edit Distance

Ward, Mark Daniel mdw at purdue.edu
Tue Jan 19 03:13:24 PST 2021


Dear Dave,

Dan Gusfield’s book (1997) is a really readable introduction to algorithms about string comparisons.  Highly recommended.

https://www.amazon.com/dp/0521585198/

I also recommend Crochemore’s book:

https://www.amazon.com/dp/1107670993/

These books are both written in a really insightful way, if you are wanting to learn more about this.

Esko Ukkonen is another person who came up with some mind-blowing ways to shorten the runtime of some algorithms, if you get into long string comparisons and heavy computations.  There are slide decks by various people who discuss some of Ukkonen’s innovative approaches.

I wrote my Ph.D. thesis on problems related to string comparisons, and I’m still interested in this area, but I don’t stay on top of every single innovation.  It is a broad field.  I hope that this is a useful starting point.

Warmest regards,

Mark

Mark Daniel Ward, Ph.D.
Professor of Statistics and
(by courtesy) of Agricultural & Biological Engineering,
Mathematics, and Public Health;
Director of The Data Mine;
Interim Co-Director of the Integrative Data Science Initiative
Purdue University
1301 Third Street
West Lafayette, IN 47906-4206
mdw at purdue.edu
datamine at purdue.edu
phone: (765) 496-9563

From: Purdue-pm <purdue-pm-bounces+mdw=purdue.edu at pm.org> on behalf of Dave Jacoby <jacoby.david at gmail.com>
Date: Monday, January 18, 2021 at 8:59 PM
To: Purdue Perl Mongers <purdue-pm at pm.org>
Subject: [Purdue-pm] Perl Challenge 96.2: Edit Distance

I'm writing because I do not currently understand my next step. I may blog this. This is interesting because for most of the PWC challenges, there may be clever here or there, but mostly it's a simple matter of programming.

The Weekly Challenge - 096 (perlweeklychallenge.org)<https://perlweeklychallenge.org/blog/perl-weekly-challenge-096/>

You are given two strings $S1 and $S2.

Write a script to find out the minimum operations required to convert $S1 into $S2.
The operations can be insert, remove or replace a character.
Please check out Wikipedia page (Edit distance - Wikipedia<https://en.wikipedia.org/wiki/Edit_distance>) for more information.

The first example in the challenge and the wikipedia page is "kitten" and "sitting". There are three changes necessary to turn one to the other, shown by capitals.

    kitten
    Sitten
    sittIn
    sittinG

As mentioned in the Wikipedia article, we can get to the count of steps using the Levenshtein Distance. I know about that because I saw it in perlbrew internals: This is how it knows to
suggest "exec" when you accidentally type "perlbrew esec", but I hadn't looked into HOW it works inside. So, I can say "there are three operations" but I can't, within the program, say we have to insert a G, replace K with S, and replace E with I.

So I finally read to understand the LD algorithm, rather than restate it from Wikipedia and say that's that. To demonstrate, we'll start comparing "test" with "test".  In the below examples, the first row and the first column are preset, counting up from zero.

     t e s t
  [0,1,2,3,4]
t [1,0,1,2,3]
e [2,1,0,1,2]
s [3,2,1,0,1]
t [4,3,2,1,0]

If nothing is different in position [i,j], it simply copies the value of position i-1, j-i, and that transfers down,so that position [-1,-1], the last column of the last row, is 0, so there's no difference.

     t e s t
  [0,1,2,3,4]
t [1,0,1,2,3]
e [2,1,0,1,2]
x [3,2,1,1,2]
t [4,3,2,2,1]

Here we see that [3,3] is 1. It got that way because there is a difference, and took the minimum value of [2,2], [2,3], and [3,2], then added one.
     l i g h t s
  [0,1,2,3,4,5,6]
s [1,1,2,3,4,5,5]
l [2,1,2,3,4,5,6]
i [3,2,1,2,3,4,5]
g [4,3,2,1,2,3,4]
h [5,4,3,2,1,2,3]
t [6,5,4,3,2,1,2]

This compares "slight" with "lights", which, given the add, remove or change rules, give us two. This would be removing the S from the beginning and adding S to the end. 2 changes, as [-1,-1] indicates.

I'm thinking that the way forward is to travel up and to the left with the minimum of those three values, making no change if the minimum equals the current and

This Looks Like A Job For RECURSION! , but I believe most challenges look like that, until I think harder. Anyone else have thoughts about a better way to solve this?
--
Dave Jacoby
jacoby.david at gmail.com<mailto:jacoby.david at gmail.com>

“There is nothing obvious”
    — Theo
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.pm.org/pipermail/purdue-pm/attachments/20210119/1cc39a05/attachment-0001.html>


More information about the Purdue-pm mailing list