<meta http-equiv="content-type" content="text/html; charset=utf-8">&gt; String::KeyboardDistance (which can do QWERTY and Dvorak US layouts, and<br>&gt; seems most appropriate to what you&#39;re describing);<div><br></div>


<div>You should be able to create your own &quot;keyboard&quot; map, one that encodes common OCR confusions rather than typing errors. Going by your example (&quot;tirst&quot;, &quot;coniract&quot;, &quot;betweeo&quot;), f is near t, t is near i, and n is near o.</div><div>
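<div>A minimal sketch of that idea, in Python rather than Perl only to keep it short. This is not the String::KeyboardDistance API: the table OCR_NEAR, the 0.25 weight, and the function names are all made up for illustration. It is an ordinary Levenshtein distance in which swapping two OCR-confusable characters costs less than any other edit.</div>

```python
# Hypothetical confusion table: pairs that OCR tends to mix up, taken
# from the example in this thread (f/t, t/i, n/o). Not from any module.
OCR_NEAR = {frozenset("ft"), frozenset("ti"), frozenset("no")}


def sub_cost(a, b):
    """Cost of reading character a as character b."""
    if a == b:
        return 0.0
    # Confusable pairs are cheap; 0.25 is an arbitrary illustrative weight.
    return 0.25 if frozenset((a, b)) in OCR_NEAR else 1.0


def ocr_distance(expected, found):
    """Levenshtein distance with cheaper substitutions for OCR confusions."""
    # prev[j] holds the distance between the first i-1 characters of
    # expected and the first j characters of found.
    prev = [float(j) for j in range(len(found) + 1)]
    for i, a in enumerate(expected, start=1):
        cur = [float(i)]
        for j, b in enumerate(found, start=1):
            cur.append(min(
                prev[j] + 1.0,                 # delete a
                cur[j - 1] + 1.0,              # insert b
                prev[j - 1] + sub_cost(a, b),  # substitute a with b
            ))
        prev = cur
    return prev[-1]


def similarity(expected, found):
    """Score between 0 and 1; 1.0 means identical."""
    longest = max(len(expected), len(found)) or 1
    return 1.0 - ocr_distance(expected, found) / longest
```

<div>With this table, the three misread characters in the example phrase count as a quarter of an edit each, so the phrase scores well above a 95% threshold, while an unrelated substitution still costs a full edit.</div>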


<br></div><div>Here is an academic article that might help if you have several months to spend on this problem:</div><div><br></div><div><a href="http://archive.nlm.nih.gov/pubs/hauser/Tompaper/tompaper.php">http://archive.nlm.nih.gov/pubs/hauser/Tompaper/tompaper.php</a></div>


<div><br clear="all">Regards,<br>Sean<br><br><br>

<br><br><div class="gmail_quote">2011/2/3 Ted Zlatanov <span dir="ltr">&lt;<a href="mailto:tzz@lifelogs.com">tzz@lifelogs.com</a>&gt;</span><br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">


On Wed, 02 Feb 2011 08:55:36 -0500 (EST) Richard Reina &lt;<a href="mailto:richard@rushlogistics.com">richard@rushlogistics.com</a>&gt; wrote:<br>

<br>

RR&gt; Tired of shoveling snow? Well, sit right down and let&#39;s have a regex<br>

RR&gt; discussion. I have a perl script that at the moment just uses grep to<br>

RR&gt; look through text files that have been converted from pdf2text to see<br>

RR&gt; what sort of documents they are.  What I am finding however is that a<br>

RR&gt; lot of searches fail by just a few characters.<br>

RR&gt; For example, if I am looking for &quot;This first document is a contract between&quot; the text string in the file might look like this<br>

RR&gt; &quot;This tirst document is a coniract betweeo&quot; and the grep search<br>

RR&gt; fails. However, as you can see these two statements are 93% alike.  Is<br>

RR&gt; there a way with perl regular expressions to match strings that are<br>

RR&gt; say 90, 95 or 98% alike?<br>

<br>

Definitely not with regular expressions.  This is usually called the<br>

string distance; I first learned it in the context of Hamming codes but<br>

there it&#39;s only used for substitutions.  String distance turns up a lot<br>

in bioinformatics as well, so there&#39;s plenty of research out there.<br>

<br>

I would start with String::Approx as Warren suggested and it&#39;s the one<br>

I&#39;ve used, but also see<br>

<br>

String::KeyboardDistance (which can do QWERTY and Dvorak US layouts, and<br>

seems most appropriate to what you&#39;re describing);<br>

<br>

<a href="http://www.perlmonks.org/?node_id=245428" target="_blank">http://www.perlmonks.org/?node_id=245428</a><br>

<br>

... which suggests Text::Levenshtein and String::Trigram as well.<br>

<font color="#888888"><br>

Ted<br>

</font><div><div></div><div class="h5">

</div></div></blockquote></div><br></div>