[Chicago-talk] Regular expression discussion.

Thu Feb 3 09:51:19 PST 2011

> String::KeyboardDistance (which can do QWERTY and Dvorak US layouts, and
> seems most appropriate to what you're describing);

You should be able to create your own "keyboard" map, one that is really a
map of common OCR confusions rather than typing errors. Going by your
example, t is near f, i is near t, and o is near n.
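
As a rough sketch of that idea (in Python rather than Perl, only to keep it
self-contained): an edit distance where substituting one OCR look-alike for
another costs less than an ordinary substitution. The confusion table, the
0.5 cost and the function names are all made up for illustration; none of
this is the String::KeyboardDistance API.

```python
# Hypothetical table of OCR look-alike pairs and a reduced substitution cost.
# The pairs come from the example in this thread; the 0.5 is an assumption.
OCR_CONFUSIONS = {
    frozenset("ft"): 0.5,
    frozenset("it"): 0.5,
    frozenset("no"): 0.5,
}


def sub_cost(a, b, confusions):
    """Cost of replacing character a with b (0 if they are the same)."""
    if a == b:
        return 0.0
    return confusions.get(frozenset((a, b)), 1.0)


def weighted_distance(s, t, confusions=None):
    """Levenshtein distance with per-pair substitution costs.

    Insertions and deletions always cost 1. With no confusion table this
    is the plain Levenshtein distance.
    """
    confusions = confusions or {}
    prev = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, 1):
        cur = [float(i)]
        for j, b in enumerate(t, 1):
            cur.append(min(
                prev[j] + 1.0,                           # delete a
                cur[j - 1] + 1.0,                        # insert b
                prev[j - 1] + sub_cost(a, b, confusions)  # substitute
            ))
        prev = cur
    return prev[-1]


def similarity(s, t, confusions=None):
    """Similarity in [0, 1]: 1 minus distance over the longer length."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0
    return 1.0 - weighted_distance(s, t, confusions) / longest
```

With the two strings from the original question, the plain distance is 3
substitutions over 41 characters, which is where the roughly 93% figure
comes from. With the table above, each of those substitutions costs 0.5, so
the same pair scores a little over 96%. You would then accept a match when
similarity() is above whatever threshold you settle on.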

Here is an academic article that might help if you have several months to
spend on this problem:

http://archive.nlm.nih.gov/pubs/hauser/Tompaper/tompaper.php

Regards,
Sean

2011/2/3 Ted Zlatanov <tzz at lifelogs.com>

> On Wed, 02 Feb 2011 08:55:36 -0500 (EST) Richard Reina <
> richard at rushlogistics.com> wrote:
>
> RR> Tired of shoveling snow. Well, sit right down and let's have a regex
> RR> discussion. I have a Perl script that at the moment just uses grep to
> RR> look through text files that have been converted with pdf2text to see
> RR> what sort of documents they are.  What I am finding, however, is that a
> RR> lot of searches fail by just a few characters.
> RR> For example, if I am looking for "This first document is a contract
> RR> between", the text string in the file might look like this:
> RR> "This tirst document is a coniract betweeo" and the grep search
> RR> fails. However, as you can see, these two statements are 93% alike.  Is
> RR> there a way with Perl regular expressions to match strings that are,
> RR> say, 90, 95 or 98% alike?
>
> Definitely not with regular expressions.  This is usually called
> string distance (or edit distance); I first learned it in the context
> of Hamming codes, but there it only counts substitutions.  String
> distance turns up a lot in bioinformatics as well, so there's plenty of
> research out there.
>
> I would start with String::Approx, as Warren suggested; it's the one
> I've used.  But also see
>
> String::KeyboardDistance (which can do QWERTY and Dvorak US layouts, and
> seems most appropriate to what you're describing);
>
> http://www.perlmonks.org/?node_id=245428
>
> ... which suggests Text::Levenshtein and String::Trigram as well.
>
> Ted