[Chicago-talk] Regular expression discussion.
Sean Blanton
sean at blanton.com
Thu Feb 3 09:51:19 PST 2011
> String::KeyboardDistance (which can do QWERTY and Dvorak US layouts, and
> seems most appropriate to what you're describing);
You should be able to create your own "keyboard" map, one that is actually
a map of common OCR errors rather than typing errors. Going by your example
("tirst", "coniract", "betweeo"), t is "near" f, i is near t, and o is
near n.
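
To make the idea concrete, here is a rough sketch of an edit distance in
which substitutions between OCR-confusable characters cost less than other
substitutions. It is in Python rather than Perl, it does not use
String::KeyboardDistance or any other CPAN module, and the names
(OCR_CONFUSIONS, ocr_distance, similarity) and the 0.25 weight are my own
assumptions for illustration; the only confusion pairs listed are the three
from the example above.

```python
# Hypothetical OCR confusion map: unordered character pairs that OCR tends
# to mix up, each with an assumed (arbitrary) substitution cost below 1.0.
# The three pairs come from the example in this thread:
# "first" -> "tirst", "contract" -> "coniract", "between" -> "betweeo".
OCR_CONFUSIONS = {
    frozenset("ft"): 0.25,
    frozenset("ti"): 0.25,
    frozenset("no"): 0.25,
}


def substitution_cost(a, b):
    """Cost of reading character a as character b."""
    if a == b:
        return 0.0
    # Unlisted pairs fall back to the ordinary Levenshtein cost of 1.0.
    return OCR_CONFUSIONS.get(frozenset((a, b)), 1.0)


def ocr_distance(expected, observed):
    """Weighted Levenshtein distance, computed one row at a time."""
    # prev[j] is the distance between the empty prefix and observed[:j].
    prev = [float(j) for j in range(len(observed) + 1)]
    for i, ca in enumerate(expected, 1):
        curr = [float(i)]
        for j, cb in enumerate(observed, 1):
            curr.append(min(
                prev[j] + 1.0,                          # drop ca
                curr[j - 1] + 1.0,                      # insert cb
                prev[j - 1] + substitution_cost(ca, cb),  # substitute
            ))
        prev = curr
    return prev[-1]


def similarity(expected, observed):
    """Score in [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(expected), len(observed))
    if longest == 0:
        return 1.0
    return 1.0 - ocr_distance(expected, observed) / longest
```

With these weights the two strings from the original question come out at a
distance of 0.75 instead of 3, so a threshold such as "similarity above
0.95" accepts likely OCR slips while still penalizing unrelated differences
at full cost.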
Here is an academic article that might help if you have several months to
spend on this problem:
http://archive.nlm.nih.gov/pubs/hauser/Tompaper/tompaper.php
Regards,
Sean
2011/2/3 Ted Zlatanov <tzz at lifelogs.com>
> On Wed, 02 Feb 2011 08:55:36 -0500 (EST) Richard Reina <
> richard at rushlogistics.com> wrote:
>
> RR> Tired of shoveling snow. Well sit right down and let's have a regex
> RR> discussion. I have a Perl script that at the moment just uses grep to
> RR> look through text files that have been converted by pdf2text to see
> RR> what sort of documents they are. What I am finding however is that a
> RR> lot of searches fail by just a few characters.
> RR> For example, if I am looking for "This first document is a contract
> RR> between" the text string in the file might look like this
> RR> "This tirst document is a coniract betweeo" and the grep search
> RR> fails. However, as you can see these two statements are 93% alike. Is
> RR> there a way with Perl regular expressions to match strings that are
> RR> say 90, 95 or 98% alike?
>
> Definitely not with regular expressions. This is usually called the
> string distance; I first learned it in the context of Hamming codes but
> there it's only used for substitutions. String distance turns up a lot
> in bioinformatics as well, so there's plenty of research out there.
>
> I would start with String::Approx as Warren suggested and it's the one
> I've used, but also see
>
> String::KeyboardDistance (which can do QWERTY and Dvorak US layouts, and
> seems most appropriate to what you're describing);
>
> http://www.perlmonks.org/?node_id=245428
>
> ... which suggests Text::Levenshtein and String::Trigram as well.
>
> Ted