[Chicago-talk] Regular expression discussion.
tzz at lifelogs.com
Thu Feb 3 07:44:39 PST 2011
On Wed, 02 Feb 2011 08:55:36 -0500 (EST) Richard Reina <richard at rushlogistics.com> wrote:
RR> Tired of shoveling snow? Well, sit right down and let's have a regex
RR> discussion. I have a perl script that at the moment just uses grep to
RR> look through text files that have been converted with pdf2text to see
RR> what sort of documents they are. What I am finding however is that a
RR> lot of searches fail by just a few characters.
RR> For example, if I am looking for "This first document is a contract
RR> between", the text string in the file might look like
RR> "This tirst document is a coniract betweeo" and the grep search
RR> fails. However, as you can see these two statements are 93% alike. Is
RR> there a way with perl regular expressions to match strings that are
RR> say 90, 95 or 98% alike?
Definitely not with regular expressions. What you want is usually called
string distance (or edit distance). I first learned it in the context of
Hamming codes, but Hamming distance only counts substitutions between
strings of equal length. String distance also turns up a lot in
bioinformatics, so there's plenty of research out there.
I would start with String::Approx, as Warren suggested; it's the one
I've used. But also see
String::KeyboardDistance (which can do QWERTY and Dvorak US layouts, and
seems most appropriate to what you're describing);
... which suggests Text::Levenshtein and String::Trigram as well.
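For a feel of what a trigram-based comparison does, here is a hand-rolled
sketch in core Perl. It is not String::Trigram's interface or necessarily its
exact scoring; the function names and the choice of the Dice coefficient over
sets of lower-cased three-character substrings are assumptions made for this
illustration.

```perl
use strict;
use warnings;

# Return a hash ref whose keys are the distinct overlapping
# three-character substrings (trigrams) of $text.
sub trigrams {
    my ($text) = @_;
    my %seen;
    $seen{ substr($text, $_, 3) } = 1 for 0 .. length($text) - 3;
    return \%seen;
}

# Dice coefficient over trigram sets: 2 * shared / (size_x + size_y).
# Ranges from 0 (no trigrams in common) to 1 (identical trigram sets).
sub trigram_similarity {
    my ($s, $t) = @_;
    my ($x, $y) = (trigrams(lc $s), trigrams(lc $t));
    my $total = keys(%$x) + keys(%$y);
    return 0 if $total == 0;    # both strings shorter than 3 characters
    my $shared = grep { exists $y->{$_} } keys %$x;
    return 2 * $shared / $total;
}

my $wanted = "This first document is a contract between";
for my $candidate ("This tirst document is a coniract betweeo",
                   "Invoice number and shipping address") {
    printf "%.2f  %s\n", trigram_similarity($wanted, $candidate), $candidate;
}
```

One wrong character spoils up to three trigrams, so this measure drops faster
than an edit distance does for scattered OCR errors, but it is cheap to
compute and does not care much about where in the text the phrase appears.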