[Chicago-talk] Regular expression discussion.

Thu Feb 3 07:44:39 PST 2011

On Wed, 02 Feb 2011 08:55:36 -0500 (EST) Richard Reina <richard at rushlogistics.com> wrote: 

RR> Tired of shoveling snow? Well, sit right down and let's have a regex
RR> discussion. I have a Perl script that at the moment just uses grep to
RR> look through text files that have been converted with pdf2text to see
RR> what sort of documents they are.  What I am finding, however, is that a
RR> lot of searches fail by just a few characters.
RR> For example, if I am looking for "This first document is a contract
RR> between" the text string in the file might look like this:
RR> "This tirst document is a coniract betweeo" and the grep search
RR> fails. However, as you can see, these two statements are 93% alike.  Is
RR> there a way with Perl regular expressions to match strings that are,
RR> say, 90, 95 or 98% alike?

Definitely not with regular expressions.  This is usually called
string distance (or edit distance); I first learned it in the context of
Hamming codes, but Hamming distance only counts substitutions between
strings of equal length.  String distance turns up a lot in
bioinformatics as well, so there's plenty of research out there.
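
To make the idea concrete, here is a minimal pure-Perl sketch of one common
string distance, the Levenshtein edit distance, plus a percentage-style
similarity derived from it.  The names edit_distance and similarity are made
up for this illustration and are not taken from any of the CPAN modules
mentioned in this thread; dividing by the longer string's length is just one
reasonable way to turn a distance into a percentage.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Levenshtein edit distance: the minimum number of single-character
# insertions, deletions, and substitutions needed to turn $s into $t.
# (Hypothetical helper for illustration only.)
sub edit_distance {
    my ($s, $t) = @_;
    my @s = split //, $s;
    my @t = split //, $t;

    # Row for the empty prefix of $s: j insertions reach the first j chars of $t.
    my @prev = (0 .. scalar @t);

    for my $i (1 .. scalar @s) {
        my @cur = ($i);    # i deletions reach the empty prefix of $t
        for my $j (1 .. scalar @t) {
            my $cost = $s[$i - 1] eq $t[$j - 1] ? 0 : 1;
            $cur[$j] = min(
                $prev[$j] + 1,            # deletion
                $cur[$j - 1] + 1,         # insertion
                $prev[$j - 1] + $cost,    # substitution (or exact match)
            );
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Similarity between 0 and 1: 1 minus the distance divided by the
# length of the longer string (an assumed normalization).
sub similarity {
    my ($s, $t) = @_;
    my $longest = max(length($s), length($t));
    return 1 if $longest == 0;
    return 1 - edit_distance($s, $t) / $longest;
}

my $want = "This first document is a contract between";
my $got  = "This tirst document is a coniract betweeo";

# prints "distance 3, similarity 93%"
printf "distance %d, similarity %.0f%%\n",
    edit_distance($want, $got), 100 * similarity($want, $got);
```

Note that this compares two whole strings.  To find a phrase somewhere inside
a longer file you would have to compare it against each candidate window of
text, which is the kind of work an approximate-matching module is meant to
take off your hands.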

I would start with String::Approx, as Warren suggested; it's the one
I've used.  But also see

String::KeyboardDistance (which can do QWERTY and Dvorak US layouts, and
seems most appropriate to what you're describing);

http://www.perlmonks.org/?node_id=245428

... which suggests Text::Levenshtein and String::Trigram as well.

Ted