[Chicago-talk] Regular expression discussion.

Thu Feb 3 17:07:31 PST 2011

One more comment...

How about running the text through a spell checker?  Using the Gmail
spell checker, the following:
This tirst document is a coniract betweeo

was corrected to:
This test document is a contract between

I just used the first word that Gmail suggested.
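
A general dictionary turned "tirst" into "test", so it may help to check
against only the words you expect to see.  Here is a rough Python sketch
of that idea (not Perl, but it should port directly).  VOCABULARY and
correct_words are names I made up for illustration, and the 0.6 cutoff
is just the difflib default, not a tuned value.

```python
import difflib

# Hypothetical vocabulary: the words of the phrases being searched for.
VOCABULARY = ["This", "first", "document", "is", "a", "contract", "between"]


def correct_words(text, vocabulary=VOCABULARY, cutoff=0.6):
    """Replace each word with its closest vocabulary word, if one is close enough."""
    fixed = []
    for word in text.split():
        # get_close_matches returns the best candidates first (may be empty).
        matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
        fixed.append(matches[0] if matches else word)
    return " ".join(fixed)


print(correct_words("This tirst document is a coniract betweeo"))
# This first document is a contract between
```

After this cleanup an ordinary exact search (grep) has a chance of matching.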

On Thu, Feb 3, 2011 at 8:02 PM, Michael Potter <michael at potter.name> wrote:
> It would be interesting to know if OCR usually gets word boundaries
> and character count in each word correct.  If so, you might be able to
> leverage that in the search.
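
To make that concrete: if word boundaries and word lengths do survive
OCR, the word-length pattern can act as a cheap filter, and the surviving
candidates can be compared character by character.  A rough Python sketch
under that assumption follows; find_by_shape and the 0.9 threshold are
made up for illustration.

```python
def find_by_shape(phrase, text, min_similarity=0.9):
    """Yield (window, similarity) for word runs in text shaped like phrase."""
    target = phrase.split()
    shape = [len(w) for w in target]
    total = sum(shape)
    words = text.split()
    n = len(target)
    for start in range(len(words) - n + 1):
        window = words[start:start + n]
        if [len(w) for w in window] != shape:
            continue  # cheap filter: the word lengths differ
        # Lengths agree, so a position-by-position comparison is valid.
        same = sum(a == b
                   for tw, ww in zip(target, window)
                   for a, b in zip(tw, ww))
        similarity = same / total
        if similarity >= min_similarity:
            yield " ".join(window), similarity


text = "Page 1 This tirst document is a coniract betweeo Acme and Bob"
for window, sim in find_by_shape("This first document is a contract between", text):
    print(window, f"{sim:.2f}")
# This tirst document is a coniract betweeo 0.91
```

This breaks down as soon as OCR merges or splits a word, so it is only
worth trying if the boundaries really are reliable.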
>
> On Thu, Feb 3, 2011 at 12:51 PM, Sean Blanton <sean at blanton.com> wrote:
>>> String::KeyboardDistance (which can do QWERTY and Dvorak US layouts, and
>>> seems most appropriate to what you're describing);
>> You should be able to create your own "keyboard" map, which is actually a
>> map of common OCR errors rather than typographical ones. t is near f and i
>> is near t, o near n, according to your example.
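
One way to read the "OCR keyboard map" idea is as an edit distance in
which swapping two look-alike characters costs less than any other
change.  A rough Python sketch, not the String::KeyboardDistance API:
the confusion pairs come only from the example in this thread, and the
0.25 cost is an arbitrary assumption.

```python
# Hypothetical look-alike pairs, taken from the example in this thread.
OCR_CONFUSIONS = {frozenset("ft"), frozenset("it"), frozenset("no")}


def sub_cost(a, b, cheap=0.25):
    """Cost of replacing a with b: free if equal, cheap if they look alike."""
    if a == b:
        return 0.0
    if frozenset((a, b)) in OCR_CONFUSIONS:
        return cheap
    return 1.0


def ocr_distance(s, t):
    """Edit distance with discounted substitutions for look-alike characters."""
    prev = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, 1):
        cur = [float(i)]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1.0,                 # delete a
                           cur[j - 1] + 1.0,              # insert b
                           prev[j - 1] + sub_cost(a, b)))  # substitute
        prev = cur
    return prev[-1]


print(ocr_distance("first", "tirst"))  # 0.25
print(ocr_distance("first", "firsx"))  # 1.0
```

A real map would need pairs and costs measured from your own scanned
documents.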
>> Here is an academic article that might help if you have several months to
>> spend on this problem:
>> http://archive.nlm.nih.gov/pubs/hauser/Tompaper/tompaper.php
>> Regards,
>> Sean
>>
>>
>>
>>
>> 2011/2/3 Ted Zlatanov <tzz at lifelogs.com>
>>>
>>> On Wed, 02 Feb 2011 08:55:36 -0500 (EST) Richard Reina
>>> <richard at rushlogistics.com> wrote:
>>>
>>> RR> Tired of shoveling snow? Well, sit right down and let's have a regex
>>> RR> discussion. I have a Perl script that at the moment just uses grep to
>>> RR> look through text files that have been converted by pdf2text to see
>>> RR> what sort of documents they are.  What I am finding, however, is that a
>>> RR> lot of searches fail by just a few characters.
>>> RR> For example, if I am looking for "This first document is a contract
>>> RR> between", the text string in the file might look like this:
>>> RR> "This tirst document is a coniract betweeo" and the grep search
>>> RR> fails. However, as you can see, these two statements are 93% alike.  Is
>>> RR> there a way with Perl regular expressions to match strings that are,
>>> RR> say, 90, 95 or 98% alike?
>>>
>>> Definitely not with regular expressions.  This is usually called the
>>> string distance; I first learned it in the context of Hamming codes, but
>>> there it's only used for substitutions.  String distance turns up a lot
>>> in bioinformatics as well, so there's plenty of research out there.
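
For anyone who wants to see the "percent alike" number fall out of a
string distance: Hamming distance only counts substitutions between
equal-length strings, while Levenshtein (edit) distance also allows
inserted and dropped characters.  A plain Python sketch of the latter,
with similarity defined (my assumption, one of several possible
definitions) as 1 minus the distance divided by the longer length:

```python
def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1,               # delete a
                           cur[j - 1] + 1,            # insert b
                           prev[j - 1] + (a != b)))   # substitute (free if equal)
        prev = cur
    return prev[-1]


def similarity(s, t):
    """Similarity in [0, 1]: 1 minus the distance over the longer length."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - levenshtein(s, t) / longest


wanted = "This first document is a contract between"
found = "This tirst document is a coniract betweeo"
print(levenshtein(wanted, found))              # 3
print(round(similarity(wanted, found) * 100))  # 93
```

Three substitutions in 41 characters is where the 93% comes from.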
>>>
>>> I would start with String::Approx, as Warren suggested; it's the one
>>> I've used.  But also see
>>>
>>> String::KeyboardDistance (which can do QWERTY and Dvorak US layouts, and
>>> seems most appropriate to what you're describing);
>>>
>>> http://www.perlmonks.org/?node_id=245428
>>>
>>> ... which suggests Text::Levenshtein and String::Trigram as well.
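
The trigram idea is different from edit distance: it compares the sets
of three-character substrings in each string.  A minimal Python sketch
using Jaccard overlap; this is my own illustration and I am not claiming
it is the formula String::Trigram uses.

```python
def trigrams(s):
    """Set of all three-character substrings of s (empty if s is too short)."""
    return {s[i:i + 3] for i in range(len(s) - 2)}


def trigram_similarity(a, b):
    """Jaccard overlap of the two trigram sets, in [0, 1]."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 1.0  # two strings too short to have trigrams: treat as alike
    return len(ta & tb) / len(ta | tb)


wanted = "This first document is a contract between"
print(trigram_similarity(wanted, "This tirst document is a coniract betweeo"))
print(trigram_similarity(wanted, "Invoice number and shipping address"))
```

Each wrong character spoils up to three trigrams, so the score drops
faster than a per-character percentage would.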
>>>
>>> Ted
>>> _______________________________________________
>>> Chicago-talk mailing list
>>> Chicago-talk at pm.org
>>> http://mail.pm.org/mailman/listinfo/chicago-talk
>>
>

-- 
Michael Potter
Replatform Technologies, LLC
+1 770 815 6142
michael at potter.name