[Chicago-talk] Regular expression discussion.

richard at rushlogistics.com richard at rushlogistics.com
Thu Feb 3 18:50:30 PST 2011


I think this might be a good idea, as character count and spacing are usually consistent.

-----Original Message-----
From: Michael Potter <michael at potter.name>
Sender: chicago-talk-bounces+richard=rushlogistics.com at pm.org
Date: Thu, 3 Feb 2011 20:02:51 
To: Chicago.pm chatter<chicago-talk at pm.org>
Reply-To: "Chicago.pm chatter" <chicago-talk at pm.org>
Subject: Re: [Chicago-talk] Regular expression discussion.

It would be interesting to know whether OCR usually gets word boundaries
and the character count in each word correct.  If so, you might be able to
leverage that in the search.
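
A minimal sketch of that idea, assuming the OCR output keeps the word boundaries and the length of every word intact: slide the phrase over the text one word at a time and count substituted characters per word. The names hamming and find_phrase and the error budget are hypothetical, chosen only for illustration, and the comparison is case-sensitive.

```perl
use strict;
use warnings;

# Hamming distance: number of differing characters between two words
# of equal length. Returns undef when the lengths differ.
sub hamming {
    my ($x, $y) = @_;
    return unless length($x) == length($y);
    my $diff = 0;
    for my $i (0 .. length($x) - 1) {
        $diff++ if substr($x, $i, 1) ne substr($y, $i, 1);
    }
    return $diff;
}

# Slide the phrase over the text word by word. A window matches when
# every word has the expected length and the total number of
# substituted characters stays within $max_errors.
sub find_phrase {
    my ($phrase, $text, $max_errors) = @_;
    my @want = split ' ', $phrase;
    my @have = split ' ', $text;
    WINDOW: for my $start (0 .. @have - @want) {
        my $errors = 0;
        for my $k (0 .. $#want) {
            my $d = hamming($want[$k], $have[$start + $k]);
            next WINDOW unless defined $d;          # word length differs
            $errors += $d;
            next WINDOW if $errors > $max_errors;   # too many substitutions
        }
        return join ' ', @have[$start .. $start + $#want];
    }
    return;   # no window matched
}

my $text = "Page 1 This tirst document is a coniract betweeo the parties";
my $hit  = find_phrase("This first document is a contract between", $text, 4);
print defined $hit ? "matched: $hit\n" : "no match\n";
```

With a budget of 4 substitutions the sample text matches; with a budget of 2 it does not, because the three damaged words cost one substitution each. If the OCR step ever merges or splits words, this approach fails and a full edit distance is the safer choice.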

On Thu, Feb 3, 2011 at 12:51 PM, Sean Blanton <sean at blanton.com> wrote:
>> String::KeyboardDistance (which can do QWERTY and Dvorak US layouts, and
>> seems most appropriate to what you're describing);
> You should be able to create your own "keyboard" map, which is actually a
> map of common OCR errors rather than typographical ones. t is near f and i
> is near t, o near n, according to your example.
> Here is an academic article that might help if you have several months to
> spend on this problem:
> http://archive.nlm.nih.gov/pubs/hauser/Tompaper/tompaper.php
> Regards,
> Sean
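
One way to sketch the "OCR keyboard" idea, without relying on any CPAN module: an edit distance in which substitutions between characters that OCR tends to confuse cost less than arbitrary substitutions. The three confusable pairs are taken from the example in this thread; the 0.5 cost and the names sub_cost and ocr_distance are assumptions made for illustration, not part of String::KeyboardDistance.

```perl
use strict;
use warnings;

# Hypothetical confusion table: character pairs an OCR engine is
# assumed to mix up. Stored in both directions.
my %confusable;
for my $pair (['f', 't'], ['t', 'i'], ['n', 'o']) {
    my ($x, $y) = @$pair;
    $confusable{"$x$y"} = $confusable{"$y$x"} = 1;
}

# Substitution cost: free for identical characters, cheap (0.5, an
# arbitrary choice) for confusable pairs, 1 otherwise.
sub sub_cost {
    my ($x, $y) = @_;
    return 0   if $x eq $y;
    return 0.5 if $confusable{"$x$y"};
    return 1;
}

# Edit distance by dynamic programming, with the weighted substitution
# cost above; insertions and deletions still cost 1.
sub ocr_distance {
    my ($s, $t) = @_;
    my @prev = (0 .. length($t));
    for my $i (1 .. length($s)) {
        my @cur = ($i);
        for my $j (1 .. length($t)) {
            my $best = $prev[$j - 1]
                     + sub_cost(substr($s, $i - 1, 1), substr($t, $j - 1, 1));
            $best = $prev[$j] + 1    if $prev[$j] + 1 < $best;      # deletion
            $best = $cur[$j - 1] + 1 if $cur[$j - 1] + 1 < $best;   # insertion
            push @cur, $best;
        }
        @prev = @cur;
    }
    return $prev[-1];
}

print ocr_distance("contract", "coniract"), "\n";   # t/i is in the table
print ocr_distance("contract", "conxract"), "\n";   # t/x is not
```

A real table would have to be built from observed OCR output; the pairs here only mirror the single example sentence.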
>
> 2011/2/3 Ted Zlatanov <tzz at lifelogs.com>
>>
>> On Wed, 02 Feb 2011 08:55:36 -0500 (EST) Richard Reina
>> <richard at rushlogistics.com> wrote:
>>
>> RR> Tired of shoveling snow? Well, sit right down and let's have a regex
>> RR> discussion. I have a Perl script that at the moment just uses grep to
>> RR> look through text files that have been converted from pdf2text to see
>> RR> what sort of documents they are.  What I am finding, however, is that a
>> RR> lot of searches fail by just a few characters.
>> RR> For example, if I am looking for "This first document is a contract
>> RR> between" the text string in the file might look like this:
>> RR> "This tirst document is a coniract betweeo" and the grep search
>> RR> fails. However, as you can see, these two statements are 93% alike.  Is
>> RR> there a way with Perl regular expressions to match strings that are,
>> RR> say, 90, 95 or 98% alike?
>>
>> Definitely not with regular expressions.  This is usually called the
>> string distance; I first learned it in the context of Hamming codes, but
>> there it's only used for substitutions.  String distance turns up a lot
>> in bioinformatics as well, so there's plenty of research out there.
>>
>> I would start with String::Approx, as Warren suggested; it's the one
>> I've used. But also see
>>
>> String::KeyboardDistance (which can do QWERTY and Dvorak US layouts, and
>> seems most appropriate to what you're describing);
>>
>> http://www.perlmonks.org/?node_id=245428
>>
>> ... which suggests Text::Levenshtein and String::Trigram as well.
>>
>> Ted
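
To make "string distance" and the "93% alike" figure concrete, here is a small pure-Perl sketch of Levenshtein distance turned into a similarity ratio. It does not use the CPAN modules named above; the function names and the choice to divide by the length of the longer string are assumptions for illustration.

```perl
use strict;
use warnings;

# Levenshtein distance by dynamic programming: insertions, deletions
# and substitutions each cost 1. Only two rows are kept in memory.
sub levenshtein {
    my ($s, $t) = @_;
    my @prev = (0 .. length($t));
    for my $i (1 .. length($s)) {
        my @cur = ($i);
        for my $j (1 .. length($t)) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            my $best = $prev[$j - 1] + $cost;                       # substitution
            $best = $prev[$j] + 1    if $prev[$j] + 1 < $best;      # deletion
            $best = $cur[$j - 1] + 1 if $cur[$j - 1] + 1 < $best;   # insertion
            push @cur, $best;
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Hypothetical helper: similarity as 1 - distance / length of the
# longer string, so identical strings score 1.
sub similarity {
    my ($s, $t) = @_;
    my $max = length($s) > length($t) ? length($s) : length($t);
    return 1 if $max == 0;
    return 1 - levenshtein($s, $t) / $max;
}

my $want = "This first document is a contract between";
my $ocr  = "This tirst document is a coniract betweeo";
printf "distance=%d similarity=%.1f%%\n",
    levenshtein($want, $ocr), 100 * similarity($want, $ocr);
```

The two sample strings are 41 characters long and differ by three substitutions, which gives roughly the 93% quoted in the question. A threshold such as similarity(...) >= 0.90 could then stand in for the exact grep match, applied to each candidate line or window of text.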
>>_______________________________________________
>> Chicago-talk mailing list
>> Chicago-talk at pm.org
>> http://mail.pm.org/mailman/listinfo/chicago-talk
>
>
>



-- 
Michael Potter
Replatform Technologies, LLC
+1 770 815 6142
michael at potter.name

