<meta http-equiv="content-type" content="text/html; charset=utf-8"><blockquote class="gmail_quote" style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0.8ex; border-left-width: 1px; border-left-color: rgb(204, 204, 204); border-left-style: solid; padding-left: 1ex; ">
<span class="Apple-style-span" style="font-family: arial, sans-serif; font-size: 13px; border-collapse: collapse; ">How about running the text through a spell checker? Using the Gmail<br></span>spell checker, the following:<br>
This tirst document is a coniract betweeo<br>was corrected to:<br>This test document is a contract between<br>I just used the first word that Gmail suggested.</blockquote>
<div><font class="Apple-style-span" face="arial, sans-serif"><span class="Apple-style-span" style="border-collapse: collapse;"><br></span></font></div><div><font class="Apple-style-span" face="arial, sans-serif"><span class="Apple-style-span" style="border-collapse: collapse;">Is there an API for this, so that one could automate choosing the first suggested word?</span></font></div>
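<div><br></div><div>A minimal sketch of the "take the first suggestion" step, in Python rather than Perl and not tied to Gmail: it uses the standard library's difflib.get_close_matches against a word list. VOCAB, first_suggestion and correct_line are made-up names for illustration, and the tiny vocabulary stands in for a real dictionary file.</div>

```python
import difflib

# Hypothetical vocabulary; a real run would load a full dictionary file.
VOCAB = ["this", "first", "test", "document", "is", "a", "contract", "between"]


def first_suggestion(word, vocab=VOCAB, cutoff=0.6):
    """Return the closest vocabulary word, or the word itself if none is close."""
    if word.lower() in vocab:
        return word  # already a known word, leave it alone
    matches = difflib.get_close_matches(word.lower(), vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_line(line):
    """Replace every word on the line with its first suggestion."""
    return " ".join(first_suggestion(w) for w in line.split())


print(correct_line("This tirst document is a coniract betweeo"))
```

<div>With this vocabulary the garbled line comes back as "This first document is a contract between". As the Gmail result quoted above shows ("test" rather than "first"), the top suggestion is not always the intended word, so the output still deserves a check.</div>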
<div><font class="Apple-style-span" face="arial, sans-serif"><span class="Apple-style-span" style="border-collapse: collapse;"><br clear="all"></span></font>Regards,<br>Sean<br><br><br>
<br><br><div class="gmail_quote">On Thu, Feb 3, 2011 at 8:50 PM, <span dir="ltr"><<a href="mailto:richard@rushlogistics.com">richard@rushlogistics.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
I think this might be a good idea, as character count and spacing are usually consistent.<br>
<div class="im">
<br>
</div><div><div></div><div class="h5">-----Original Message-----<br>
From: Michael Potter <<a href="mailto:michael@potter.name">michael@potter.name</a>><br>
Sender: chicago-talk-bounces+richard=rushlogistics.com@pm.org<br>
Date: Thu, 3 Feb 2011 20:02:51<br>
To: Chicago.pm chatter<<a href="mailto:chicago-talk@pm.org">chicago-talk@pm.org</a>><br>
Reply-To: "Chicago.pm chatter" <<a href="mailto:chicago-talk@pm.org">chicago-talk@pm.org</a>><br>
Subject: Re: [Chicago-talk] Regular expression discussion.<br>
<br>
It would be interesting to know if OCR usually gets word boundaries<br>
and character count in each word correct. If so, you might be able to<br>
leverage that in the search.<br>
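<br>
A rough Python sketch of that idea, assuming OCR really does preserve word boundaries and per-word lengths: slide over the text word by word, skip any window whose word lengths differ from the phrase, and count character substitutions only (a Hamming-style comparison). The names hamming_similarity and find_phrase are invented for this example.<br>

```python
def hamming_similarity(expected, candidate):
    """Fraction of positions that agree; None if the lengths differ."""
    if len(expected) != len(candidate):
        return None
    same = sum(1 for x, y in zip(expected, candidate) if x == y)
    return same / len(expected)


def find_phrase(phrase, text, threshold=0.9):
    """Yield (similarity, window) for word windows shaped like the phrase."""
    target = phrase.split()
    if not target:
        return
    shape = [len(w) for w in target]
    words = text.split()
    for i in range(len(words) - len(target) + 1):
        window = words[i:i + len(target)]
        if [len(w) for w in window] != shape:
            continue  # word lengths differ, so skip without comparing characters
        score = hamming_similarity(" ".join(target), " ".join(window))
        if score >= threshold:
            yield score, " ".join(window)


text = "Page 1 This tirst document is a coniract betweeo the parties"
for score, window in find_phrase("This first document is a contract between", text):
    print(round(score, 2), window)
```

On the example from this thread, 38 of the 41 characters agree, which is about 93%. If OCR merges or splits words, the shape test rejects the window and this approach misses the match.<br>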
<br>
On Thu, Feb 3, 2011 at 12:51 PM, Sean Blanton <<a href="mailto:sean@blanton.com">sean@blanton.com</a>> wrote:<br>
>> String::KeyboardDistance (which can do QWERTY and Dvorak US layouts, and<br>
>> seems most appropriate to what you're describing);<br>
> You should be able to create your own "keyboard" map, which is actually a<br>
> map of common OCR errors rather than typographical ones. t is near f and i<br>
> is near t, o near n, according to your example.<br>
> Here is an academic article that might help if you have several months to<br>
> spend on this problem:<br>
> <a href="http://archive.nlm.nih.gov/pubs/hauser/Tompaper/tompaper.php" target="_blank">http://archive.nlm.nih.gov/pubs/hauser/Tompaper/tompaper.php</a><br>
> Regards,<br>
> Sean<br>
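<br>
One way the "OCR keyboard map" could look, sketched in plain Python rather than with String::KeyboardDistance: an edit distance in which substitutions between known OCR look-alikes cost less than other edits. The confusion pairs come only from the example in this thread, and the 0.25 cost and the names OCR_CONFUSIONS, substitution_cost and ocr_distance are assumptions for illustration.<br>

```python
# Hypothetical confusion pairs taken from the example in this thread (f/t, t/i, n/o);
# a real table would be measured from actual OCR output.
OCR_CONFUSIONS = {frozenset("ft"), frozenset("ti"), frozenset("no")}


def substitution_cost(x, y, cheap=0.25):
    """0 for identical characters, a small cost for look-alikes, 1 otherwise."""
    if x == y:
        return 0.0
    return cheap if frozenset((x, y)) in OCR_CONFUSIONS else 1.0


def ocr_distance(a, b):
    """Edit distance with cheaper substitutions for likely OCR confusions."""
    prev = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [float(i)]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1.0,        # delete ca
                            curr[j - 1] + 1.0,    # insert cb
                            prev[j - 1] + substitution_cost(ca, cb)))
        prev = curr
    return prev[-1]


print(ocr_distance("first", "tirst"))  # look-alike substitution
print(ocr_distance("first", "fixst"))  # unrelated substitution
```

The first call costs 0.25 and the second 1.0, so a likely OCR slip counts against a match far less than an arbitrary difference.<br>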
><br>
> 2011/2/3 Ted Zlatanov <<a href="mailto:tzz@lifelogs.com">tzz@lifelogs.com</a>><br>
>><br>
>> On Wed, 02 Feb 2011 08:55:36 -0500 (EST) Richard Reina<br>
>> <<a href="mailto:richard@rushlogistics.com">richard@rushlogistics.com</a>> wrote:<br>
>><br>
>> RR> Tired of shoveling snow? Well, sit right down and let's have a regex<br>
>> RR> discussion. I have a Perl script that at the moment just uses grep to<br>
>> RR> look through text files that have been converted from pdf2text to see<br>
>> RR> what sort of documents they are. What I am finding, however, is that a<br>
>> RR> lot of searches fail by just a few characters.<br>
>> RR> For example, if I am looking for "This first document is a contract<br>
>> RR> between", the text string in the file might look like this:<br>
>> RR> "This tirst document is a coniract betweeo" and the grep search<br>
>> RR> fails. However, as you can see, these two statements are 93% alike. Is<br>
>> RR> there a way with Perl regular expressions to match strings that are,<br>
>> RR> say, 90, 95 or 98% alike?<br>
>><br>
>> Definitely not with regular expressions. This is usually called the<br>
>> string distance; I first learned it in the context of Hamming codes but<br>
>> there it's only used for substitutions. String distance turns up a lot<br>
>> in bioinformatics as well, so there's plenty of research out there.<br>
>><br>
>> I would start with String::Approx as Warren suggested and it's the one<br>
>> I've used, but also see<br>
>><br>
>> String::KeyboardDistance (which can do QWERTY and Dvorak US layouts, and<br>
>> seems most appropriate to what you're describing);<br>
>><br>
>> <a href="http://www.perlmonks.org/?node_id=245428" target="_blank">http://www.perlmonks.org/?node_id=245428</a><br>
>><br>
>> ... which suggests Text::Levenshtein and String::Trigram as well.<br>
>><br>
>> Ted<br>
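<br>
To make the "90, 95 or 98% alike" question concrete, here is a small Python sketch of a Levenshtein-based similarity score. It is plain Python written for this example and does not use or mirror the interfaces of String::Approx or Text::Levenshtein; the names levenshtein and similarity, and the choice to normalize by the longer string's length, are assumptions.<br>

```python
def levenshtein(a, b):
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def similarity(a, b):
    """1.0 for identical strings, falling toward 0.0 as edits accumulate."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


wanted = "This first document is a contract between"
found = "This tirst document is a coniract betweeo"
print(levenshtein(wanted, found), round(similarity(wanted, found), 3))
```

For the two strings in the original question the distance is 3 over 41 characters, a similarity of about 0.927, so the pair would pass a 90% cutoff but not a 95% one.<br>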
>>_______________________________________________<br>
>> Chicago-talk mailing list<br>
>> <a href="mailto:Chicago-talk@pm.org">Chicago-talk@pm.org</a><br>
>> <a href="http://mail.pm.org/mailman/listinfo/chicago-talk" target="_blank">http://mail.pm.org/mailman/listinfo/chicago-talk</a><br>
><br>
><br>
><br>
<br>
<br>
<br>
--<br>
Michael Potter<br>
Replatform Technologies, LLC<br>
<a href="tel:+17708156142">+1 770 815 6142</a><br>
<a href="mailto:michael@potter.name">michael@potter.name</a><br>
</div></div></blockquote></div><br></div>