[Chicago-talk] Regular expression discussion.

Sean Blanton sean at blanton.com
Fri Feb 4 06:18:58 PST 2011


>
> How about running the text through a spell checker?  Using the Gmail
> spell checker, the following:
> This tirst document is a coniract betweeo
> was corrected to:
> This test document is a contract between
> I just used the first word that gmail suggested.


Is there an API for this, so that one could automate choosing the first
suggested word?
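
One way to automate "take the first suggestion" without relying on any
particular webmail spell checker is to rank dictionary words by similarity
and keep the top hit. A minimal sketch in Python, and only a sketch: this is
not Gmail's spell checker. `difflib.get_close_matches` is from the standard
library; `WORDS` is a tiny hypothetical word list standing in for a real
dictionary, and `first_suggestion` and `correct_line` are names made up for
this illustration.

```python
import difflib

# Hypothetical stand-in for a real dictionary; a real run would load a word list.
WORDS = ["this", "first", "document", "is", "a", "contract", "between"]

def first_suggestion(word, vocabulary=WORDS, cutoff=0.6):
    """Return the closest vocabulary word, or the word unchanged if none is close."""
    if word.lower() in vocabulary:
        return word
    # n=1 keeps only the best-scoring candidate, i.e. the "first suggestion".
    matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word

def correct_line(line):
    """Apply first_suggestion to every whitespace-separated word."""
    return " ".join(first_suggestion(w) for w in line.split())

print(correct_line("This tirst document is a coniract betweeo"))
```

With a vocabulary this small the garbled line comes back as the intended
phrase; with a full dictionary the top suggestion can be a different word
(as "tirst" to "test" above shows), so the result is only as good as the
word list and the cutoff.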

Regards,
Sean




On Thu, Feb 3, 2011 at 8:50 PM, <richard at rushlogistics.com> wrote:

> I think this might be a good idea, as character count and spacing are
> usually consistent.
>
> -----Original Message-----
> From: Michael Potter <michael at potter.name>
> Sender: chicago-talk-bounces+richard=rushlogistics.com at pm.org
> Date: Thu, 3 Feb 2011 20:02:51
> To: Chicago.pm chatter<chicago-talk at pm.org>
> Reply-To: "Chicago.pm chatter" <chicago-talk at pm.org>
> Subject: Re: [Chicago-talk] Regular expression discussion.
>
> It would be interesting to know whether OCR usually gets word boundaries
> and the character count of each word right.  If so, you might be able to
> leverage that in the search.
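
If OCR does preserve word boundaries and per-word character counts, the
comparison reduces to counting substituted characters position by position
(a Hamming distance). A minimal sketch in Python under that assumption;
`hamming` and `similar_enough` are illustrative names, and the 0.9 threshold
is only an example value.

```python
def hamming(a, b):
    """Count positions where two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))

def similar_enough(expected, found, threshold=0.9):
    """True if the word shapes line up and at least `threshold` of characters agree."""
    exp_words, got_words = expected.split(), found.split()
    # Word count and per-word lengths must match for this shortcut to apply.
    if [len(w) for w in exp_words] != [len(w) for w in got_words]:
        return False
    total = sum(len(w) for w in exp_words)
    if total == 0:
        return True  # two empty phrases trivially match
    errors = sum(hamming(e, g) for e, g in zip(exp_words, got_words))
    return (total - errors) / total >= threshold

print(similar_enough("This first document is a contract between",
                     "This tirst document is a coniract betweeo"))
```

The length check doubles as a cheap pre-filter: candidate phrases whose word
shapes differ are rejected before any character comparison happens.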
>
> On Thu, Feb 3, 2011 at 12:51 PM, Sean Blanton <sean at blanton.com> wrote:
> >> String::KeyboardDistance (which can do QWERTY and Dvorak US layouts, and
> >> seems most appropriate to what you're describing);
> > You should be able to create your own "keyboard" map, which is actually a
> > map of common OCR errors rather than typographical ones. In your example,
> > t is near f, i is near t, and o is near n.
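
Such a map could be as simple as a table of character pairs that OCR tends
to confuse, with those substitutions made cheaper in an edit-distance
calculation. A minimal sketch in Python, not String::KeyboardDistance's
actual interface: the `OCR_CONFUSIONS` pairs come only from the example in
this thread, the 0.25 cost is an arbitrary assumption, and `sub_cost` and
`ocr_distance` are illustrative names.

```python
# Pairs taken from the example in this thread; a real table would be larger.
OCR_CONFUSIONS = {frozenset("ft"), frozenset("it"), frozenset("no")}
CONFUSION_COST = 0.25  # assumed cost of a likely OCR mix-up (a normal edit costs 1)

def sub_cost(a, b):
    """Cost of replacing character a with character b."""
    if a == b:
        return 0.0
    return CONFUSION_COST if frozenset((a, b)) in OCR_CONFUSIONS else 1.0

def ocr_distance(s, t):
    """Edit distance with cheaper substitutions for known OCR confusions."""
    prev = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        curr = [float(i)]
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                     # delete a
                            curr[j - 1] + 1,                 # insert b
                            prev[j - 1] + sub_cost(a, b)))   # substitute
        prev = curr
    return prev[-1]

print(ocr_distance("This first document is a contract between",
                   "This tirst document is a coniract betweeo"))
```

With these assumed weights the example pair scores 0.75 (three likely OCR
confusions) where an unweighted distance would give 3, so OCR-style damage
can be separated from genuinely different text.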
> > Here is an academic article that might help if you have several months to
> > spend on this problem:
> > http://archive.nlm.nih.gov/pubs/hauser/Tompaper/tompaper.php
> > Regards,
> > Sean
> >
> >
> >
> >
> > 2011/2/3 Ted Zlatanov <tzz at lifelogs.com>
> >>
> >> On Wed, 02 Feb 2011 08:55:36 -0500 (EST) Richard Reina
> >> <richard at rushlogistics.com> wrote:
> >>
> >> RR> Tired of shoveling snow? Well, sit right down and let's have a regex
> >> RR> discussion. I have a Perl script that at the moment just uses grep to
> >> RR> look through text files that have been converted with pdf2text to see
> >> RR> what sort of documents they are.  What I am finding, however, is that
> >> RR> a lot of searches fail by just a few characters.
> >> RR> For example, if I am looking for "This first document is a contract
> >> RR> between" the text string in the file might look like this
> >> RR> "This tirst document is a coniract betweeo" and the grep search
> >> RR> fails. However, as you can see, these two statements are 93% alike.
> >> RR> Is there a way with Perl regular expressions to match strings that
> >> RR> are, say, 90, 95 or 98% alike?
> >>
> >> Definitely not with regular expressions.  This is usually called the
> >> string distance; I first learned it in the context of Hamming codes but
> >> there it's only used for substitutions.  String distance turns up a lot
> >> in bioinformatics as well, so there's plenty of research out there.
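
For the percentage in the question, one common approach is to compute the
Levenshtein (edit) distance, which unlike Hamming distance also counts
insertions and deletions, and divide it by the length of the longer string.
A minimal, dependency-free sketch in Python; `levenshtein` and `similarity`
are illustrative names, and this normalisation is one convention among
several.

```python
def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,              # delete a
                            curr[j - 1] + 1,          # insert b
                            prev[j - 1] + (a != b)))  # substitute (free if equal)
        prev = curr
    return prev[-1]

def similarity(s, t):
    """Fraction between 0.0 and 1.0; 1.0 means the strings are identical."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(s, t) / longest

wanted = "This first document is a contract between"
found = "This tirst document is a coniract betweeo"
print(f"{similarity(wanted, found):.0%}")
```

For the two strings from the question this gives 3 edits over 41 characters,
which rounds to the 93% quoted above.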
> >>
> >> I would start with String::Approx as Warren suggested and it's the one
> >> I've used, but also see
> >>
> >> String::KeyboardDistance (which can do QWERTY and Dvorak US layouts, and
> >> seems most appropriate to what you're describing);
> >>
> >> http://www.perlmonks.org/?node_id=245428
> >>
> >> ... which suggests Text::Levenshtein and String::Trigram as well.
> >>
> >> Ted
> >>_______________________________________________
> >> Chicago-talk mailing list
> >> Chicago-talk at pm.org
> >> http://mail.pm.org/mailman/listinfo/chicago-talk
>
>
>
> --
> Michael Potter
> Replatform Technologies, LLC
> +1 770 815 6142 <tel:+17708156142>
> michael at potter.name