[mplspm]: finding plagiarism
Dan Oelke
dan at oelke.com
Tue Mar 12 18:43:54 CST 2002
I teach a network communications class at a local university by night,
and by day I hack Perl code to automate lots of good stuff.
After getting fed up with people copying whole paragraphs from one
another, the book, or web sites, I decided to write a quick Perl
script to compare phrases from their submissions with a library of
documents that I have. I have identified the top couple of sites that
they like to copy from (heck, I use them for my own research), so I
can use a robot to pull down that content for my search purposes.
What I am thinking of is something very much like turnitin.com but
without actually using their service. Yes I am cheap - but more
importantly I think it is a cool project.
What I am looking for are suggestions for existing modules that might
help me here. I have looked through CPAN and haven't found anything
offhand, but maybe I'm not using the right search terms.
I guess I need two things: first, a parsing engine to pull out key
phrases (4 to 8 words in length, I am guessing), and second, a search
mechanism that works on these phrases.
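
To make the two pieces concrete, here is a rough sketch, written in
Python purely for illustration. It assumes a "phrase" is every run of
N consecutive words and that the "search mechanism" is a plain lookup
table from phrase to the documents containing it. The function names
and the default phrase length of 5 are made up for this example and do
not come from any existing module.

```python
import re
from collections import defaultdict


def phrases(text, size=5):
    """Return every run of `size` consecutive words in `text`, lowercased."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]


def build_index(library, size=5):
    """Map each phrase to the set of library document names containing it.

    `library` is assumed to be a dict of {document name: full text}.
    """
    index = defaultdict(set)
    for name, text in library.items():
        for phrase in phrases(text, size):
            index[phrase].add(name)
    return index


def find_matches(submission, index, size=5):
    """Return {document name: [phrases shared with the submission]}."""
    hits = defaultdict(list)
    for phrase in phrases(submission, size):
        for name in index.get(phrase, ()):
            hits[name].append(phrase)
    return dict(hits)
```

The index only needs to be built once per library, after which each
submission costs one lookup per phrase. The same phrase size has to be
used for both the library and the submissions, otherwise nothing will
ever match.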
I have some ideas on the phrase engine - such as ignoring common words
like "A", "An", "the", "I", etc. - maybe it should just ignore all 1-3
letter words.
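
Both filtering ideas can be tried side by side with something like the
sketch below (again Python, for illustration only). The stop-word list
is just the handful of words named above, and the name
`significant_words` and the `min_length` parameter are assumptions made
for this example.

```python
import re

# Assumed starter list: only the common words named in the post.
STOP_WORDS = {"a", "an", "the", "i"}


def significant_words(text, min_length=4, stop_words=STOP_WORDS):
    """Return the lowercased words of `text` that survive both filters.

    A word is dropped if it is shorter than `min_length` characters or
    if it appears in `stop_words`. With min_length=4, all 1-3 letter
    words are ignored; with min_length=1, only the stop list applies.
    """
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [w for w in words if len(w) >= min_length and w not in stop_words]
```

One thing to watch with the blanket length rule: in a networking class
it would also throw away short but meaningful terms such as "TCP" or
"IP", so an explicit stop list may turn out to be the safer of the two.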
Any other ideas are appreciated. Is there a search/matching module
that might work better than the others? If I can't find something
I'll probably write it myself and put it out as my first real module
that I can actually release.
Thanks,
Dan