[mplspm]: finding plagiarism

Tue Mar 12 18:43:54 CST 2002

I teach a network communications class at a local university by night, 
and by day I hack Perl code to automate lots of good stuff.

After getting fed up with people copying whole paragraphs from one 
another, the book, or web sites, I decided to write a quick Perl 
script to compare phrases from their submissions with a library of 
documents that I have.  I have identified the top couple of sites that 
they like to copy from (heck, I use them for my own research), so I 
can use a robot to download that content for my search purposes.

What I am thinking of is something very much like turnitin.com but 
without actually using their service.  Yes I am cheap - but more 
importantly I think it is a cool project.

What I am looking for are suggestions for existing modules that might 
help me here.  I have looked through CPAN and haven't found anything 
offhand, but maybe I'm not using the right search terms.

I guess I need two things: first, a parsing engine to pull out key 
phrases (4 to 8 words in length, I am guessing), and second, a search 
mechanism that works on these phrases.
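
To make the two pieces concrete, here is a minimal sketch of what I 
have in mind, not a finished design.  It assumes the library fits in 
memory as a hash of document name to text, and it treats every run of 
N consecutive words as a "phrase".  The sub names (words, phrases, 
build_index, match_counts) are made up for illustration and do not 
come from any CPAN module.

```perl
use strict;
use warnings;

# Split text into lowercase words, dropping punctuation.
sub words {
    my ($text) = @_;
    return map { lc($_) } ($text =~ /([A-Za-z0-9']+)/g);
}

# Return every run of $n consecutive words, joined into one string.
sub phrases {
    my ($n, @words) = @_;
    return () if @words < $n;
    return map { join ' ', @words[$_ .. $_ + $n - 1] } 0 .. @words - $n;
}

# Build an index of phrase => { document name => 1 } for the library.
sub build_index {
    my ($n, %library) = @_;
    my %index;
    for my $name (keys %library) {
        $index{$_}{$name} = 1 for phrases($n, words($library{$name}));
    }
    return \%index;
}

# Count, per library document, how many phrases of the submission
# also appear in that document.
sub match_counts {
    my ($index, $n, $submission) = @_;
    my %hits;
    for my $phrase (phrases($n, words($submission))) {
        next unless $index->{$phrase};
        $hits{$_}++ for keys %{ $index->{$phrase} };
    }
    return \%hits;
}

# Example with made-up data: one library document, one submission.
my $index = build_index(4, site_a => 'The quick brown fox jumps over the lazy dog');
my $hits  = match_counts($index, 4, 'I saw the quick brown fox jumps today');
print "$_: $hits->{$_}\n" for sort keys %$hits;
```

A hash lookup per phrase keeps the search step cheap.  The cost is 
memory, since every overlapping phrase in the library gets its own 
key, and exact matching will miss copies where a word or two has 
been changed.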

I have some ideas on the phrase engine - such as ignoring common words 
like "A", "An", "the", "I", etc. - maybe it should just ignore all 1-3 
letter words.
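
Here is a rough sketch of the two filtering options, just to compare 
them.  The sub names and the stop list are made up for illustration, 
and the stop list is far from complete.  One thing I notice already: 
dropping every 1-3 letter word would also throw away terms like 
"TCP", "DNS" and "IP", which matter in a networking class, so an 
explicit stop list may turn out to be the safer choice.

```perl
use strict;
use warnings;

# Option 1: drop every word of $max_len letters or fewer (default 3).
sub drop_short_words {
    my ($text, $max_len) = @_;
    $max_len = 3 unless defined $max_len;
    return grep { length($_) > $max_len }
           map  { lc($_) } ($text =~ /([A-Za-z']+)/g);
}

# Option 2: drop only words found in an explicit stop list.
# This list is a tiny illustrative sample, not a complete one.
my %stop = map { $_ => 1 } qw(a an the i of and to in is it);

sub drop_stop_words {
    my ($text) = @_;
    return grep { !$stop{$_} }
           map  { lc($_) } ($text =~ /([A-Za-z']+)/g);
}

my $sentence = 'The TCP handshake is a three step exchange';
print join(' ', drop_short_words($sentence)), "\n";
print join(' ', drop_stop_words($sentence)),  "\n";
```

The first line printed loses "tcp" along with "the", "is" and "a"; 
the second keeps it.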

Any other ideas are appreciated.  Does one of the search/matching 
modules work better than the others?  If I can't find something, 
I'll probably write it and put it out as my first real module that 
I can actually release.

Thanks,
Dan

--------------------------------------------------
Minneapolis Perl Mongers mailing list
