[pm-h] document similarity demo script related to lightning talk

B. Estrade estrabd at gmail.com
Fri Nov 14 14:03:38 PST 2014


First, I really enjoyed last night. I learned a lot of really cool things.
If you think what you have to say is of no interest, think again :)

Now, here is a more sophisticated method for determining the similarity
between any two given documents.  In the case of the script, I am comparing
a sampling of eBay item titles. The method is taken directly from Section
5.7 of Practical Text Mining With Perl; I just cleaned it up and modified it
for my purposes.

The result is a square matrix (MxM, given M documents) that relates every
"document" to every other. Each entry is a measure of similarity ranging
from 1 (exact match) down to 0.
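
To make the shape of that matrix concrete, here is a minimal sketch. It is
not the Perl script from the repository: it is written in Python, it assumes
cosine similarity over raw word counts as the measure, and the sample titles
and function names (bag_of_words, cosine, similarity_matrix) are made up for
illustration.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercase, split on whitespace, and count each word (order is discarded)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def similarity_matrix(docs):
    """Return an MxM list of lists; entry [i][j] compares docs[i] with docs[j]."""
    vectors = [bag_of_words(d) for d in docs]
    return [[cosine(u, v) for v in vectors] for u in vectors]


# Hypothetical item titles, invented for this example.
titles = [
    "vintage leather camera bag",
    "leather camera bag vintage",
    "new garden hose reel",
]

matrix = similarity_matrix(titles)
for row in matrix:
    print(" ".join(f"{value:.2f}" for value in row))
```

Under these assumptions the diagonal is 1 (each title compared with itself),
the first two titles also score 1 because they contain the same words in a
different order, and the third title scores 0 against the others because it
shares no words with them.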

https://github.com/estrabd/lightning-talks/tree/master/houston-pm-13-nov-2014-text-mining

I forgot to mention last night that the method uses what is called a "bag
of words" model - meaning that word order doesn't matter.  Word order may
be taken into account using "n-grams" - strings of consecutive words - and
I imagine the same method would apply; it just greatly increases the number
of entries in each document vector.
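
As a rough illustration of the n-gram idea (again a Python sketch, not code
from the book or the script; the helper name ngram_counts and the two sample
titles are hypothetical), counting bigrams alongside single words lets two
titles with the same words in a different order be told apart, at the cost
of more entries per vector:

```python
from collections import Counter


def ngram_counts(text, n):
    """Count every run of n consecutive words in the text."""
    words = text.lower().split()
    grams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return Counter(grams)


# Hypothetical titles: same words, different order.
a = "red leather camera bag"
b = "camera bag red leather"

# With single words (n = 1) the two vectors are identical.
same_as_bags = ngram_counts(a, 1) == ngram_counts(b, 1)

# With bigrams (n = 2) they differ, so word order now has an effect.
same_as_bigrams = ngram_counts(a, 2) == ngram_counts(b, 2)

# Keeping both words and bigrams gives a longer vector for the same title.
combined = ngram_counts(a, 1) + ngram_counts(a, 2)

print(same_as_bags, same_as_bigrams, len(ngram_counts(a, 1)), len(combined))
```

Here the four-word title has 4 entries as a bag of words and 7 once its 3
bigrams are added; over a whole collection of documents the number of
distinct n-grams grows much faster than the number of distinct words.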

There's a lot to this book, so maybe I'll have something interesting the
next time we do another round of these talks.

Brett