SPUG: removing common words

Dean Hudson dean at ero.com
Thu May 4 00:00:38 CDT 2000


On Wed, 3 May 2000, Christopher Cavnor wrote:

> Does anyone know of a module that can extract common words (aka "stop
> words") from a text file or scalar? Specifically, I want to parse
> something like:
> 
> "The foo that foo's it's foo is likely to foo time and time again" 
> to something like this -> "foo foo's foo likely foo time time again" 

Here are a few lists I found by searching Google for "stop words" and
"stop words lists":

http://www.library.csustan.edu/catalog/doc/oclc5.htm
http://www.access.gpo.gov/su_docs/dpos/stopword.html
http://www.cqs.washington.edu/crisp/lit/stop.html

The lists seem surprisingly short, so you could probably whip something up
that has basic functionality pretty quickly...
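
Something along these lines might be a starting point. It is only a rough
sketch: the %stop hash and the remove_stop_words name are made up for
illustration, and the six-word list is just enough to cover the example
sentence, not taken from any of the lists above. It assumes that splitting
on whitespace and matching case-insensitively is good enough, so
punctuation stays attached to whatever word it touches.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative stop list only (an assumption); a real one would be
# loaded from a file built from a published stop-word list.
my %stop = map { $_ => 1 } qw(the that it's is to and);

# Return $text with stop words removed. Tokens are whitespace-separated
# and compared case-insensitively; surviving words keep their original
# case and are rejoined with single spaces.
sub remove_stop_words {
    my ($text) = @_;
    return join ' ', grep { !$stop{ lc $_ } } split ' ', $text;
}

print remove_stop_words(
    "The foo that foo's it's foo is likely to foo time and time again"
), "\n";
```

A hash lookup keeps the per-word check cheap even if the stop list grows
to a few hundred entries, which is about the size of the lists linked
above.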

dean.
--
my $email = qr{ dean(h)?@(?(1)verio\.net    # @ work if h
                            | ero\.com) }x; # other


 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
     POST TO: spug-list at pm.org       PROBLEMS: owner-spug-list at pm.org
 Seattle Perl Users Group (SPUG) Home Page: http://www.halcyon.com/spug/
 SUBSCRIBE/UNSUBSCRIBE: Replace "action" below by subscribe or unsubscribe
           Email to majordomo at pm.org: "action" spug-list your_address




