[LA.pm] HTML page word count/density module?

Peter Benjamin pete at peterbenjamin.com
Thu Feb 19 18:39:17 CST 2004


Oh, what was I thinking?  HTML is sort of outside regular Perl.
But Perl is really good with HTML files, given all the modules that support them.

At 04:02 PM 2/19/2004, Benjamin J. Tilly wrote:
>What you're asking for is highly unclear, but whatever
>you want to wind up with, HTML::TokeParser is likely to
>be useful in building it.

Yes, it could be, if I had understood the documentation
the first time I read it.  Tokenizing an HTML page can
be done with a few REs that put each token on its own
line.
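
A minimal sketch of that RE approach, using only core Perl. The
sub name tokens_one_per_line is made up for illustration, and the
pattern is deliberately crude: it assumes no HTML comments, no
<script> bodies, and no ">" inside attribute values.

```perl
use strict;
use warnings;

# Hypothetical helper: put each tag and each run of text on its own line.
# Rough sketch only; comments, <script> bodies and ">" inside attribute
# values are not handled.
sub tokens_one_per_line {
    my ($html) = @_;
    $html =~ s/\s+/ /g;                   # collapse whitespace, including newlines
    $html =~ s/\s*(<[^>]*>)\s*/\n$1\n/g;  # a newline on each side of every tag
    $html =~ s/\n{2,}/\n/g;               # squeeze blank lines left by adjacent tags
    $html =~ s/\A\n+|\n+\z//g;            # trim leading and trailing newlines
    return $html;
}

print tokens_one_per_line("<p>Hello <b>big</b> world</p>"), "\n";
```

For real-world pages a proper parser such as HTML::TokeParser is
probably the safer choice; the REs above are only meant to show the
idea.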

Search Engine Optimizing a web site means knowing what keywords
to salt the page with.  So doing word counts, now called word
densities when a percentage is calculated, is commonplace.

Here is some example output.


perl wordcount.pl ls1.list 

Word    Word    
Count   Percent Word
-----   ------- -----------------
559     4.232   the
320     2.423   to
211     1.597   for
209     1.582   of
205     1.552   a
199     1.507   and
...
...
...

where ls1.list is a list of pathnames to HTML files,
one per record.
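
For anyone who wants to try this, here is a rough sketch of how a
script producing that kind of report might look. It is not the
actual wordcount.pl. The sub names (count_words, report, main), the
crude tag-stripping RE, and the definition of a "word" as a run of
letters with optional apostrophes are all assumptions made for the
example. The percentage is taken as 100 * count / total words
across all files.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: add the words found in one HTML string to %$counts
# and return how many words were seen. Tags and entities are dropped with
# crude REs, which is an assumption of this sketch.
sub count_words {
    my ($html, $counts) = @_;
    $html =~ s/<[^>]*>/ /g;      # drop tags
    $html =~ s/&#?\w+;/ /g;      # drop entities such as &nbsp; or &#160;
    my $n = 0;
    for my $word (lc($html) =~ /[a-z]+(?:'[a-z]+)*/g) {
        $counts->{$word}++;
        $n++;
    }
    return $n;
}

# Hypothetical helper: one formatted line per word, most frequent first,
# ties broken alphabetically.
sub report {
    my ($counts, $total) = @_;
    my @lines;
    for my $w (sort { $counts->{$b} <=> $counts->{$a} || $a cmp $b } keys %$counts) {
        push @lines, sprintf "%-7d %-7.3f %s",
            $counts->{$w}, 100 * $counts->{$w} / $total, $w;
    }
    return @lines;
}

# Reads the list file(s) named on the command line, one pathname per line.
sub main {
    my %counts;
    my $total = 0;
    while (my $path = <>) {
        chomp $path;
        next unless length $path;
        open my $fh, '<', $path
            or do { warn "skipping $path: $!\n"; next };
        my $html = do { local $/; <$fh> };   # slurp the whole file
        close $fh;
        $total += count_words($html, \%counts) if defined $html;
    }
    print "Word    Word\n";
    print "Count   Percent Word\n";
    print "-----   ------- -----------------\n";
    print "$_\n" for report(\%counts, $total);
}

main() if @ARGV;   # only run when a list file is given
```

It would be run the same way as above, for example
"perl wordcount.pl ls1.list". Swapping the tag-stripping RE for
HTML::TokeParser would likely make the counts more reliable on
messy pages.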



