[LA.pm] HTML page word count/density module?
Peter Benjamin
pete at peterbenjamin.com
Thu Feb 19 18:39:17 CST 2004
Oh, what was I thinking? HTML is sort of outside regular Perl.
But Perl is really good with HTML files, given all the modules that support it.
At 04:02 PM 2/19/2004, Benjamin J. Tilly wrote:
>What you're asking for is highly unclear, but whatever
>you want to wind up with, HTML::TokeParser is likely to
>be useful in building it.
Yes, it could be, if I had understood the documentation
the first time I read it. Tokenizing an HTML page can
be done with a few REs to separate each token with a
newline.
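For illustration, the regex approach might look something like the sketch below. The sub name tokenize_html is made up for this example, and it is deliberately naive: it does not treat comments or script/style bodies specially, and it is confused by a '>' inside an attribute value.

```perl
use strict;
use warnings;

# Hypothetical helper (not from any module): split HTML into tokens,
# where a token is either a whole tag or a run of non-space text.
# Assumption: no '>' inside attribute values; comments and script
# bodies get no special handling.
sub tokenize_html {
    my ($html) = @_;
    # First alternative grabs a complete <...> tag, second grabs a word.
    my @tokens = $html =~ /(<[^>]*>|[^<\s]+)/g;
    return @tokens;
}

# One token per line, as described above.
print join("\n", tokenize_html('<p>Hello <b>big</b> world</p>')), "\n";
```

Anything that has to cope with real-world pages is probably better off with HTML::TokeParser, as suggested.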
Search engine optimization of a web site means knowing which keywords
to salt the pages with. So doing word counts, now called word
densities when a percentage is calculated, is commonplace.
Here is some example output:
perl wordcount.pl ls1.list
 Word    Word
Count Percent Word
----- ------- -----------------
  559   4.232 the
  320   2.423 to
  211   1.597 for
  209   1.582 of
  205   1.552 a
  199   1.507 and
...
...
...
where ls1.list is a list of pathnames to HTML files,
one per record.
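
A minimal sketch of a script that produces this kind of report is below. It is not the actual wordcount.pl. The sub names count_words and format_report are invented for the example, tags are stripped with a simple regex, and the percentage is assumed to be a word's count divided by the total number of words counted. HTML entities are not decoded, so "&amp;" would be counted as the word "amp".

```perl
use strict;
use warnings;

# Hypothetical helper: add the words found in one chunk of HTML to %$count.
# Assumption: a "word" is a run of ASCII letters and apostrophes.
sub count_words {
    my ($html, $count) = @_;
    $html =~ s/<[^>]*>/ /g;    # crude tag removal
    while ($html =~ /([A-Za-z']+)/g) {
        $count->{ lc($1) }++;  # fold case so "The" and "the" match
    }
    return $count;
}

# Hypothetical helper: build the report text, most frequent word first.
sub format_report {
    my ($count) = @_;
    my $total = 0;
    $total += $_ for values %$count;
    my @lines = (
        " Word    Word",
        "Count Percent Word",
        "----- ------- -----------------",
    );
    for my $w (sort { $count->{$b} <=> $count->{$a} || $a cmp $b } keys %$count) {
        push @lines, sprintf("%5d %7.3f %s",
            $count->{$w}, 100 * $count->{$w} / $total, $w);
    }
    return join("\n", @lines, '');
}

# Usage: perl wordcount.pl ls1.list
# The list file holds one HTML pathname per line.
if (@ARGV) {
    my %count;
    open my $list, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    while (my $path = <$list>) {
        chomp $path;
        next unless length $path;
        open my $fh, '<', $path or do { warn "Skipping $path: $!\n"; next };
        my $html = do { local $/; <$fh> };   # slurp the whole file
        count_words($html, \%count);
    }
    print format_report(\%count);
}
```

Keeping the counting and the reporting in separate subs makes it easy to swap the regex tag stripping for a real parser later.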