[LA.pm] HTML page word count/density module?

Peter Benjamin pete at peterbenjamin.com
Thu Feb 19 21:04:58 CST 2004


At 05:42 PM 2/19/2004, Robert Spier wrote:
>Taking HTML out of the picture, this is _trivial_.  (And could be done
>in awk.)
>
>Changing HTML to text in a way suitable for this program is trivial.

I agree.  It's one line of Perl.
Removing the embedded SCRIPT and STYLE blocks is trivial too.
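
For anyone comparing notes, here is a rough sketch of that trivial part,
written in Python rather than as the Perl one-liner.  The function names
and the sample page are made up for illustration, and the regex approach
assumes reasonably well-formed HTML; it is not meant as a robust parser.

```python
import re
from collections import Counter

def html_to_text(html):
    """Crude regex conversion; a sketch, not a parser for broken markup."""
    # Drop SCRIPT and STYLE elements together with their contents.
    html = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", html)
    # Drop comments, then every remaining tag.
    html = re.sub(r"(?s)<!--.*?-->", " ", html)
    return re.sub(r"(?s)<[^>]*>", " ", html)

def word_counts(text):
    """Count lowercase words (letters, digits, apostrophes)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

# Hypothetical sample page.
page = ("<html><head><style>p {color: red}</style></head><body>"
        "<p>Perl is fun. Perl is terse.</p>"
        "<script>var perl = 1;</script></body></html>")

counts = word_counts(html_to_text(page))
total = sum(counts.values())
# Density here means a word's share of all counted words.
density = {word: n / total for word, n in counts.items()}
```

With the sample page, only the paragraph text is counted; the contents of
the STYLE and SCRIPT blocks never reach the word counter.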

Weighting the text inside certain HTML tags makes it harder:
TITLE, H1 through H6, and so on.  Search engines
weight those more heavily, so this code must too.
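
One possible way to do the weighting, sketched in Python with the standard
html.parser module.  The weights in TAG_WEIGHTS are invented for the
example (search engines do not publish theirs), and the class name is
hypothetical.  Each word scores the highest weight among the weighted tags
it sits inside, or 1.0 otherwise.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Assumed weights, for illustration only.
TAG_WEIGHTS = {"title": 5.0, "h1": 4.0, "h2": 3.0, "h3": 2.0}
SKIP = {"script", "style"}

class WeightedCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.open_tags = []      # only weighted or skipped tags are tracked
        self.scores = Counter()  # word -> accumulated weight

    def handle_starttag(self, tag, attrs):
        if tag in TAG_WEIGHTS or tag in SKIP:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Pop back to the matching open tag.
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        if any(t in SKIP for t in self.open_tags):
            return
        weight = max((TAG_WEIGHTS[t] for t in self.open_tags
                      if t in TAG_WEIGHTS), default=1.0)
        for word in re.findall(r"[a-z0-9']+", data.lower()):
            self.scores[word] += weight

# Hypothetical sample page.
parser = WeightedCounter()
parser.feed("<title>Perl tips</title><h1>Perl</h1>"
            "<p>Perl and awk</p><script>var x;</script>")
parser.close()
scores = parser.scores
```

Here "perl" collects 5.0 from the TITLE, 4.0 from the H1 and 1.0 from the
paragraph, while "awk" gets only the default 1.0.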

Certain HTML attribute values, like those of the ALT, NAME and TITLE
attributes, must be flagged as coming from there.

So why not flag where all the text came from...
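
A minimal sketch of flagging every word with its source, again in Python
with html.parser.  The "tag@attribute" label format, the class name and
the sample page are my own assumptions, not an established convention.
Text is labelled with its innermost open tag; ALT, NAME and TITLE values
are labelled with the tag and attribute they came from.

```python
import re
from collections import Counter
from html.parser import HTMLParser

FLAGGED_ATTRS = ("alt", "name", "title")
SKIP = {"script", "style"}
# Void elements never get an end tag, so they are not pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}

class SourceTagger(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []          # currently open tags
        self.counts = Counter()  # (source, word) -> occurrences

    def _add(self, source, text):
        for word in re.findall(r"[a-z0-9']+", text.lower()):
            self.counts[(source, word)] += 1

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in FLAGGED_ATTRS and value:
                self._add(f"{tag}@{name}", value)
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching open tag.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        source = self.stack[-1] if self.stack else "(none)"
        if source not in SKIP:
            self._add(source, data)

# Hypothetical sample page.
tagger = SourceTagger()
tagger.feed('<title>Perl tips</title><body><h1>Perl</h1>'
            '<p>See <img src="c.png" alt="Perl camel"> here</p>'
            '<a name="intro" title="Intro to Perl">intro</a></body>')
tagger.close()
counts = tagger.counts
```

Keeping the source with each word means the weighting can be decided
later, per source, instead of being baked into the counting pass.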

I was hoping to compare the output and methods used in
any sample code with mine, and make sure mine did not
miss any tricks.

--

How's the dinner going?  I had to stay home to prep for
www.LAMPSIG.org meeting on Saturday.  I co-host it.



