[LA.pm] HTML page word count/density module?

Kevin Scaldeferri kevin at scaldeferri.com
Fri Feb 20 01:25:51 CST 2004


On Thursday, February 19, 2004, at 07:04 PM, Peter Benjamin wrote:

> At 05:42 PM 2/19/2004, Robert Spier wrote:
>> Taking HTML out of the picture, this is _trivial_.  (And could be done
>> in awk.)
>>
>> Changing HTML to text in a way suitable for this program is trivial.
>
> I agree.  It's one line of perl.
> Removing the embedded SCRIPT and STYLE - trivial too.
>
> Weighting certain HTML tag text values makes it harder.
> Like the TITLE, H1 to H6 tags, etc.  Search engines
> weight those heavier, so this code must too.
>
> Certain HTML attribute values, like ALT, NAME and TITLE,
> must be flagged as coming from there.
>
> So why not flag where all the text came from...
>
> I was hoping to compare the output and methods used in
> any sample code with mine, and make sure mine did not
> miss any tricks.
>

You might look at Nutch (www.nutch.org), which is an attempt to build 
an open-source web search engine.  I don't know if their indexing or 
scoring actually does anything like this, though.  Also, it's in Java.

Other than that, I expect that anything that exists along these lines 
is likely to be proprietary.
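
For comparison purposes, here is a minimal sketch of the weighting and
flagging Peter describes, written in Python with only the standard
library.  The weights, the class name WeightedWordCounter, and the
"@alt" style source labels are all made up for illustration; the real
weights search engines use are not public, and none of this is taken
from Nutch.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical weights, chosen only for illustration.
TAG_WEIGHTS = {"title": 5, "h1": 4, "h2": 3, "h3": 2, "h4": 2, "h5": 2, "h6": 2}
FLAGGED_ATTRS = ("alt", "name", "title")   # attribute values worth flagging
SKIPPED_TAGS = ("script", "style")         # content that is not page text
# Void elements never get a closing tag, so they must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
WORD_RE = re.compile(r"[a-z0-9']+")


class WeightedWordCounter(HTMLParser):
    """Count words in a page, weighting by enclosing tag and
    remembering where each word was seen."""

    def __init__(self):
        super().__init__()
        self.stack = []          # currently open tags, innermost last
        self.scores = Counter()  # word -> weighted count
        self.sources = {}        # word -> set of tags/attributes it came from

    def _record(self, text, source, weight):
        for word in WORD_RE.findall(text.lower()):
            self.scores[word] += weight
            self.sources.setdefault(word, set()).add(source)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)
        for name, value in attrs:
            if name in FLAGGED_ATTRS and value:
                # Attribute text gets weight 1 and an "@attr" source label.
                self._record(value, "@" + name, 1)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if any(t in SKIPPED_TAGS for t in self.stack):
            return
        # Use the heaviest enclosing tag; on ties, prefer the innermost.
        source = max(reversed(self.stack),
                     key=lambda t: TAG_WEIGHTS.get(t, 1),
                     default="text")
        self._record(data, source, TAG_WEIGHTS.get(source, 1))


if __name__ == "__main__":
    counter = WeightedWordCounter()
    counter.feed('<title>Perl tips</title><h1>Perl</h1>'
                 '<p>Learn perl <img src="x.png" alt="camel logo"></p>')
    for word, score in counter.scores.most_common():
        print(word, score, sorted(counter.sources[word]))
```

Dividing each score by the sum of all scores would give a density
figure.  Keeping the set of sources per word, rather than folding
everything into one number, makes it easy to compare the output against
another implementation and see where the two disagree.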



