OxPM: Search and Extract

Thu Dec 5 05:44:08 CST 2002

On Thu 05 Dec 2002, Julian Martin <julianmartin at ntlworld.com> wrote:
> I would like to search some html pages for a keyword and then
> extract the <p>blah, blah......keyword........blah</p> and then put
> the <p>blah, blah......keyword........blah</p>'s into a results page.

HTML::PullParser is nice for parsing HTML.  See
  http://search.cpan.org/author/GAAS/HTML-Parser-3.26/lib/HTML/PullParser.pm

and see it in action in the format sub of CGI::Wiki at
  http://search.cpan.org/src/KAKE/CGI-Wiki-0.05/lib/CGI/Wiki.pm

You could probably do something like the following (untested).

------------------------------------------------------------

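use HTML::PullParser;

# $html_page_content is assumed to hold the page's HTML as a string.
my ( $buffer, $matched, @results ) = ( "", 0 );
my $parser = HTML::PullParser->new( doc   => $html_page_content,
                                    start => '"START", tag, text',
                                    end   => '"END", tag, text',
                                    text  => '"TEXT", tag, text'   );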
while ( my $token = $parser->get_token ) {
    my ( $flag, $tag, $text ) = @$token;

    if ( $flag eq "START" and lc($tag) eq "p" ) { # start of a new paragraph
        # If the current buffer matched, add it to the results.
        push @results, $buffer if $matched;
        # Reinitialise the buffer and reset the "matched" flag.
        $buffer = "";
        $matched = 0;
    }

    $buffer .= $text; # whatever this token is, we want it in the buffer

    # Put your keyword matching stuff in here, and set $matched to 1
    # if it does match.

}

# @results should now be an array of strings, each one containing the
# HTML for a paragraph which matched your keywords, and it should be
# in the order in which the paragraphs appeared on the page.
# It won't have the very last paragraph if that matched, though - see below.

------------------------------------------------------------

But I just wrote that off the top of my head, so don't cut and paste
it blindly: read the docs and check it does what I think it does.

You'll also want to add something that checks $matched and pushes the
relevant stuff onto @results after the *last* occurrence of <p> in the
page, because as it stands it only updates @results when it sees a <p>.
Either be clever inside the while loop, or do something after it
finishes.  You won't want to push $buffer on as it is, because it will
contain "</body></html>" or similar.  Left as an exercise cos I only
just thought of it.
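
One possible way to do that exercise, as an untested-in-anger sketch
rather than the definitive answer: track whether we are inside a
paragraph, and flush the buffer at </p>, at the next <p>, at
</body> or </html>, and once more after the loop.  The names
matching_paragraphs, $flush, $plain and $in_para are made up for this
example, and it uses "tagname" rather than "tag" in the argspecs so that
end tags come back as "p" rather than "/p".  The keyword is matched
against the paragraph's accumulated text, not against single tokens.

```perl
use strict;
use warnings;
use HTML::PullParser;

# Hypothetical helper: returns the HTML of each <p> whose text contains
# $keyword (case-insensitive), in document order.
sub matching_paragraphs {
    my ( $html, $keyword ) = @_;
    my ( $buffer, $plain, $in_para, @results ) = ( "", "", 0 );

    my $parser = HTML::PullParser->new(
        doc   => \$html,
        start => '"START", tagname, text',
        end   => '"END", tagname, text',
        text  => '"TEXT", undef, text',
    );

    # Push the buffered paragraph if its text matched, then reset.
    my $flush = sub {
        push @results, $buffer if $in_para and $plain =~ /\Q$keyword\E/i;
        ( $buffer, $plain, $in_para ) = ( "", "", 0 );
    };

    while ( my $token = $parser->get_token ) {
        my ( $flag, $tag, $text ) = @$token;

        # An unclosed <p> ends at the next <p> or at </body> / </html>.
        if (   ( $flag eq "START" and $tag eq "p" )
            or ( $flag eq "END" and $tag =~ /^(?:body|html)$/ ) )
        {
            $flush->();
        }
        $in_para = 1 if $flag eq "START" and $tag eq "p";
        next unless $in_para;    # ignore everything outside paragraphs

        $buffer .= $text;
        $plain  .= $text if $flag eq "TEXT";

        $flush->() if $flag eq "END" and $tag eq "p";
    }
    $flush->();    # the last paragraph, if nothing closed it

    return @results;
}
```

Call it as matching_paragraphs( $html_page_content, "keyword" ).  It
still assumes a paragraph only ends at one of the tags listed above, so
an unclosed <p> followed by, say, a <div> or <table> will drag that
markup along with it.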

Also note that you probably can't count on the page you're parsing
having well-formed HTML in it, so make sure you write tests for edge cases.

There might be a simpler way to do what you want, though, and I think
you might possibly be going about it the wrong way by looking at the
document as a set of paragraphs (unless you really can rely on it
being well-structured into short paragraphs).  For example, what if
the entire page is one long paragraph?  You'd get the whole page
returned.  Then again most screenscrapers do rely on assumptions,
so you're not alone :)
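
If the one-long-paragraph case matters, one option (an assumption on top
of the message above, not something it proposes) is to show only a
window of text around the keyword instead of the whole paragraph.  The
sketch below works on plain text with the tags already stripped; the
name keyword_context and the default width of 60 characters are
invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: return up to $width characters either side of the
# first case-insensitive occurrence of $keyword in $text, with "..."
# marking where text was cut.  Returns nothing if there is no match.
sub keyword_context {
    my ( $text, $keyword, $width ) = @_;
    $width = 60 unless defined $width;

    return unless $text =~ /\Q$keyword\E/i;

    # $-[0] and $+[0] are the start and end offsets of the match.
    my ( $from, $to ) = ( $-[0] - $width, $+[0] + $width );
    $from = 0            if $from < 0;
    $to   = length $text if $to > length $text;

    my $snippet = substr( $text, $from, $to - $from );
    $snippet = "...$snippet" if $from > 0;
    $snippet .= "..."        if $to < length $text;
    return $snippet;
}
```

This cuts on character counts, so it can split a word at either end of
the window; trimming to the nearest whitespace would be a further
refinement.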

Kake
who wrote a screenscraper for a wiki
-- 
http://www.earth.li/~kake/cookery/ - vegan recipes, now with new search feature
http://grault.net/grubstreet/ - the open-source guide to London
http://www.penseroso.com/ - websites for the fine art and antique trade