From julianmartin at ntlworld.com  Thu Dec  5 05:00:28 2002
From: julianmartin at ntlworld.com (Julian Martin)
Date: Thu Aug  5 00:08:36 2004
Subject: OxPM: Search and Extract
Message-ID: <005901c29c4d$857ecdd0$67980050@DADDYMDDR2G0O3>

Hi

I would like to search some html pages for a keyword and then extract the
<p>blah, blah......keyword........blah</p> and then put the
<p>blah, blah......keyword........blah</p>'s into a results page.

Any pointers would be great! I have Perl cookbook for any reference but
cannot find something like this in it.

Thanks

Julian.

From Kavanagm at oup.co.uk  Thu Dec  5 05:15:31 2002
From: Kavanagm at oup.co.uk (KAVANAGH, Michael)
Date: Thu Aug  5 00:08:36 2004
Subject: OxPM: Search and Extract
Message-ID: <852ED745A1B8D411BBD900B0D0789E0C08A1CB98@EXC05.oup.co.uk>

Hi Julian:

Have you tried CPAN? HTML::Index::Search

-Mike Kavanagh

-----Original Message-----
From: Julian Martin [mailto:julianmartin@ntlworld.com]
Sent: Thursday, December 05, 2002 11:00 AM
To: oxford-pm-list@happyfunball.pm.org
Subject: OxPM: Search and Extract

Hi

I would like to search some html pages for a keyword and then extract the
<p>blah, blah......keyword........blah</p> and then put the
<p>blah, blah......keyword........blah</p>'s into a results page.

Any pointers would be great! I have Perl cookbook for any reference but
cannot find something like this in it.

Thanks

Julian.

From Kevin.ADM-Gibbs at Alcan.Com  Thu Dec  5 05:18:56 2002
From: Kevin.ADM-Gibbs at Alcan.Com (Kevin.ADM-Gibbs@Alcan.Com)
Date: Thu Aug  5 00:08:36 2004
Subject: OxPM: Search and Extract
Message-ID:

) you are interested in. You'll then need to use regular
expressions to determine if the text contains your keyword. Alternatively
you could use regular expressions to do the whole thing but that could be
trickier.
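
A minimal sketch of the regex-only route mentioned above, using core Perl only. The sub name matching_paragraphs and the sample HTML are made up for illustration, and it assumes every paragraph has an explicit closing </p> tag, which real pages often lack. That is one reason this route is the trickier one.

```perl
use strict;
use warnings;

# Hypothetical helper: return every <p>...</p> block in $html whose
# visible text mentions $keyword (case-insensitive).
sub matching_paragraphs {
    my ( $html, $keyword ) = @_;
    my @results;

    # Non-greedy match from an opening <p ...> to the next </p>.
    # /s lets "." cross newlines, /i accepts <P> as well as <p>.
    while ( $html =~ m{(<p\b[^>]*>.*?</p>)}gis ) {
        my $para = $1;

        # Crude tag stripping, so a keyword inside a tag name or
        # attribute does not count as a match.
        ( my $text = $para ) =~ s/<[^>]*>//g;

        push @results, $para if $text =~ /\Q$keyword\E/i;
    }
    return @results;
}

my $html = <<'HTML';
<html><body>
<p>first paragraph, nothing here</p>
<p>second one mentions the keyword in passing</p>
<P class="x">third has a <b>KEYWORD</b> too</P>
</body></html>
HTML

print "$_\n" for matching_paragraphs( $html, 'keyword' );
```

With the sample above, only the second and third paragraphs are printed. Nested or unclosed tags will confuse this pattern, so treat it as a starting point rather than a parser.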
Cheers,
Kev.

If you set $/ to "<p>", then the kind of while(<>) loop that would normally
process input line-by-line will work paragraph by paragraph. The only
wrinkle would be that the <p> tags would be regarded as the end of the
preceding record rather than part of the paragraph that they start, so
given input like:

<p>para one <p>para
two <p>para three

the records would be:

1. <p>
2. para one <p>
3. para
two <p>
4. para three
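
A rough sketch of the record-separator idea, assuming the page is already in a string and uses lowercase <p> tags with no attributes ($/ is a literal string, so "<P>" or "<p class=...>" would not split). The sub name paragraph_records is made up for illustration. It reads from an in-memory filehandle, strips the trailing "<p>" from each record with chomp, and puts the tag back at the front of the paragraph it belongs to.

```perl
use strict;
use warnings;

# Hypothetical helper: split an HTML string into one record per
# paragraph by using "<p>" as the input record separator.
sub paragraph_records {
    my ($html) = @_;

    # In-memory filehandle, so the sketch needs no file on disk.
    open my $fh, '<', \$html or die "cannot open in-memory handle: $!";

    local $/ = "<p>";    # records now end at each "<p>"

    # The first record is whatever precedes the first "<p>"; discard it.
    my $preamble = <$fh>;

    my @records;
    while ( my $record = <$fh> ) {
        chomp $record;                    # removes the trailing "<p>"
        next unless $record =~ /\S/;      # skip empty paragraphs
        push @records, "<p>$record";      # tag goes back at the front
    }
    close $fh;
    return @records;
}

my @paras = paragraph_records("<p>para one<p>para\ntwo<p>para three");
print "[$_]\n" for grep { /two/ } @paras;
```

Filtering the returned list with grep, as in the last line, is then enough to keep only the paragraphs that mention a keyword.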
> <p>blah, blah......keyword........blah</p> and then put
> the <p>blah, blah......keyword........blah</p>'s into a results page.

HTML::PullParser is nice for parsing HTML. See
http://search.cpan.org/author/GAAS/HTML-Parser-3.26/lib/HTML/PullParser.pm
and see it in action in the format sub of CGI::Wiki at
http://search.cpan.org/src/KAKE/CGI-Wiki-0.05/lib/CGI/Wiki.pm

You could probably do something like the following (untested).

------------------------------------------------------------
use HTML::PullParser;

my ( $buffer, $matched, @results );

my $parser = HTML::PullParser->new(
    doc   => $html_page_content,
    start => '"START", tag, text',
    end   => '"END", tag, text',
    text  => '"TEXT", tag, text'
);

while ( my $token = $parser->get_token ) {
    my ( $flag, $tag, $text ) = @$token;
    if ( $flag eq "START" and lc($tag) eq "p" ) {   # start of a new paragraph
        # If the current buffer matched, add it to the results.
        push @results, $buffer if $matched;
        # Reinitialise the buffer and reset the "matched" flag.
        $buffer  = "";
        $matched = 0;
    }
    $buffer .= $text;   # whatever this token is, we want it in the buffer

    # Put your keyword matching stuff in here, and set $matched to 1
    # if it does match.
}

# @results should now be an array of strings, each one containing the
# HTML for a paragraph which matched your keywords, and it should be
# in the order in which the paragraphs appeared on the page.
# It won't have the very last paragraph if that matched, though - see below.
------------------------------------------------------------

But I just wrote that off the top of my head, so don't just cut and paste
it blindly, read the docs and check it does what I think it does.

You'll also want to add something that checks $matched and pushes the
relevant stuff onto @results after the *last* occurrence of <p> in the
page, because as it stands it only updates @results when it sees a <p>
- either be clever inside the while loop, or do something after it
finishes. You won't just want to push $buffer on, because it will contain "