OxPM: Search and Extract

Neil Hoggarth neil.hoggarth at physiol.ox.ac.uk
Thu Dec 5 05:22:40 CST 2002


On Thu, 5 Dec 2002, Julian Martin wrote:

> I would like to search some html pages for a keyword and then extract
> the <p>blah, blah......keyword........blah</p> and then put the
> <p>blah, blah......keyword........blah</p>'s into a results page. Any
> pointers would be great ! I have Perl cookbook for any reference but
> cannot find something like this in it.

You could set the input record separator ("$/", perldoc perlvar for
info) to "<p>", then the kind of while(<>) loop that would normally
process input line-by-line will work paragraph by paragraph. The only
wrinkle would be that the <p> tags would be regarded as the end of the
preceding record rather than part of the paragraph that they start, so
given input like:

<p>para one</p>
<p>para
two</p>
<p>para three</p>

the records in $_ in successive loops would be:

1. <p>

2. para one</p>
   <p>

3. para
   two</p>
   <p>

4. para three</p>
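
The splitting described above can be reproduced with a short script. This is only a sketch: the sample text is read through an in-memory filehandle (which needs Perl 5.8 or later) so that it runs without an input file, and the sub names records_split_on_p and reattach_p are made up for illustration. The second sub shows one possible way round the wrinkle, using chomp (which strips a trailing copy of "$/") and then putting the "<p>" back on the front of each record.

```perl
use strict;
use warnings;

# Read $text with "<p>" as the input record separator and return the records.
sub records_split_on_p {
    my ($text) = @_;
    open my $fh, '<', \$text or die "cannot open in-memory file: $!";
    local $/ = "<p>";              # each read now stops after a "<p>"
    my @records = <$fh>;
    close $fh;
    return @records;
}

# One way to move each "<p>" back to the start of its own paragraph.
sub reattach_p {
    my ($preamble, @rest) = @_;    # first record is whatever precedes the first <p>
    local $/ = "<p>";              # chomp removes a trailing copy of $/
    chomp @rest;
    return map { "<p>$_" } @rest;
}

my $html = "<p>para one</p>\n<p>para\ntwo</p>\n<p>para three</p>\n";

my @records = records_split_on_p($html);
my @paras   = reattach_p(@records);

print "raw record: [$_]\n" for @records;
print "paragraph:  [$_]\n" for @paras;
```

With real files the same "local $/" line would go in front of an ordinary while(<>) loop instead of the in-memory handle.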

If you know that all the HTML that you will be dealing with will be
sufficiently well formed then you could use "</p>" as your record
separator. A lot of HTML in the wild lacks closing tags where browsers
don't require them, though.
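
For the well-formed case, a minimal sketch of the whole search-and-extract job might look like the following. It assumes every paragraph opens with a bare "<p>" (no attributes) and closes with "</p>"; the sub name matching_paragraphs, the sample text and the keyword are invented for the example, and the in-memory filehandle needs Perl 5.8 or later. Note that the keyword test is a plain case-insensitive match, so it would also hit text inside tags.

```perl
use strict;
use warnings;

# Return every "<p>...</p>" block in $html that contains $keyword.
sub matching_paragraphs {
    my ($html, $keyword) = @_;
    open my $fh, '<', \$html or die "cannot open in-memory file: $!";
    local $/ = "</p>";                            # each record ends at a closing tag
    my @hits;
    while (my $rec = <$fh>) {
        # Drop anything before the opening tag; skip records with no full paragraph.
        next unless $rec =~ m{(<p>.*</p>)\z}s;
        my $para = $1;
        push @hits, $para if $para =~ /\Q$keyword\E/i;
    }
    close $fh;
    return @hits;
}

my $html = "<p>Perl is fun</p>\n<p>nothing\nhere</p>\n<p>more perl\ntricks</p>\n";
my @hits = matching_paragraphs($html, 'perl');

# A very plain results page built from the matching paragraphs.
print "<html><body>\n", join("\n", @hits), "\n</body></html>\n";
```

To run it over real pages, the in-memory handle could be replaced by a loop over the files named on the command line.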

Regards,
-- 
Neil Hoggarth                                 Departmental Computer Officer
<neil.hoggarth at physiol.ox.ac.uk>                   Laboratory of Physiology
http://www.physiol.ox.ac.uk/~njh/                  University of Oxford, UK


