OxPM: Search and Extract
Neil Hoggarth
neil.hoggarth at physiol.ox.ac.uk
Thu Dec 5 05:22:40 CST 2002
On Thu, 5 Dec 2002, Julian Martin wrote:
> I would like to search some html pages for a keyword and then extract
> the <p>blah, blah......keyword........blah</p> and then put the
> <p>blah, blah......keyword........blah</p>'s into a results page. Any
> pointers would be great ! I have Perl cookbook for any reference but
> cannot find something like this in it.
You could set the input record separator ("$/"; see perldoc perlvar for
info) to "<p>". The kind of while(<>) loop that would normally process
input line by line will then work paragraph by paragraph. The only
wrinkle is that each <p> tag is treated as the end of the preceding
record rather than as part of the paragraph that it starts, so given
input like:
<p>para one</p>
<p>para
two</p>
<p>para three</p>
the records in $_ on successive iterations of the loop would be:
1. <p>
2. para one</p>
   <p>
3. para
   two</p>
   <p>
4. para three</p>
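As a rough sketch of that loop (not tested against real-world pages), the following reads from an in-memory string rather than from files, chomps the trailing "<p>" off each record and puts it back on the front. The sub name paragraphs_matching and the sample keyword are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the paragraphs of $html that contain $keyword.
# Assumes every paragraph starts with a literal lowercase "<p>" tag.
sub paragraphs_matching {
    my ($html, $keyword) = @_;
    local $/ = "<p>";                 # read "<p>"-terminated records
    open my $fh, '<', \$html or die "cannot open in-memory handle: $!";
    my @hits;
    while (my $record = <$fh>) {
        chomp $record;                # chomp strips the current $/, i.e. "<p>"
        $record =~ s/\s+\z//;         # drop the newline left before the next tag
        next unless length $record;   # the first record is just "<p>"
        # Restore the opening tag that the separator swallowed.
        push @hits, "<p>$record" if $record =~ /\Q$keyword\E/;
    }
    close $fh;
    return @hits;
}

my $sample = "<p>para one</p>\n<p>para\ntwo</p>\n<p>para three</p>\n";
print "$_\n" for paragraphs_matching($sample, 'two');
```

Note that the match is against the whole record, so a keyword that also appears in markup between paragraphs would be picked up as well.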
If you know that all the HTML that you will be dealing with is
sufficiently well formed, then you could use "</p>" as your record
separator instead. A lot of HTML in the wild lacks closing tags where
browsers don't require them, though.
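A minimal sketch of the "</p>" variant, under the assumption that every paragraph really is closed and that the tags are lowercase; the sub name closed_paragraphs_matching is invented for the example. Each record now ends with its own closing tag, so only the text before the opening <p> needs trimming.

```perl
use strict;
use warnings;

# Hypothetical helper: return the <p>...</p> blocks of $html containing $keyword.
# Assumes well-formed input in which every <p> has a matching </p>.
sub closed_paragraphs_matching {
    my ($html, $keyword) = @_;
    local $/ = "</p>";                       # read "</p>"-terminated records
    open my $fh, '<', \$html or die "cannot open in-memory handle: $!";
    my @hits;
    while (my $record = <$fh>) {
        next unless $record =~ m{</p>\z};    # skip trailing text after the last </p>
        # Discard anything before the opening <p>; skip records without one.
        $record =~ s/\A.*?(?=<p\b)//s or next;
        push @hits, $record if $record =~ /\Q$keyword\E/;
    }
    close $fh;
    return @hits;
}

my $sample = "<p>para one</p>\n<p>para\ntwo</p>\n<p>para three</p>\n";
print "$_\n" for closed_paragraphs_matching($sample, 'three');
```

The matching paragraphs come back with both tags intact, so they can be printed straight into a results page.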
Regards,
--
Neil Hoggarth Departmental Computer Officer
<neil.hoggarth at physiol.ox.ac.uk> Laboratory of Physiology
http://www.physiol.ox.ac.uk/~njh/ University of Oxford, UK