OxPM: Search and Extract

Julian Martin julianmartin at ntlworld.com
Thu Dec 5 05:47:03 CST 2002


Thanks, Neil!

----- Original Message -----
From: "Neil Hoggarth" <neil.hoggarth at physiol.ox.ac.uk>
To: <oxford-pm-list at happyfunball.pm.org>
Sent: Thursday, December 05, 2002 11:22 AM
Subject: Re: OxPM: Search and Extract


> On Thu, 5 Dec 2002, Julian Martin wrote:
>
> > I would like to search some html pages for a keyword and then extract
> > the <p>blah, blah......keyword........blah</p> and then put the
> > <p>blah, blah......keyword........blah</p>'s into a results page. Any
> > pointers would be great! I have the Perl Cookbook for reference but
> > cannot find anything like this in it.
>
> You could set the input record separator ("$/", perldoc perlvar for
> info) to "<p>", then the kind of while(<>) loop that would normally
> process input line-by-line will work paragraph by paragraph. The only
> wrinkle would be that the <p> tags would be regarded as the end of the
> preceding record rather than part of the paragraph that they start, so
> given input like:
>
> <p>para one</p>
> <p>para
> two</p>
> <p>para three</p>
>
> the records in $_ in successive loops would be:
>
> 1. <p>
>
> 2. para one</p>
>    <p>
>
> 3. para
>    two</p>
>    <p>
>
> 4. para three</p>
>
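To see the record boundaries described above, here is a minimal sketch that reads the same three-paragraph sample through an in-memory filehandle with $/ set to "<p>". The variable names ($html, @records) are only illustrative, and reading from a string rather than a real file is an assumption made to keep the example self-contained.

```perl
use strict;
use warnings;

# Sample input taken from the message above.
my $html = "<p>para one</p>\n<p>para\ntwo</p>\n<p>para three</p>\n";

my @records;
{
    local $/ = "<p>";    # each read now ends at the next "<p>", not at a newline
    open my $fh, '<', \$html or die "cannot open in-memory file: $!";
    while (my $record = <$fh>) {
        push @records, $record;
    }
    close $fh;
}

# Print each record in brackets so the boundaries are visible.
my $n = 0;
for my $record (@records) {
    $n++;
    print "$n. [$record]\n";
}
```

This yields four records, with each "<p>" attached to the end of the record before it. Calling chomp on a record would strip that trailing "<p>", since chomp removes whatever $/ is currently set to.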
> If you know that all the HTML that you will be dealing with will be
> sufficiently well formed then you could use "</p>" as your record
> separator. A lot of HTML in the wild lacks closing tags where browsers
> don't require them though.
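
For the original question (pull out the paragraphs that mention a keyword and put them on a results page), the "</p>" variant could be sketched roughly as below. This assumes every paragraph is closed with a lowercase "</p>"; paragraphs_matching is a hypothetical helper name, and the sample HTML and the bare-bones results page are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return each <p>...</p> block in $html that
# contains $keyword. Assumes every paragraph has a lowercase "</p>".
sub paragraphs_matching {
    my ($html, $keyword) = @_;
    my @hits;
    local $/ = "</p>";    # each record now ends with a closing tag
    open my $fh, '<', \$html or die "cannot open in-memory file: $!";
    while (my $record = <$fh>) {
        # Keep only the paragraph itself, dropping anything that
        # precedes its opening tag (headings, whitespace, and so on).
        next unless $record =~ m{(<p\b[^>]*>.*</p>)\z}is;
        my $para = $1;
        # Literal, case-insensitive keyword match.
        push @hits, $para if $para =~ /\Q$keyword\E/i;
    }
    close $fh;
    return @hits;
}

# Made-up sample page for the demonstration.
my $html = "<h1>Title</h1>\n<p>para one</p>\n"
         . "<p>para\ntwo has the keyword</p>\n<p>para three</p>\n";

my @found = paragraphs_matching($html, "keyword");

# A very plain results page: just the matching paragraphs in a body.
print "<html><body>\n", join("\n", @found), "\n</body></html>\n";
```

The keyword test here also matches text inside tag attributes, and unclosed paragraphs are missed entirely, so for messy real-world pages a proper parser such as HTML::Parser from CPAN is likely the safer route.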
>
> Regards,
> --
> Neil Hoggarth                                 Departmental Computer Officer
> <neil.hoggarth at physiol.ox.ac.uk>                   Laboratory of Physiology
> http://www.physiol.ox.ac.uk/~njh/                  University of Oxford, UK



