From julianmartin at ntlworld.com Thu Dec 5 05:00:28 2002
From: julianmartin at ntlworld.com (Julian Martin)
Date: Thu Aug 5 00:08:36 2004
Subject: OxPM: Search and Extract
Message-ID: <005901c29c4d$857ecdd0$67980050@DADDYMDDR2G0O3>

Hi

I would like to search some html pages for a keyword and then extract the

<p>blah, blah......keyword........blah</p>

and then put the

<p>blah, blah......keyword........blah</p>

's into a results page. Any pointers would be great! I have the Perl
Cookbook for reference but cannot find anything like this in it.

Thanks
Julian.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.pm.org/archives/oxford-pm/attachments/20021205/bca7e68e/attachment.htm

From Kavanagm at oup.co.uk Thu Dec 5 05:15:31 2002
From: Kavanagm at oup.co.uk (KAVANAGH, Michael)
Date: Thu Aug 5 00:08:36 2004
Subject: OxPM: Search and Extract
Message-ID: <852ED745A1B8D411BBD900B0D0789E0C08A1CB98@EXC05.oup.co.uk>

Hi Julian:

Have you tried CPAN? HTML::Index::Search

-Mike Kavanagh

-----Original Message-----
From: Julian Martin [mailto:julianmartin@ntlworld.com]
Sent: Thursday, December 05, 2002 11:00 AM
To: oxford-pm-list@happyfunball.pm.org
Subject: OxPM: Search and Extract

Hi

I would like to search some html pages for a keyword and then extract the

<p>blah, blah......keyword........blah</p>

and then put the

<p>blah, blah......keyword........blah</p>

's into a results page. Any pointers would be great! I have the Perl
Cookbook for reference but cannot find anything like this in it.

Thanks
Julian.

From Kevin.ADM-Gibbs at Alcan.Com Thu Dec 5 05:18:56 2002
From: Kevin.ADM-Gibbs at Alcan.Com (Kevin.ADM-Gibbs@Alcan.Com)
Date: Thu Aug 5 00:08:36 2004
Subject: OxPM: Search and Extract
Message-ID: 

Julian,

If you need to download the web page you should have a look at the LWP
module. The LWP cookbook in the standard distribution gives examples of
how to use the module. Once you have the file you can use the
HTML::Parser module to check the tags (
<p>) you are interested in. You'll then need to use regular expressions
to determine if the text contains your keyword. Alternatively you could
use regular expressions to do the whole thing, but that could be
trickier.

Cheers,
Kev.

"Julian Martin" <julianmartin@ntlworld.com>
Sent by: owner-oxford-pm-list@pm.org
05/12/2002 11:00
Please respond to oxford-pm-list
cc:
Subject: OxPM: Search and Extract

Hi

I would like to search some html pages for a keyword and then extract the

<p>blah, blah......keyword........blah</p>

and then put the

<p>blah, blah......keyword........blah</p>

's into a results page. Any pointers would be great! I have the Perl
Cookbook for reference but cannot find anything like this in it.

Thanks
Julian.

From neil.hoggarth at physiol.ox.ac.uk Thu Dec 5 05:22:40 2002
From: neil.hoggarth at physiol.ox.ac.uk (Neil Hoggarth)
Date: Thu Aug 5 00:08:36 2004
Subject: OxPM: Search and Extract
In-Reply-To: <005901c29c4d$857ecdd0$67980050@DADDYMDDR2G0O3>
References: <005901c29c4d$857ecdd0$67980050@DADDYMDDR2G0O3>
Message-ID: 

On Thu, 5 Dec 2002, Julian Martin wrote:

> I would like to search some html pages for a keyword and then extract
> the
> <p>blah, blah......keyword........blah</p> and then put the
> <p>blah, blah......keyword........blah</p>'s into a results page. Any
> pointers would be great! I have the Perl Cookbook for reference but
> cannot find anything like this in it.

You could set the input record separator ("$/", perldoc perlvar for
info) to "<p>", then the kind of while(<>) loop that would normally
process input line-by-line will work paragraph by paragraph. The only
wrinkle would be that the <p> tags would be regarded as the end of the
preceding record rather than part of the paragraph that they start, so
given input like:

<p>para one</p>
<p>para two</p>
<p>para three</p>

the records in $_ in successive loops would be:

1. <p>
2. para one</p>
   <p>
3. para two</p>
   <p>
4. para three</p>

If you know that all the HTML that you will be dealing with will be
sufficiently well formed then you could use "</p>" as your record
separator. A lot of HTML in the wild lacks closing tags where browsers
don't require them though.

Regards,
--
Neil Hoggarth                           Departmental Computer Officer
                                        Laboratory of Physiology
http://www.physiol.ox.ac.uk/~njh/       University of Oxford, UK

From kake at earth.li Thu Dec 5 05:44:08 2002
From: kake at earth.li (Kate L Pugh)
Date: Thu Aug 5 00:08:36 2004
Subject: OxPM: Search and Extract
In-Reply-To: <005901c29c4d$857ecdd0$67980050@DADDYMDDR2G0O3>; from julianmartin@ntlworld.com on Thu, Dec 05, 2002 at 11:00:28AM -0000
References: <005901c29c4d$857ecdd0$67980050@DADDYMDDR2G0O3>
Message-ID: <20021205114408.A19733@ox.compsoc.net>

On Thu 05 Dec 2002, Julian Martin wrote:
> I would like to search some html pages for a keyword and then
> extract the
> <p>blah, blah......keyword........blah</p> and then put
> the <p>blah, blah......keyword........blah</p>'s into a results page.

HTML::PullParser is nice for parsing HTML. See
  http://search.cpan.org/author/GAAS/HTML-Parser-3.26/lib/HTML/PullParser.pm
and see it in action in the format sub of CGI::Wiki at
  http://search.cpan.org/src/KAKE/CGI-Wiki-0.05/lib/CGI/Wiki.pm

You could probably do something like the following (untested).

------------------------------------------------------------
my ( $buffer, $matched, @results );

my $parser = HTML::PullParser->new( doc   => $html_page_content,
                                    start => '"START", tag, text',
                                    end   => '"END", tag, text',
                                    text  => '"TEXT", tag, text' );

while ( my $token = $parser->get_token ) {
    my ( $flag, $tag, $text ) = @$token;
    if ( $flag eq "START" and lc($tag) eq "p" ) { # start of a new paragraph
        # If the current buffer matched, add it to the results.
        push @results, $buffer if $matched;
        # Reinitialise the buffer and reset the "matched" flag.
        $buffer  = "";
        $matched = 0;
    }
    $buffer .= $text; # whatever this token is, we want it in the buffer
    # Put your keyword matching stuff in here, and set $matched to 1
    # if it does match.
}

# @results should now be an array of strings, each one containing the
# HTML for a paragraph which matched your keywords, and it should be
# in the order in which the paragraphs appeared on the page.
# It won't have the very last paragraph if that matched, though - see below.
------------------------------------------------------------

But I just wrote that off the top of my head, so don't just cut and
paste it blindly; read the docs and check it does what I think it does.

You'll also want to add something that checks $matched and pushes the
relevant stuff onto @results after the *last* occurrence of <p> in the
page, because as it stands it only updates @results when it sees a <p>
- either be clever inside the while loop, or do something after it
finishes. You won't just want to push $buffer on, because it will
contain "</body></html>" or similar. Left as an exercise cos I only
just thought of it.

Also note that you probably can't count on the page you're parsing
having well-formed HTML in it, so make sure you write tests for edge
cases.

There might be a simpler way to do what you want, though, and I think
you might possibly be going about it the wrong way by looking at the
document as a set of paragraphs (unless you really can rely on it being
well-structured into short paragraphs). Think about things like what if
the entire page is one long paragraph? You'd get the whole page
returned. Then again most screenscrapers do rely on assumptions, so
you're not alone :)

Kake
who wrote a screenscraper for a wiki
--
http://www.earth.li/~kake/cookery/ - vegan recipes, now with new search feature
http://grault.net/grubstreet/ - the open-source guide to London
http://www.penseroso.com/ - websites for the fine art and antique trade

From julianmartin at ntlworld.com Thu Dec 5 05:47:03 2002
From: julianmartin at ntlworld.com (Julian Martin)
Date: Thu Aug 5 00:08:36 2004
Subject: OxPM: Search and Extract
References: <005901c29c4d$857ecdd0$67980050@DADDYMDDR2G0O3>
Message-ID: <008901c29c54$079e3570$67980050@DADDYMDDR2G0O3>

Thanks Neil!

----- Original Message -----
From: "Neil Hoggarth"
To: 
Sent: Thursday, December 05, 2002 11:22 AM
Subject: Re: OxPM: Search and Extract

> On Thu, 5 Dec 2002, Julian Martin wrote:
>
> > I would like to search some html pages for a keyword and then extract
> > the
> > <p>blah, blah......keyword........blah</p> and then put the
> > <p>blah, blah......keyword........blah</p>'s into a results page. Any
> > pointers would be great! I have the Perl Cookbook for reference but
> > cannot find anything like this in it.
>
> You could set the input record separator ("$/", perldoc perlvar for
> info) to "<p>", then the kind of while(<>) loop that would normally
> process input line-by-line will work paragraph by paragraph. The only
> wrinkle would be that the <p> tags would be regarded as the end of the
> preceding record rather than part of the paragraph that they start, so
> given input like:
>
> <p>para one</p>
> <p>para two</p>
> <p>para three</p>
>
> the records in $_ in successive loops would be:
>
> 1. <p>
> 2. para one</p>
>    <p>
> 3. para two</p>
>    <p>
> 4. para three</p>
>
> If you know that all the HTML that you will be dealing with will be
> sufficiently well formed then you could use "</p>" as your record
> separator. A lot of HTML in the wild lacks closing tags where browsers
> don't require them though.
>
> Regards,
> --
> Neil Hoggarth                           Departmental Computer Officer
>                                         Laboratory of Physiology
> http://www.physiol.ox.ac.uk/~njh/       University of Oxford, UK

From kake at earth.li Thu Dec 5 06:25:56 2002
From: kake at earth.li (Kate L Pugh)
Date: Thu Aug 5 00:08:36 2004
Subject: OxPM: Search and Extract
In-Reply-To: <00a101c29c56$0714f650$67980050@DADDYMDDR2G0O3>; from julianmartin@ntlworld.com on Thu, Dec 05, 2002 at 12:01:21PM -0000
References: <005901c29c4d$857ecdd0$67980050@DADDYMDDR2G0O3> <20021205114408.A19733@ox.compsoc.net> <00a101c29c56$0714f650$67980050@DADDYMDDR2G0O3>
Message-ID: <20021205122556.A21608@ox.compsoc.net>

On Thu 05 Dec 2002, Julian Martin wrote:
> I do know the pages are well formed cause I made them : ) They exist
> as a set of contact detail pages a.htm - z.htm. Rather than put all
> the info into a database ( unless you know an easy way ) I thought
> it would be a neat way to find say all organisations in Oxford as
> they are listed by organisation name in the pages.
> What do you think ?

I think you might be doing it backwards. Could you start with the raw
data, store it somewhere, and auto-generate the static pages from that?

You might want to look at DBD::SQLite
  http://search.cpan.org/author/MSERGEANT/DBD-SQLite-0.21/
if you do decide to use a database - it's a driver for a self-contained
relational database that fits in a text file.

For auto-generating the pages, there are loads of solutions. I like the
Template Toolkit
  http://search.cpan.org/author/ABW/Template-Toolkit-2.08/
  http://tt2.org/
but there are many ways to do it, HTML::Template for example:
  http://search.cpan.org/author/SAMTREGAR/HTML-Template-2.6/

The important thing to remember is that you can auto-generate static
pages too; you don't need to use CGI. Just make sure the output of your
generator goes to a file in the right place.
The bloggers call this "baking", I think, as opposed to "frying", which
is generating dynamically. Though I think the person who made up that
analogy must not cook very much.

All these modules flying around are reminding me of the Perl Advent
Calendar at http://perladvent.org/ - do take a look.

Kake
--
http://www.earth.li/~kake/cookery/ - vegan recipes, now with new search feature
http://grault.net/grubstreet/ - the open-source guide to London
http://www.penseroso.com/ - websites for the fine art and antique trade

From kake at earth.li Thu Dec 5 06:43:33 2002
From: kake at earth.li (Kate L Pugh)
Date: Thu Aug 5 00:08:36 2004
Subject: OxPM: Search and Extract
In-Reply-To: ; from neil.hoggarth@physiol.ox.ac.uk on Thu, Dec 05, 2002 at 11:22:40AM +0000
References: <005901c29c4d$857ecdd0$67980050@DADDYMDDR2G0O3>
Message-ID: <20021205124333.B21608@ox.compsoc.net>

On Thu 05 Dec 2002, Neil Hoggarth wrote:
> You could set the input record separator ("$/", perldoc perlvar for
> info) to "<p>", then the kind of while(<>) loop that would normally
> process input line-by-line will work paragraph by paragraph.

Ooh, that's as cunning as a very cunning thing, and much simpler than
my overengineered solution. Probably best to do it as a local, though;
saves the hassle of setting it back.

    # blah blah blah code
    {
        local $/ = "<p>";
        # while loop and processing here
    }
    # more code, with the normal input record separator

Or you could live dangerously and assume that since your script doesn't
currently do any reading-in of data later on that it never will (and
that it isn't going to live on for ever and ever and get
edited/maintained by people who have no idea what $/ means and can't be
bothered to look it up[0]).

I was going to put a link here to the neat thing I saw on Perlmonks
that helps you remember which way round $/ and $\ go, but I can't find
it now. Basically it used the mnemonic I/O and you have to imagine a
raindrop falling down the slash - if it's / then it'll fall into I so
you know $/ is the input record separator. If it's \ then it'll fall
into O so you know $\ is the output record separator.

Other things I have used $/ for recently include reading in SQL
commands through the filehandle - setting it to "\n\n" so I can wrap my
SQL commands nicely. See for example
http://search.cpan.org/src/KAKE/CGI-Wiki-0.05/lib/CGI/Wiki/Setup/MySQL.pm

Kake
[0] 'perldoc perlvar' for the bemused - there, you have no excuse now.
Search it for INPUT_RECORD_SEPARATOR and you'll get the right section.
--
http://www.earth.li/~kake/cookery/ - vegan recipes, now with new search feature
http://grault.net/grubstreet/ - the open-source guide to London
http://www.penseroso.com/ - websites for the fine art and antique trade

From julianmartin at ntlworld.com Thu Dec 5 07:03:27 2002
From: julianmartin at ntlworld.com (Julian Martin)
Date: Thu Aug 5 00:08:36 2004
Subject: OxPM: Search and Extract
References: <005901c29c4d$857ecdd0$67980050@DADDYMDDR2G0O3> <20021205124333.B21608@ox.compsoc.net>
Message-ID: <00c801c29c5e$b3df2600$67980050@DADDYMDDR2G0O3>

Thank you all for your help with this. Ahhh, it all sounds so easy now
(a sign of certain disaster).
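Pulling the thread's suggestions together, the record-separator approach might look like the following minimal, untested sketch. The sample page content, the keyword 'Oxford', and the in-memory filehandle are all invented stand-ins for Julian's real a.htm-z.htm files:

```perl
use strict;
use warnings;

# Stand-in for one of the real contact pages (a.htm .. z.htm).
my $page = "<html><body>\n"
         . "<p>Acme Widgets, Banbury\n"
         . "<p>Bodleian Books, Oxford\n"
         . "<p>Carfax Cameras, Oxford\n"
         . "</body></html>\n";

my $keyword = 'Oxford';
my @matches;

open my $fh, '<', \$page or die $!;  # in-memory filehandle (perl >= 5.8)
{
    local $/ = "<p>";                # each read now ends at the next <p> tag
    while ( my $chunk = <$fh> ) {
        chomp $chunk;                # chomp strips the trailing "<p>" separator
        push @matches, $chunk if $chunk =~ /\Q$keyword\E/;
    }
}
close $fh;

# Note Neil's caveat: the final chunk runs to end-of-file, so it drags
# in the closing </body></html> markup as well.
print scalar(@matches), " matching chunk(s)\n";
print "$_\n" for @matches;
```

Note that `chomp` uses the current value of `$/`, so inside the block it removes the trailing "<p>" rather than a newline.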
I fear I'm a little way back down the learning curve compared to you
guys, so I am sure it won't take long for you all to become annoyed at
my silly questions, but thanks anyway (especially for the vegan recipes,
Kate : ).

Julian.

----- Original Message -----
From: "Kate L Pugh"
To: 
Sent: Thursday, December 05, 2002 12:43 PM
Subject: Re: OxPM: Search and Extract

> On Thu 05 Dec 2002, Neil Hoggarth wrote:
> > You could set the input record separator ("$/", perldoc perlvar for
> > info) to "<p>", then the kind of while(<>) loop that would normally
> > process input line-by-line will work paragraph by paragraph.
>
> Ooh, that's as cunning as a very cunning thing, and much simpler than
> my overengineered solution. Probably best to do it as a local, though;
> saves the hassle of setting it back.
>
>     # blah blah blah code
>     {
>         local $/ = "<p>";
>         # while loop and processing here
>     }
>     # more code, with the normal input record separator
>
> Or you could live dangerously and assume that since your script
> doesn't currently do any reading-in of data later on that it never
> will (and that it isn't going to live on for ever and ever and get
> edited/maintained by people who have no idea what $/ means and can't
> be bothered to look it up[0]).
>
> I was going to put a link here to the neat thing I saw on Perlmonks
> that helps you remember which way round $/ and $\ go, but I can't
> find it now. Basically it used the mnemonic I/O and you have to
> imagine a raindrop falling down the slash - if it's / then it'll fall
> into I so you know $/ is the input record separator. If it's \ then
> it'll fall into O so you know $\ is the output record separator.
>
> Other things I have used $/ for recently include reading in SQL
> commands through the filehandle - setting it to "\n\n" so I can
> wrap my SQL commands nicely. See for example
> http://search.cpan.org/src/KAKE/CGI-Wiki-0.05/lib/CGI/Wiki/Setup/MySQL.pm
>
> Kake
> [0] 'perldoc perlvar' for the bemused - there, you have no excuse now.
> Search it for INPUT_RECORD_SEPARATOR and you'll get the right section.
> --
> http://www.earth.li/~kake/cookery/ - vegan recipes, now with new search feature
> http://grault.net/grubstreet/ - the open-source guide to London
> http://www.penseroso.com/ - websites for the fine art and antique trade

From pete at clueball.com Thu Dec 5 07:40:10 2002
From: pete at clueball.com (Peter Sergeant)
Date: Thu Aug 5 00:08:36 2004
Subject: OxPM: perlworkshop.de
Message-ID: <20021205134010.GC18898@grou.ch>

Hi there,

Anyone else planning to come to the German Perl Workshop? Judging by my
experiences there last year, about half the talks are in English, so
even if, like me, you're non-German-speaking, you'll probably still have
a great time! You'll also get to see me talk on "Detecting Email-borne
Viruses with Perl"...
On that note, Alex and I, and probably Anthony, will be going to The
Turf next Thursday and describing it as an Oxford.pm meet, so you're all
very welcome! Chances are we'll meet up at about 7.

+Pete

From pete at clueball.com Fri Dec 13 03:16:57 2002
From: pete at clueball.com (Peter Sergeant)
Date: Thu Aug 5 00:08:36 2004
Subject: OxPM: Social meet write-up
Message-ID: <20021213091657.GA21585@grou.ch>

Thanks to the people who turned up last night - to those who didn't:
commiserations, The Turf was full of Oxford interviewees who fed us
marshmallows. :-)

The next meet will be in about a month's time, on a Thursday.

+Pete

From kake at earth.li Fri Dec 20 11:57:19 2002
From: kake at earth.li (Kate L Pugh)
Date: Thu Aug 5 00:08:36 2004
Subject: OxPM: OUP Perl job from jobs.perl.org
Message-ID: <20021220175719.B11047@ox.compsoc.net>

I expect many of you know about this already, and it's being mocked on
IRC for spelling Perl as 'PERL', but here goes in case anyone is
interested.

Kake

----- Forwarded message from Perl Jobs -----

Online URL for this job: http://jobs.perl.org/job/569

To subscribe to this list, send mail to jobs-subscribe@perl.org.
To unsubscribe, send mail to jobs-unsubscribe@perl.org.

Posted: December 20, 2002
Job title: Online developer
Company name: Oxford University Press
Location: United Kingdom, Oxford
Pay rate: to £30,000
Travel: 0%
Terms of employment: Salaried employee
Hours: Full time
Onsite: yes

Description: Online developer responsible for development and
maintenance of 9 OUP Journals databases which create and update the
OUP Journals website and its search engine.

Required skills: PERL, Apache PHP, Unix - Sun Solaris and Linux, MySQL

Desired skills: Good communication skills

Contact information:
Pam Sutherland
Oxford University Press
Great Clarendon Street, Oxford OX2 6DP
United Kingdom
sutherlp@oup.co.uk

----- End forwarded message -----
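As a closing illustration of the "baking" idea Kate describes earlier in the thread (generate the static result pages from raw data, rather than screen-scraping them back), here is a minimal, untested sketch. The %contacts data and the output markup are invented; a real version would likely keep the rows in DBD::SQLite and render them through Template Toolkit or HTML::Template:

```perl
use strict;
use warnings;

# Invented raw data; in real life this might live in a DBD::SQLite file.
my %contacts = (
    'Acme Widgets'   => 'Banbury',
    'Bodleian Books' => 'Oxford',
    'Carfax Cameras' => 'Oxford',
);

# "Bake" one static results page (organisations in Oxford): build the
# HTML as a string, then write it to a file instead of serving it via CGI.
my $html = "<html><body>\n";
for my $name ( sort keys %contacts ) {
    next unless $contacts{$name} eq 'Oxford';
    $html .= "<p>$name, $contacts{$name}</p>\n";
}
$html .= "</body></html>\n";

print $html;   # a real baker would write this to e.g. oxford.html
```

The point of baking is that the expensive work happens once at generation time, so the served pages are plain files.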