[Phoenix-pm] Meeting yesterday (June 30) wrapup

Michael Friedman friedman at highwire.stanford.edu
Fri Jul 1 13:02:05 PDT 2005


I may have said something, but it was too quiet for anyone but Bob to  
hear.

We do all our HTML parsing using HTML::Parser and some wrapping  
objects. Since we need to pick out particular elements of meta-data  
which may be tagged several different ways, we found that using an  
event model was easier to work with.
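
For anyone who wasn't there, here is a rough sketch of what I mean by the event model. It is written in Python with the standard library's html.parser rather than our actual HTML::Parser wrappers, since it follows the same start-tag/text/end-tag callback style. The meta names in TITLE_META_NAMES and the TitleFinder and find_titles names are made up for illustration; they are not what we use in production.

```python
from html.parser import HTMLParser

# Hypothetical <meta name="..."> spellings that all mean "title".
TITLE_META_NAMES = {"citation_title", "dc.title"}


class TitleFinder(HTMLParser):
    """Collects candidate titles from <meta> tags and the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.candidates = []  # (source, value) pairs in document order

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names, but not values.
        attrs = dict(attrs)
        if tag == "meta":
            name = (attrs.get("name") or "").lower()
            if name in TITLE_META_NAMES and attrs.get("content"):
                self.candidates.append((name, attrs["content"]))
        elif tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Text events only matter while we are inside <title>.
        if self.in_title and data.strip():
            self.candidates.append(("title", data.strip()))


def find_titles(html):
    """Feed one page through the parser and return every candidate title."""
    parser = TitleFinder()
    parser.feed(html)
    parser.close()
    return parser.candidates
```

Each handler only looks at the one event it cares about, so supporting another way of tagging the same field usually means adding a name to the set or one more branch, not restructuring the parse.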

Also, if you're doing large-scale spidering, another hint is to make  
one process that just goes and grabs the pages and a separate one to  
parse them and deal with their contents. That way you keep the scripts  
smaller and more focused on the individual jobs -- at the price of  
extra disk space, of course. :-) For small scale stuff like Scott's  
example, it's easy to leave the two together.
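
A minimal sketch of that split, again in Python and with made-up names (spool_pages, process_spool, and the sha1-based file naming are all assumptions for illustration). In practice the two functions would live in two separate scripts that share only the spool directory on disk.

```python
import hashlib
import pathlib
import urllib.request


def fetch_url(url):
    """Download one page and return its raw bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def spool_pages(urls, spool_dir, get=fetch_url):
    """Fetcher job: save each page to disk untouched, one file per URL."""
    spool = pathlib.Path(spool_dir)
    spool.mkdir(parents=True, exist_ok=True)
    for url in urls:
        # Hash the URL to get a filesystem-safe, repeatable file name.
        name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".html"
        (spool / name).write_bytes(get(url))


def process_spool(spool_dir, handle):
    """Parser job: read saved pages back and pass each one to `handle`."""
    results = []
    for path in sorted(pathlib.Path(spool_dir).glob("*.html")):
        text = path.read_text(encoding="utf-8", errors="replace")
        results.append(handle(text))
    return results
```

The fetcher never looks inside a page and the parser never touches the network, so the parsing step can be rerun against the saved files without fetching anything again. The saved copies are where the extra disk space goes.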

-- Mike

On Jul 1, 2005, at 12:55 PM, Scott Walters wrote:

> Hi everyone,
>
> Thanks for coming, and thanks for listening to me talk. As usual, I
> talked for longer than I meant to. Whoops. Nello's was unusually busy.
> We'll probably do Nello's again, but not for a while, and not for code
> presentations... regardless, I think having food at the meetings made
> things a lot easier (we're all busy people) so I think I'll see about
> ordering out for pizza for our regular meetings, wherever they wind up.
>
> Good to meet all of the new people. I'm sorry I didn't get a chance to
> chat with you guys more and I hope you'll be back. It's hard to get to
> know people in two hours with so much chaos.
>
> I completely forgot to give the door prize, CGI Programming with Perl,
> to Brock to give away. D'oh! Next meeting, we'll just have to have two
> door prizes. Sorry to everyone who only came because of the door prize.
> Next time, remind me, or Brock, or someone.
>
> Er, ehm, without further ado, here's yafro.pl.
>
> Again, you shouldn't use http.pm or TransientBaby -- they're for
> educational purposes only. If you actually do any Web scraping, use
> HTML::Parser, HTML::TableExtract, or something sane. Which means you'll
> have to modify this to use another HTML parser. That shouldn't be hard
> to do if you use an event-based one.
>
> I was expecting people to chime in and comment on how *they* scraped
> Web content but instead Michael just gave a lot of examples of how he
> *blocks* robots. Heh, heh, heh.
>
> For the benefit of people not at the meeting, here are a few comments
> on the code: it would have been easier to just extract all images with
> URLs matching a certain pattern, and the line
> *get_page = http::generate_get_page; is odd and would have been better
> done with the Exporter module.
>
> Okay. Talk to y'all later.
> -scott
>
> On  0, Brock <awwaiid at thelackthereof.org> wrote:
>>
>> We had a lovely meeting yesterday, with 10 people enjoying dinner and
>> perl-talk at Nello's Pizza. In addition to random perl-related
>> conversation, Mike Friedman spoke of HighWire Press [1] and their use
>> of Perl, and Scott Walters told us about how he does web-scraping (and
>> he will post code soon we hope :) ).
>>
>> We'll have the website up soon enough. In the meantime keep an eye out
>> here for the next meeting topic/time/location, which I will have
>> picked out within the next two weeks. Please send topic requests and
>> volunteership here to the list. I like the idea of doing two talks
>> like we did this time so that we can cover a potentially wider range
>> of interest and experience.
>>
>> Have a good (long) weekend!
>> --Brock
>>
>>   [1] http://highwire.stanford.edu/
>>
>> _______________________________________________
>> Phoenix-pm mailing list
>> Phoenix-pm at pm.org
>> http://mail.pm.org/mailman/listinfo/phoenix-pm
> <yafro.pl><webscraping_phoenix_pm.txt>
---------------------------------------------------------------------
Michael Friedman                  HighWire Press, Stanford Southwest
Phone: 480-456-0880                                   Tempe, Arizona
FAX:   270-721-8034                  <friedman at highwire.stanford.edu>
---------------------------------------------------------------------


