[Phoenix-pm] Meeting yesterday (June 30) wrapup
Michael Friedman
friedman at highwire.stanford.edu
Fri Jul 1 13:02:05 PDT 2005
I may have said something, but it was too quiet for anyone but Bob to
hear.
We do all our HTML parsing using HTML::Parser and some wrapping
objects. Since we need to pick out particular elements of meta-data
which may be tagged several different ways, we found that using an
event model was easier to work with.
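
To make the event-model point concrete, here is a minimal sketch. It is written in Python with the standard library's html.parser, which exposes the same start-tag / end-tag / text callbacks that HTML::Parser does; it is not our production code. The meta names (dc.title, citation_title) and the fallback to <title> are assumptions chosen for illustration, and TitleFinder and find_title are hypothetical names.

```python
from html.parser import HTMLParser

# Hypothetical set of <meta name="..."> spellings that all mean "title".
TITLE_META_NAMES = {"dc.title", "citation_title"}


class TitleFinder(HTMLParser):
    """Collect a document title from whichever tagging style the page uses."""

    def __init__(self):
        super().__init__()
        self.meta_title = None   # value taken from a matching <meta> tag
        self.tag_title = None    # fallback text taken from <title>
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            name = (attrs.get("name") or "").lower()
            # Keep the first matching <meta>; later ones are ignored.
            if name in TITLE_META_NAMES and self.meta_title is None:
                self.meta_title = attrs.get("content")
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text can arrive in several chunks, so accumulate it.
        if self._in_title:
            self.tag_title = (self.tag_title or "") + data

    @property
    def title(self):
        if self.meta_title:
            return self.meta_title
        return self.tag_title.strip() if self.tag_title else None


def find_title(html):
    """Run the event-driven parser over one page and return its title."""
    parser = TitleFinder()
    parser.feed(html)
    parser.close()
    return parser.title
```

Supporting another tagging style then means adding one more branch to a handler instead of reworking a regex or walking a full tree.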
Also, if you're doing large-scale spidering, another hint is to make
one process that just goes and grabs the pages and a separate one to
parse them and deal with their contents. That way you keep the scripts
smaller and more focused on the individual jobs -- at the price of
extra disk space, of course. :-) For small-scale stuff like Scott's
example, it's easy to leave the two together.
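
The fetch/parse split can be sketched roughly as below, again in Python using only the standard library. This is an illustration under assumptions, not a description of any real setup: the spool directory layout, the SHA-1 file naming, and the names http_get, spool_pages, and parse_spool are all hypothetical. In practice the two halves would live in separate scripts run as separate processes; they are shown as two functions here so the sketch stays self-contained.

```python
import hashlib
from pathlib import Path
from urllib.request import urlopen


def http_get(url):
    """Default downloader: return the raw bytes of one page."""
    with urlopen(url, timeout=30) as resp:
        return resp.read()


def spool_pages(urls, spool_dir, get=http_get):
    """Fetcher job: download each URL and save the raw bytes, no parsing."""
    spool = Path(spool_dir)
    spool.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in urls:
        # Name each file after a hash of its URL so names are filesystem-safe.
        name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".html"
        path = spool / name
        path.write_bytes(get(url))
        saved.append(path)
    return saved


def parse_spool(spool_dir, handle_page):
    """Parser job: walk the spool and hand each saved page to handle_page."""
    results = []
    for path in sorted(Path(spool_dir).glob("*.html")):
        html = path.read_bytes().decode("utf-8", errors="replace")
        results.append(handle_page(html))
    return results
```

The saved files are where the extra disk space goes, but they also let you rerun the parser after a fix without fetching every page again.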
-- Mike
On Jul 1, 2005, at 12:55 PM, Scott Walters wrote:
> Hi everyone,
>
> Thanks for coming, and thanks for listening to me talk. As usual, I talked
> for longer than I meant to. Whoops. Nello's was unusually busy. We'll
> probably do Nello's again, but not for a while, and not for code
> presentations... regardless, I think having food at the meetings made
> things a lot easier (we're all busy people) so I think I'll see about
> ordering out for pizza for our regular meetings, wherever they wind
> up.
>
> Good to meet all of the new people. I'm sorry I didn't get a chance to
> chat with you guys more and I hope you'll be back. It's hard to get to
> know people in two hours with so much chaos.
>
> I completely forgot to give the door prize, CGI Programming with Perl, to
> Brock to give away. D'oh! Next meeting, we'll just have to have two
> door prizes.
> Sorry to everyone who only came because of the door prize. Next time,
> remind me, or Brock, or someone.
>
> Er, ehm, without further ado, here's yafro.pl.
>
> Again, you shouldn't use http.pm or TransientBaby -- they're for
> educational purposes only. If you actually do any Web scraping, use
> HTML::Parser, HTML::TableExtract, or something sane. Which means you'll
> have to modify this to use another HTML parser. That shouldn't be hard to
> do if you use an event-based one.
>
> I was expecting people to chime in and comment on how *they* scraped
> Web content but instead Michael just gave a lot of examples of how he
> *blocks* robots. Heh, heh, heh.
>
> For the benefit of people not at the meeting, here are a few comments on
> the code: it would have been easier to just extract all images with URLs
> matching a certain pattern, and the line
> *get_page = http::generate_get_page;
> is odd and would have been better done with the Exporter module.
>
> Okay. Talk to y'all later.
> -scott
>
> On 0, Brock <awwaiid at thelackthereof.org> wrote:
>>
>> We had a lovely meeting yesterday, with 10 people enjoying dinner and
>> perl-talk at Nello's Pizza. In addition to random perl-related
>> conversation, Mike Friedman spoke of HighWire Press [1] and their use of
>> Perl, and Scott Walters told us about how he does web-scraping (and he
>> will post code soon we hope :) ).
>>
>> We'll have the website up soon enough. In the meantime keep an eye out
>> here for the next meeting topic/time/location, which I will have picked
>> out within the next two weeks. Please send topic requests and
>> volunteership here to the list. I like the idea of doing two talks like
>> we did this time so that we can cover a potentially wider range of
>> interest and experience.
>>
>> Have a good (long) weekend!
>> --Brock
>>
>> [1] http://highwire.stanford.edu/
>>
>> _______________________________________________
>> Phoenix-pm mailing list
>> Phoenix-pm at pm.org
>> http://mail.pm.org/mailman/listinfo/phoenix-pm
> <yafro.pl><webscraping_phoenix_pm.txt>
---------------------------------------------------------------------
Michael Friedman HighWire Press, Stanford Southwest
Phone: 480-456-0880 Tempe, Arizona
FAX: 270-721-8034 <friedman at highwire.stanford.edu>
---------------------------------------------------------------------