[Edinburgh-pm] seeking collaboration for a scalable Perl web-scraping library

Antonio Bonifati antonio.bonifati at gmail.com
Sun Mar 13 12:45:25 PDT 2011


> Although I'd still prefer something that could embed or wrap webkit or gecko
> with enough glue to make them accessible from Perl and enough hooks to be able
> to (ab)use the parsers & rendering/js engines.  Especially the HTML5 parser.

There is already a CPAN module for that: WWW::Mechanize::Firefox and maybe others
for other browsers like Microzoz Internet Exploder.

The trouble with screen scrapers is that they do not scale, and they are less
stable and slower than headless browsers. In return they emulate every browser
quirk perfectly (misbehaviour and crashes included), since you are actually
using a real browser.

But take my company as an example: they need to run hundreds of scrapers in
parallel every night on a cheap server. What would happen with one Firefox
holding 100 tabs or, worse, with 100 separate Firefox instances? They would use
a lot of memory and CPU time just rendering pages that nobody will ever see.

Moreover, in some of our products we need to do real-time server-side scraping,
so we have to be quicker and lighter than a GUI browser.
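
To make the scaling point concrete, here is a minimal sketch of how such a
nightly batch could run as plain processes with a cap on how many are alive at
once, using core Perl only. The names scrape_one and run_pool, the pool size of
10 and the siteN job names are all made up for illustration. scrape_one is only
a placeholder for where a real worker would fetch and parse one site without
rendering anything.

```perl
use strict;
use warnings;

# Hypothetical placeholder: a real worker would fetch and parse one site
# here (HTTP client plus HTML parser, no rendering engine involved).
sub scrape_one {
    my ($site) = @_;
    return length($site) ? 1 : 0;    # pretend success for non-empty names
}

# Run one child process per job, but never more than $max at once.
# Returns a hash of job name => child exit status (0 means success).
sub run_pool {
    my ($max, @jobs) = @_;
    my (%running, %status);

    # Block until any child exits and record how it ended.
    my $reap = sub {
        my $pid = wait();
        die "no child left to wait for" if $pid == -1;
        $status{ delete $running{$pid} } = $? >> 8;
    };

    for my $job (@jobs) {
        $reap->() while keys(%running) >= $max;    # wait for a free slot
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {                           # child process
            exit( scrape_one($job) ? 0 : 1 );
        }
        $running{$pid} = $job;                     # parent bookkeeping
    }
    $reap->() while %running;                      # drain remaining children
    return %status;
}

my %status = run_pool(10, map { "site$_" } 1 .. 100);
printf "%d scrapers finished, %d failed\n",
    scalar(keys %status), scalar(grep { $_ != 0 } values %status);
```

The cap is the whole point: on a cheap server the memory cost is roughly the
pool size times the footprint of one worker, so a light worker matters far more
than it would for a single interactive browser session.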

As for HTML5, I haven't tested it with HtmlUnit, but there is no documentation
about it. I suppose it is unsupported for now, especially the advanced features.
At the moment only a few sites make heavy use of HTML5 features, and I have
never come across one at my workplace. If it spreads quickly that would be a
problem, but according to the W3C it will become a recommendation in 2020, sigh!
HTML5 is a bit better structured than HTML4, but I bet that even with it
JavaScript usage will make a scraper's life difficult, because there is no neat
separation between content and presentation, is there?

In a nutshell, I would still use headless browsers for massive web scraping and
fall back to screen scraping only when unsupported features are required or when
CPU and memory footprint are not a concern.

-- 
regards
--
Antonio Bonifati (sysadmin, web programmer)
English mobile: +44 7400977350
BLOG: http://antonio-bonifati.blogspot.com
My profile: http://www.google.com/profiles/antonio.bonifati

skype: antonio.bonifati
msn: ant at venus.deis.unical.it
gtalk: antonio.bonifati at gmail.com
--
There are no hard distinctions between what is real and what is unreal, nor between what is true and what is false. A thing is not necessarily either true or false; it can be both true and false. 
Harold Pinter