[Melbourne-pm] Scraping Media Wiki

Alfie John alfiejohn at gmail.com
Tue Jan 12 20:01:46 PST 2010


On Wed, Jan 13, 2010 at 2:49 PM, <scottp at dd.com.au> wrote:

> I think it is a three part answer:
>
> * WWW::Mechanize or even just LWP to get the page
> * XML format may give you some benefits, such as Date modified
> * Then parse the content. There are a number of wiki text parsers on CPAN,
> none of them great, but most are OK. Converting to HTML may be your best bet;
> at least then it is in HTML table format.
>
> There are some Mediawiki API classes too, but I have not used them:
>
> * WWW::Mediawiki::Client
> * MediaWiki::API
>
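
For the API route mentioned above, MediaWiki::API talks to api.php directly
and hands back the wikitext along with revision metadata such as the
last-modified timestamp, so you get the "date modified" benefit without
parsing the XML export yourself. A rough, untested sketch based on its
synopsis (the api_url and page title are just placeholders):

  use strict;
  use warnings;
  use MediaWiki::API;

  # Point the client at the wiki's api.php endpoint (placeholder URL).
  my $mw = MediaWiki::API->new();
  $mw->{config}->{api_url} = 'http://example.org/w/api.php';

  # Fetch the raw wikitext plus revision details for one page.
  my $page = $mw->get_page( { title => 'Main Page' } )
      or die $mw->{error}->{code} . ': ' . $mw->{error}->{details};

  print "Last modified: $page->{timestamp}\n";  # revision timestamp
  print $page->{'*'};                           # the wikitext itself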

I agree with WWW::Mechanize.  But if you can't get any of the wiki parsers
working and your data is laid out consistently, you could try Template::Extract.
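
Something like the sketch below, which scrapes the rendered HTML with
WWW::Mechanize and pulls repeating rows out with Template::Extract. It's
untested, the URL and the template are made up for illustration, and the
template has to mirror the page's actual markup fairly closely for the match
to succeed:

  use strict;
  use warnings;
  use WWW::Mechanize;
  use Template::Extract;

  # Fetch the rendered page (placeholder URL).
  my $mech = WWW::Mechanize->new;
  $mech->get('http://example.org/wiki/Some_Page');

  # Describe the repeating chunk of markup you care about; named slots
  # are captured, [% ... %] skips anything you don't care about.
  my $template = <<'END';
  <table class="wikitable">[% FOREACH row %]
  <tr>
  <td>[% name %]</td>
  <td>[% value %]</td>
  </tr>[% ... %]
  [% END %]</table>
  END

  my $data = Template::Extract->new->extract( $template, $mech->content );

  for my $row ( @{ $data->{row} || [] } ) {
      print "$row->{name} => $row->{value}\n";
  }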

Alfie