[Melbourne-pm] Scraping Media Wiki

scottp at dd.com.au scottp at dd.com.au
Tue Jan 12 19:49:02 PST 2010

I think it is a three part answer:

* WWW::Mechanize, or even just LWP, to get the page
* The XML export format may give you some benefits, such as the date modified
* Then parse the content. There are a number of wikitext parsers on CPAN; none of them is great, but most are OK. Converting to HTML may be your best bet, since at least then the data is in HTML table format.
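For the fetch step, a minimal untested sketch with WWW::Mechanize might look like the following. The URL and page title are hypothetical placeholders; requesting `action=raw` from MediaWiki's index.php returns the raw wikitext rather than rendered HTML, which can be simpler to pull table cells out of:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

# Hypothetical wiki URL and page title -- substitute your own.
# action=raw asks MediaWiki for the raw wikitext of the page.
my $url = 'http://example.org/wiki/index.php?title=Some_Page&action=raw';

my $mech = WWW::Mechanize->new();
$mech->get($url);
die "Fetch failed: ", $mech->status, "\n" unless $mech->success;

# Crude pass over wikitext table rows, which look like "| cell1 || cell2".
for my $line (split /\n/, $mech->content) {
    next unless $line =~ /^\|[^\-+}]/;   # skip |-, |+ and |} table markup
    (my $row = $line) =~ s/^\|\s*//;
    my @cells = split /\s*\|\|\s*/, $row;
    print join("\t", @cells), "\n";
}
```

This only copes with simple one-row-per-line tables; for anything irregular you would want one of the proper wikitext parsers mentioned above.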

There are some Mediawiki API classes too, but I have not used them:

* WWW::Mediawiki::Client
* MediaWiki::API
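Going through the API instead of scraping rendered pages is worth a look. An untested sketch with MediaWiki::API (api_url and page title are hypothetical):

```perl
use strict;
use warnings;
use MediaWiki::API;

# Point this at your wiki's api.php endpoint (hypothetical URL).
my $mw = MediaWiki::API->new({ api_url => 'http://example.org/w/api.php' });

# get_page returns a hashref for the page; the wikitext lives in the
# '*' key, following the API's own response structure.
my $page = $mw->get_page({ title => 'Some_Page' })
    or die $mw->{error}->{code} . ': ' . $mw->{error}->{details};

print $page->{'*'};
```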

FYI, if you are parsing a whole site, I highly recommend Parse::MediaWikiDump. It parses the XML dump files very fast, and you don't even need a complete XML file.
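An untested sketch of the dump approach (the dump filename is a placeholder; it streams pages rather than loading the whole file, which is why a partial dump is fine):

```perl
use strict;
use warnings;
use Parse::MediaWikiDump;

# Stream article records out of a pages-articles XML dump
# (hypothetical filename -- use whatever dump you downloaded).
my $pages = Parse::MediaWikiDump::Pages->new('pages-articles.xml');

while (defined(my $page = $pages->next)) {
    next unless $page->namespace eq '';   # main namespace only

    # text() returns a reference to the wikitext string.
    printf "%s (%d bytes)\n", $page->title, length ${ $page->text };
}
```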


----- "Alec Clews" <alec.clews at gmail.com> wrote:

> G'Day,
> I have some Media Wiki pages, laid out for the benefit of humans
> (irregularly), with tables of values.
> Does anyone have a recommendation on a suitable module to scrape some
> of the values? I figured WWW::Mechanize, but I've never used it.
> Cheers
> -- 
> Alec Clews
> Personal <alec.clews at gmail.com>			Melbourne, Australia.
> Jabber:  alecclews at jabber.org.au		PGPKey ID: 0x9BBBFC7C
> Blog  http://alecthegeek.wordpress.com/
> _______________________________________________
> Melbourne-pm mailing list
> Melbourne-pm at pm.org
> http://mail.pm.org/mailman/listinfo/melbourne-pm

