[Pdx-pm] Wikipedia dump file XML shootout

Tyler Riddle triddle at gmail.com
Sun Dec 6 11:20:24 PST 2009


Hello mongers,

I've been researching which perl XML processing schemes are fastest as
I design my replacement for Parse::MediaWikiDump. What I've wound up
with is a nice benchmarking system and a large pile of test cases. I'm
hoping there are some XML gurus, or just weirdos like me who like to
make things go as fast as possible, who are interested in either
creating a benchmark for their favorite XML processor or seeing if my
existing SAX handlers and such could be optimized further. If anyone
is interested you can get the source via SVN from
https://triddle.projecthut.com/svn/triddle/XML_Speed_Test/

The README for the project is at the end of this email just for good measure.

Cheers,

Tyler Riddle

-- 
If you wish to make an apple pie from scratch you must first invent
the universe. -- Carl Sagan

__END__

ABOUT

This is a benchmark system for XML parsers against various language editions of
the Wikipedia. The benchmark is to print all the article titles and text of a
dump file specified on the command line to standard output. There are
implementations for many perl parsing modules, both high and low level, as well
as implementations written in C that are very fast.

The benchmark.pl program is used to run a series of benchmarks. It takes two
required arguments and one optional argument. The first required argument is a
path to a directory full of tests to execute. The second required argument is
a path to a directory full of dump files to execute the tests against. The
files in both directories are processed in the order given by sort() on their
names. The optional third argument is the number of iterations to perform; the
default is 1.
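
For example, a run with three iterations might look like this (the directory
names here are illustrative; point it at whatever holds your tests and dumps):

  perl benchmark.pl tests/ dumps/ 3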

Output goes to two files: results.log and results.data. Both are Data::Dumper
output of an internal data structure that represents the test report. The
results.log file is written each time all the tests have been run against a
specific file and lets you keep an eye on how long-running jobs are
progressing. The results.data file holds the cumulative data for all
iterations and is written at the end of the entire run.
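
Because both files are plain Data::Dumper output they can be loaded back into
perl for further analysis. A minimal sketch, assuming the dump uses
Data::Dumper's default $VAR1 variable name:

  use strict;
  use warnings;

  my $text = do {
      open(my $fh, '<', 'results.data') or die "could not open results.data: $!";
      local $/;    # slurp the whole file
      <$fh>;
  };

  our $VAR1;
  eval $text;
  die "could not evaluate results.data: $@" if $@;

  # $VAR1 now holds the report structure and can be walked normally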

The benchmark.pl utility and all of the tests are only guaranteed to work
if executed from the root directory of this software package. The C based
parsers are in the bin/ directory and can be compiled by executing make in
that directory. The Iksemel parser is not currently functional for unknown
reasons.

THE CHALLENGE

The most important thing to keep in mind is that the English Wikipedia is
currently 22 gigabytes of XML. You will not be able to use any XML processing
system that requires the entire document to fit into RAM.
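
In practice that means parsing in a streaming fashion and discarding data as
you go. As one illustration (not part of this test suite), XML::Twig can
handle arbitrarily large documents so long as each finished element is purged;
the handler body here is only a placeholder:

  use strict;
  use warnings;
  use XML::Twig;

  my $twig = XML::Twig->new(
      twig_handlers => {
          page => sub {
              my ($twig, $page) = @_;
              # ... work with one <page> element at a time ...
              $twig->purge;    # free everything parsed so far
          },
      },
  );

  $twig->parsefile($ARGV[0]);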

Each benchmark must gather up the title and text of each Wikipedia article in
an arbitrary XML dump file. In the spirit of making this test approximate a
real world scenario, you must collect all the character data together and make
it available at one time. For instance, the perl benchmarks invoke a common
method that prints the article title and text for them; the C based tests
simply collect all the data and print it out at once.
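
As a rough sketch of the shape a perl test takes (the print_article routine
below stands in for the shared output method; see the existing tests for the
real interface):

  use strict;
  use warnings;
  use XML::Parser;

  my ($element, %article);

  sub print_article {
      my ($title, $text) = @_;
      print "$title\n$text\n";
  }

  my $parser = XML::Parser->new(Handlers => {
      Start => sub { $element = $_[1]; },
      Char  => sub {
          # expat may deliver character data in many small chunks, so append
          $article{$element} .= $_[1]
              if defined $element and ($element eq 'title' or $element eq 'text');
      },
      End   => sub {
          if ($_[1] eq 'page') {
              # hand the complete article off in one piece
              print_article($article{title}, $article{text});
              %article = ();
          }
          undef $element;
      },
  });

  $parser->parsefile($ARGV[0]);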

TEST DATA

You can find various MediaWiki dump files via http://download.wikimedia.org/
I use the following Wikipedia dump files in various languages for my testing:

http://download.wikimedia.org/enwiki/20091103/enwiki-20091103-pages-articles.xml.bz2
http://download.wikimedia.org/eowiki/20091204/eowiki-20091204-pages-articles.xml.bz2
http://download.wikimedia.org/simplewiki/20091203/simplewiki-20091203-pages-articles.xml.bz2

TODO

  * It would be nice if the C based parsers were glued to perl with XS so they
    invoke the Bench::Article method just like the perl based parsers do.

  * Fix the Iksemel parser.

  * One common string buffering library between all C based parsers would be
    nice, but I could not get this functional.

AUTHOR

Test suite and initial tests created by Tyler Riddle <triddle at gmail.com>.
Please send any patches to me and feel free to add yourself to the
contributors list.

CONTRIBUTORS

  * No one yet - you know you want to be first!

