[Pdx-pm] Wikipedia dump file XML shootout

Erik Hollensbe erik at hollensbe.org
Sun Dec 6 12:03:53 PST 2009


On 12/6/2009 2:20 PM, Tyler Riddle wrote:
> Hello mongers,
>
> I've been doing research into which Perl XML processing schemes are
> fastest as I design my replacement for Parse::MediaWikiDump. What I've
> wound up with is a nice benchmarking system and a large pile of test
> cases. I'm hoping there are some XML gurus, or just weirdos like me who
> like to make things go as fast as possible, who are interested in
> either creating a benchmark for their favorite XML processor or seeing
> whether my existing SAX handlers, etc., could be optimized further. If
> anyone is interested you can get the source via SVN from
> https://triddle.projecthut.com/svn/triddle/XML_Speed_Test/
>

This looks pretty cool.

One thing about your benchmark concerns me: the md5 calculation appears
to be affecting the measured time of the test. I'm not sure whether
that's intentional, but I thought I'd mention it anyway.
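
For what it's worth, one way around that is to time only the parse and
do the checksum afterwards, roughly like the sketch below (run_parser()
here is just a hypothetical stand-in for whichever handler is being
benchmarked):

    use strict;
    use warnings;
    use Time::HiRes qw(gettimeofday tv_interval);
    use Digest::MD5 qw(md5_hex);

    my $xml_file = shift @ARGV;

    # time only the parse itself
    my $t0      = [gettimeofday];
    my $output  = run_parser($xml_file);   # stand-in for the handler under test
    my $elapsed = tv_interval($t0);

    # verify the output outside the timed region
    my $checksum = md5_hex($output);
    printf "parse: %.3fs  md5: %s\n", $elapsed, $checksum;

That way the hashing cost stays constant across parsers and never leaks
into the numbers you're comparing.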

I would be really surprised if the iksemel parser (once it's working, 
and assuming the binding is smartly written) doesn't blow everything 
else out of the water; one of iksemel's defining features is that it 
throws away large feature sets in the name of raw performance.

-Erik

