[Pdx-pm] Wikipedia dump file XML shootout

Erik Hollensbe erik at hollensbe.org
Sun Dec 6 13:03:49 PST 2009

On 12/6/2009 3:41 PM, Tyler Riddle wrote:
>> One thing about your benchmark that I am concerned about: the md5
>> calculation appears to be affecting the time of the test. I'm not sure if
>> this is intentional or not, but I thought I'd mention it anyways.
> That's not intentional, I worked to try to avoid it. bench_child()
> passes the data from times() straight back to the main process but the
> main process only uses the child process times fields, from
> parse_result() (with a new comment):

I'm fairly certain ->add() is what does the bulk of the processing;
otherwise you would have to hold 22GB in memory for later digestion, which
would prove... problematic on most machines. Unless I am reading your code
incorrectly, that call happens in the middle of the timed section, so its
cost ends up in the reported numbers.
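
To illustrate what I mean by the incremental call doing the work, here is a
minimal sketch in Python rather than the benchmark's Perl. It assumes
hashlib's update() behaves like Digest::MD5's ->add() for this purpose; the
function name and the sample chunks are made up for the example.

```python
import hashlib


def incremental_md5(chunks):
    """Feed chunks to MD5 one at a time; only the running state is kept."""
    md5 = hashlib.md5()
    for chunk in chunks:
        md5.update(chunk)  # the hashing work happens here, chunk by chunk
    return md5.hexdigest()


# Hypothetical stand-in for pieces of a dump file handed over by a parser.
chunks = [b"<page>", b"<title>Example</title>", b"</page>"]

# Feeding the pieces one by one gives the same digest as hashing it all at once.
assert incremental_md5(chunks) == hashlib.md5(b"".join(chunks)).hexdigest()
```

Nothing is buffered for the end, so whatever CPU time the digest needs is
spent inside the loop that feeds it.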

An alternative (and likely not much better) solution would be to write the
data to disk and compute the digest in one pass afterwards, or (perhaps
better) to time the MD5 calls separately, sum that time, and subtract it
from the overall running time.
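
A rough sketch of the subtraction idea, again in Python and not taken from
the benchmark: parse_with_digest() and the sample input are hypothetical,
and time.process_time() stands in for whatever CPU timer the benchmark
actually uses.

```python
import hashlib
import time


def parse_with_digest(chunks):
    """Return (total_cpu, digest_cpu, hexdigest) for one pass over chunks."""
    md5 = hashlib.md5()
    digest_cpu = 0.0
    start = time.process_time()
    for chunk in chunks:
        # ... the parser's own work on `chunk` would go here ...
        t0 = time.process_time()
        md5.update(chunk)
        digest_cpu += time.process_time() - t0  # CPU time spent in the digest
    total_cpu = time.process_time() - start
    return total_cpu, digest_cpu, md5.hexdigest()


chunks = [b"x" * 65536 for _ in range(64)]  # made-up input
total, digest, hexdigest = parse_with_digest(chunks)
parse_only = total - digest  # estimate of the time spent outside the digest
```

The timer calls add overhead of their own, and with small chunks the timer
resolution may be too coarse, so the result is an estimate at best.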

>> I would be really surprised if the iksemel parser (once it's working,
>> assuming it's a smartly-written binding) doesn't blow everything else out of
>> the water; one of the iksemel features is that it throws away giant
>> featuresets in the name of performance.
> I'm also really interested in seeing how Iksemel performs but I can't
> get it working. Hopefully someone more knowledgeable in C can figure
> out what I did wrong. It's currently throwing an Iksemel memory error
> which seems strange to me. It follows the same pattern as the parsers
> I created in C for libxml and expat, it should be straight forward,
> but something's wonky.

If you find the time to get this working, let me (or the list) know what
you find; I am genuinely curious.

