[Pdx-pm] Wikipedia dump file XML shootout
Erik Hollensbe
erik at hollensbe.org
Sun Dec 6 13:03:49 PST 2009
On 12/6/2009 3:41 PM, Tyler Riddle wrote:
>> One thing about your benchmark that I am concerned about: the md5
>> calculation appears to be affecting the time of the test. I'm not sure if
>> this is intentional or not, but I thought I'd mention it anyway.
>>
> That's not intentional; I worked to try to avoid it. bench_child()
> passes the data from times() straight back to the main process but the
> main process only uses the child process times fields, from
> parse_result() (with a new comment):
>
I'm fairly certain ->add() is what does the bulk of the processing;
otherwise, trying to stuff 22GB into memory for later digestion would
prove... problematic on most machines. Unless I am reading your code
incorrectly, that call happens in the middle of the timed region.

An alternative (and likely not much better) solution would be to write
the data to disk and calculate the digest wholesale afterwards, or
(perhaps better) to time the MD5 calls separately and subtract that
total from the overall running time.
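
To make the "sum and subtract" idea concrete, here is a rough sketch. It
is written in Python rather than Perl purely for illustration:
hashlib's update() stands in for Digest::MD5's ->add(), and the name
timed_parse and the chunks argument are hypothetical, not anything taken
from the benchmark code. It assumes CPU time (as with times()) is the
quantity of interest.

```python
import hashlib
import time


def timed_parse(chunks):
    """Hash chunks incrementally and report CPU time with hashing excluded.

    Returns (hexdigest, total_seconds, md5_seconds, adjusted_seconds).
    """
    digest = hashlib.md5()
    md5_seconds = 0.0

    start = time.process_time()
    for chunk in chunks:
        # Hypothetical: the parser's real per-chunk work would happen here.

        # Time only the incremental digest call, the analogue of ->add().
        t0 = time.process_time()
        digest.update(chunk)
        md5_seconds += time.process_time() - t0
    total_seconds = time.process_time() - start

    # Subtract the accumulated hashing time from the overall figure.
    adjusted_seconds = total_seconds - md5_seconds
    return digest.hexdigest(), total_seconds, md5_seconds, adjusted_seconds
```

The extra timer calls carry their own overhead, and with many small
chunks the clock resolution may swamp the measurement, so the adjusted
figure is an estimate at best.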
>> I would be really surprised if the iksemel parser (once it's working,
>> assuming it's a smartly-written binding) doesn't blow everything else out of
>> the water; one of the iksemel features is that it throws away giant
>> featuresets in the name of performance.
>>
>>
> I'm also really interested in seeing how Iksemel performs but I can't
> get it working. Hopefully someone more knowledgeable in C can figure
> out what I did wrong. It's currently throwing an Iksemel memory error
> which seems strange to me. It follows the same pattern as the parsers
> I created in C for libxml and expat, so it should be straightforward,
> but something's wonky.
>
If you find the time to get this working, please let me (or the list)
know what you discover; I am genuinely curious.
-Erik