[Pdx-pm] Wikipedia dump file XML shootout

Tyler Riddle triddle at gmail.com
Sun Dec 6 12:41:02 PST 2009

> This looks pretty cool.

Thanks, it's quite a few hours of work.

> One thing about your benchmark that I am concerned about: the md5
> calculation appears to be affecting the time of the test. I'm not sure if
> this is intentional or not, but I thought I'd mention it anyways.

That's not intentional; I tried to avoid it. bench_child()
passes the data from times() straight back to the main process, but the
main process only uses the child process time fields. From
parse_result() (with a new comment):

sub parse_result {
	my ($text) = @_;
	# the captures match the child times of the forked process
	# and the md5sum; the first two fields are skipped
	if ($text !~ m/^[0-9.]+ [0-9.]+ ([0-9.]+) ([0-9.]+) (.+)/) {
		return();
	}
	return ($1, $2, $3);
}

I think I got it right, even if it's not quite clear at first
glance. The forked process executes the benchmark via open(), so the
actual time spent processing the XML accumulates in the child
fields of times(). The md5 processing should show up only in
the parent fields of times(). Did I screw this up?
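
To make the parent/child split concrete, here is a minimal sketch of the
idea. It is not the benchmark's code: measure_child() and the inline
child command are hypothetical stand-ins, and it assumes a Unix-like
system where the list form of a piped open() is available. The work done
by the command run through open() should land in the child fields of
times() once close() has reaped it, while the md5 work lands in the
parent fields.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical helper, not taken from the benchmark: run @cmd in a child via
# a piped open(), hash its output in the parent, and return CPU-time deltas.
sub measure_child {
    my (@cmd) = @_;

    # times() returns (user, system, child user, child system) in seconds.
    my @before = times();

    open(my $fh, '-|', @cmd) or die "cannot start child: $!";
    my $output = do { local $/; <$fh> };
    $output = '' unless defined $output;
    close($fh) or die "child failed: status $?";    # close() reaps the child

    # The digest is computed here, in the parent process.
    my $md5 = md5_hex($output);

    my @after = times();
    return (
        parent => ($after[0] - $before[0]) + ($after[1] - $before[1]),
        child  => ($after[2] - $before[2]) + ($after[3] - $before[3]),
        md5    => $md5,
    );
}

# Stand-in workload: a second perl that burns some CPU and prints a number.
my @cmd = ($^X, '-e', 'my $x = 0; $x += $_ for 1 .. 5_000_000; print $x');
my %r = measure_child(@cmd);
printf "parent %.2fs, child %.2fs, md5 %s\n", @r{qw(parent child md5)};
```

The child fields only include children that have been waited for, which
is why the second times() call comes after close().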

> I would be really surprised if the iksemel parser (once it's working,
> assuming it's a smartly-written binding) doesn't blow everything else out of
> the water; one of the iksemel features is that it throws away giant
> featuresets in the name of performance.

I'm also really interested in seeing how Iksemel performs, but I can't
get it working. Hopefully someone more knowledgeable in C can figure
out what I did wrong. It's currently throwing an Iksemel memory error,
which seems strange to me. It follows the same pattern as the parsers
I created in C for libxml and expat, so it should be straightforward,
but something's wonky.

Thanks for the feedback, I appreciate it.

Tyler Riddle

If you wish to make an apple pie from scratch you must first invent
the universe. -- Carl Sagan
