[Melbourne-pm] and the winner is C! (so far anyway, no big surprise)

Fri May 21 01:23:26 PDT 2010

results first!

Here's the leaderboard of CSV readers in various languages, compared to C,
for a 100,000 line CSV file:

	C           1.00
	brace       1.16
	perl XS    11.33
	(bad) go   17.50
	scala      19.32
	perl       62.51

I wrote a C version, based on my brace version (the 1.00 above).  It's 14%
faster than the brace version because it's not using fancy "vectors" or
"buffers", just arrays.  It's less flexible though: it has a limited line
length and number of fields, and does not check for overflow at the moment.
I could fix that with a minimal performance hit, but that's what those brace
data structures are for anyway, and I can't be bothered doing it all again
with realloc right now.
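
To make the "just arrays" approach concrete, here is a minimal sketch of an
in-place field splitter that uses a fixed array of field pointers and does
include the overflow check mentioned above.  This is not the read-c source
(which isn't included in this mail); the names split_csv_line and MAX_FIELDS
and the limit of 64 are assumptions made up for illustration.

```c
#include <stddef.h>

#define MAX_FIELDS 64   /* assumed limit for illustration, not from read-c */

/*
 * Split one CSV line in place.  Each fields[i] ends up pointing at a
 * NUL-terminated field inside `line`.  Quoted fields may contain commas,
 * and "" inside quotes becomes a single ".
 * Returns the number of fields, or -1 if the line has more than max_fields.
 */
int split_csv_line(char *line, char *fields[], int max_fields)
{
    int n = 0;
    char *in = line;    /* read position */
    char *out;          /* write position, never ahead of `in` */
    char c;

    for (;;) {
        if (n == max_fields)
            return -1;              /* overflow check: too many fields */
        out = in;
        fields[n++] = out;

        if (*in == '"') {           /* quoted field */
            in++;
            while (*in) {
                if (*in == '"') {
                    if (in[1] == '"') {     /* escaped quote */
                        *out++ = '"';
                        in += 2;
                    } else {                /* closing quote */
                        in++;
                        break;
                    }
                } else {
                    *out++ = *in++;
                }
            }
        }

        /* unquoted field, or anything trailing a closing quote */
        while (*in && *in != ',' && *in != '\n' && *in != '\r')
            *out++ = *in++;

        c = *in;        /* remember the delimiter before overwriting it */
        *out = '\0';
        if (c != ',')
            return n;   /* end of line */
        in++;           /* skip the comma */
    }
}
```

A caller would fgets a line into a fixed char buffer, pass it in along with
a char *fields[MAX_FIELDS] array, and treat -1 as "line has too many
fields".  Because unescaping only ever shrinks a field, the write pointer
never overtakes the read pointer, so no second buffer is needed.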

	$ ./read-c data.csv >/dev/null
	Code took 0.195998

	$ ./read.b data.csv > /dev/null 
	Code took: 0.227998

	$ ./read.pl data.csv >/dev/null 
	Code took 2.21998 wallclock secs ( 2.17 usr +  0.01 sys =  2.18 CPU) @  0.46/s (n=1)

	$ ./read-pp.pl data.csv >/dev/null
	Code took 12.2519 wallclock secs (11.99 usr +  0.00 sys = 11.99 CPU) @  0.08/s (n=1)

People can add languages / programs to this table just by comparing speed to
the perl or C version running on their test machine, no need to run every
version.  (I didn't try to run the scala or go versions yet.)  How about a
parallel version that runs on the GPU?  ;)
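
The arithmetic for slotting a new program into the table is just a ratio
against whichever reference you can run locally, scaled by that reference's
existing score.  A small sketch (the function name relative_to_c is made up
for illustration):

```c
/*
 * Score a new program relative to C without running the C version.
 *   t_new     - time of the new program on your machine (seconds)
 *   t_ref     - time of a reference program on the same machine (seconds)
 *   ref_score - that reference's existing score in the table (C = 1.00)
 */
double relative_to_c(double t_new, double t_ref, double ref_score)
{
    return t_new / t_ref * ref_score;
}
```

For example, using only the timings in this mail, scoring read-pp.pl against
read.pl (12.2519 / 2.21998 * 11.33) gives about 62.5, which agrees with the
62.51 obtained by comparing directly against the C version.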

I'm not saying "go" is bad; I'm saying the go code used must have been bad,
because go is definitely not that much slower than C.

I'd like to see what JavaScript + V8 and JavaScript + Tracemonkey could do
here, and how Java + android javac would do.  I'm guessing these three would
perform better than perl XS but be at least 5 times slower than C.  I'd also
like to see how a "malloc and free everything" C version would compare; I'm
guessing about 3 or 4 times slower.

I should fix the C / brace version to use fread not fgets for better
correctness (allowing \n in quoted fields) and maybe to go a little faster.
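
The reason fgets gets this wrong is that it stops at the first \n whether or
not it is inside quotes.  With a block read via fread, record boundaries can
instead be found by tracking quote state.  A minimal sketch of that scan,
assuming standard double-quote CSV quoting (the function name
csv_record_length is hypothetical, and this is not code from read-c or
brace):

```c
#include <stddef.h>

/*
 * Return the length of the first CSV record in buf[0..len), including its
 * terminating '\n'.  A '\n' inside a quoted field is treated as data.
 * Returns 0 if the buffer ends before the record does, in which case the
 * caller would need to read more data and try again.
 */
size_t csv_record_length(const char *buf, size_t len)
{
    int in_quotes = 0;
    size_t i;

    for (i = 0; i < len; i++) {
        if (buf[i] == '"')
            in_quotes = !in_quotes;   /* "" toggles twice: no net change */
        else if (buf[i] == '\n' && !in_quotes)
            return i + 1;
    }
    return 0;
}
```

Whether this ends up faster than fgets would have to be measured; the sketch
only addresses the correctness half of the argument.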

The printf output, even going to /dev/null, took more than half the time for
the C code; so if we are testing just CSV reading, C is actually "more faster"
than my figures indicate.

	$ ./read-quiet-c data.csv
	Code took 0.080000

The perl output apparently takes a good chunk of the time for perl XS too
(about a third), though under a tenth for pure perl:

	$ ./read-quiet.pl data.csv 
	Code took 1.48799 wallclock secs ( 1.47 usr +  0.00 sys =  1.47 CPU) @  0.68/s (n=1)

	$ ./read-pp-quiet.pl data.csv
	Code took 11.3239 wallclock secs (11.13 usr +  0.00 sys = 11.13 CPU) @  0.09/s (n=1)

Still, for our task of CSV reading, excluding the calculation and output,
a rough bit of C code is 18.60 times faster than Perl XS (Text::CSV_XS) and
141.55 times faster than pure Perl (Text::CSV_PP).  Perl+XS may be able to
compete with Scala but neither can compete with C (for speed).

sed|awk performs pretty well:

	$ time < data.csv sed 's/,/   /g; s/""/"/g; s/"//g;' | awk '{print $1" is "$2*$3}' >/dev/null

	real    0m1.228s

It's also a lot shorter (although not quite correct, silly CSV!).

Using clean TSV instead of silly CSV, gawk achieves 0.428s
and mawk achieves 0.268s which is getting pretty close to C.

Toby Corkindale wrote:
> OK.
> I do have the right packages installed, I'm pretty sure of that - as I  
> said, I can see the include files in /usr/include.

I'll try building brace at home later, if my laptop works; it's acting a bit
flaky, and I think a RAM module is loose or something.

Sam