[Melbourne-pm] brace / C CSV reader, Re: Still performing well..

Thu May 20 21:05:10 PDT 2010

hi Toby,

> So, are you going to provide us with a well-written Brace program that  
> performs the same task as the Go/Scala/Perl implementations then?

Ok, I did write a brace / C program that does the same thing.  It's not
terribly well-written but it doesn't allocate memory in its inner loop.  Brace
translates directly to C, so it performs the same as an equivalent C program
using the same techniques.

> It's very simple, should only take you 5 minutes if you're familiar with  
> your language.

lol, I wrote the CSV parser too so it took a little longer than that!
it might be useful for me in future though.

> Only requirements are:
> You perform "good" practice as far as closing files and freeing memory  
> goes - by which I mean that you do eventually close the file and free  
> memory, and your program should not grow to infinite memory size if the  
> input file keeps growing.

Ok, in this case I don't need to do that explicitly in brace.  Two of the block
structures I used take care of it for me.  'F_in' opens and closes a file.
'eachline' reads all the lines from a file using a single buffer, which it
frees at the end.

> So you can use a static buffer for every row of the CSV if you like.

'eachline' does use a single resizable buffer.

> Given the same input file, your output should match the Perl output.

There were a few discrepancies originally because I had used float rather than
double.  I changed it to use double (num), and now it does match exactly.

Here are the times on my VPS:

	$ ./read.pl data.csv >/dev/null
	Code took 2.23198 wallclock secs ( 2.19 usr +  0.00 sys =  2.19 CPU) @  0.46/s (n=1)

	$ ./read.b data.csv >/dev/null
	Code took: 0.227998

So my brace / C version is nearly 10 times faster than perl, even though the
perl version uses Text::CSV_XS, which is written in C!  The difficulty for
Text::CSV_XS is that it must allocate and use perl data structures.

I don't doubt that Go would perform similarly to C or brace, if the CSV parser
were written in the same way.

I repeated the test with Text::CSV_PP, which uses pure perl rather than C / XS
to parse the CSV, and might be a better comparison of the two langauges for the
stated task.  I get this result:

	$ ./read-pp.pl data.csv >/dev/null
	Code took 12.1239 wallclock secs (11.86 usr +  0.01 sys = 11.87 CPU) @  0.08/s (n=1)

I'm surprised that the pure-perl version is only about 5 times slower than the
XS version, that's a real credit to perl.  However the brace / C version is 53
times faster than the perl version.

My brace program is not low-level programming.  It uses my library libb and is
shorter and more high-level than the other programs shown.  It uses
medium-level generic data structures: resizable string buffers, and vectors.
If I avoided these and used low-level C arrays, it could run much faster.  But
it's already twice as fast as GNU wc on the same data.  Actually that is
because it's using UTF-8, in the C locale it's 6 times faster than my brace
program.  If the brace program / my libs were optimized for speed I think it
could approach that speed for parsing CSV.

For myself, I prefer to use TSV format with C-style escapes \t \n \\.
Unlike CSV, that works nicely with awk, cut and other unix tools.
I'll add the CSV parser to my brace library so I can use it again.

Sam Watkins

Here is the main program in brace, as you can see it's about half the size of
the perl version and 1/4 the size of the scala version.  Including the CSV
parser it's about the same size as the scala version!  brace is work in
progress, so this program could be improved as I improve the library.
(Accessing objects and vector elements is a bit ugly at the moment.)

I can provide the stand-alone C translation of the brace version if you'd like
to test it.

	#!/usr/bin/env bx
	use b
	cstr usage[] =
		"input_file",
		NULL
	Main():
		bm_block("Code took"):
			args(cstr, file)
			new(v, vec, cstr, 3)
			F_in(file):
				rl()
				eachline(l):
					vec_clear(v)
					split_csv(v, l)
					cstr name = *(cstr*)v(v, 0)
					int i = atoi(*(cstr*)v(v, 1))
					num f = atof(*(cstr*)v(v, 2))
					sf("%s is %.02f", name, i*f)

Here is the CSV parser, this will go in my library libb:

	def split_csv(v, l):
		split_csv(v, l, my(p), my(f), my(c), my(o))

	def split_csv(v, l, p, f, c, o):
		char *p = l
		char *f = p
		repeat:
			char c = *p
			if c == '\0':
				vec_push(v, f)
				break
			 eif c == ',':
				*p++ = '\0'
				vec_push(v, f)
				f = p
			 eif c == '"':
				f = ++p
				csv_read_quoted(p, c, o)
			 else:
				++p

	def csv_read_quoted(p, c, o):
		repeat:
			char c = *p
			if c == '\0':
				error("csv: mismatched quotes")
			 eif c == '"':
				if p[1] == '"':
					char *o = p+1
					p += 2
					csv_read_quoted_copy(o, p, c)
					break
				 eif among(p[1], ',', '\0'):
					*p++ = '\0'
					break
				 else:
					error("csv: quote inside field must be followed by quote, comma or newline")
			 else:
				++p

	def csv_read_quoted_copy(o, i, c):
		repeat:
			char c = *i
			if c == '\0':
				error("csv: mismatched quotes")
			 eif c == '"':
				if i[1] == '"':
					++i
				 eif among(i[1], ',', '\0'):
					*o++ = '\0'
					++i
					break
				 else:
					error("csv: quote inside field must be followed by quote, comma or newline")
			 else:
				*o++ = *i++