[Melbourne-pm] brace / C CSV reader, Re: Still performing well..
toby.corkindale at strategicdata.com.au
Thu May 20 21:39:21 PDT 2010
On 21/05/10 14:05, Sam Watkins wrote:
> hi Toby,
>> So, are you going to provide us with a well-written Brace program that
>> performs the same task as the Go/Scala/Perl implementations then?
> Ok, I did write a brace / C program that does the same thing. It's not
> terribly well-written but it doesn't allocate memory in its inner loop. Brace
> translates directly to C, so it performs the same as an equivalent C program
> using the same techniques.
>> It's very simple, should only take you 5 minutes if you're familiar with
>> your language.
> lol, I wrote the CSV parser too so it took a little longer than that!
> it might be useful for me in future though.
>> Only requirements are:
>> You perform "good" practice as far as closing files and freeing memory
>> goes - by which I mean that you do eventually close the file and free
>> memory, and your program should not grow to infinite memory size if the
>> input file keeps growing.
> Ok, in this case I don't need to do that explicitly in brace. Two of the block
> structures I used take care of it for me. 'F_in' opens and closes a file.
> 'eachline' reads all the lines from a file using a single buffer, which it
> frees at the end.
>> So you can use a static buffer for every row of the CSV if you like.
> 'eachline' does use a single resizable buffer.
>> Given the same input file, your output should match the Perl output.
> There were a few discrepancies originally because I had used float rather than
> double. I changed it to use double (num), and now it does match exactly.
> Here are the times on my VPS:
> $ ./read.pl data.csv>/dev/null
> Code took 2.23198 wallclock secs ( 2.19 usr + 0.00 sys = 2.19 CPU) @ 0.46/s (n=1)
> $ ./read.b data.csv>/dev/null
> Code took: 0.227998
> So my brace / C version is nearly 10 times faster than perl, even though the
> perl version uses Text::CSV_XS, which is written in C! The difficulty for
> Text::CSV_XS is that it must allocate and use perl data structures.
> I don't doubt that Go would perform similarly to C or brace, if the CSV parser
> were written in the same way.
> I repeated the test with Text::CSV_PP, which uses pure perl rather than C / XS
> to parse the CSV, and might be a better comparison of the two langauges for the
> stated task. I get this result:
> $ ./read-pp.pl data.csv>/dev/null
> Code took 12.1239 wallclock secs (11.86 usr + 0.01 sys = 11.87 CPU) @ 0.08/s (n=1)
> I'm surprised that the pure-perl version is only about 5 times slower than the
> XS version, that's a real credit to perl. However the brace / C version is 53
> times faster than the perl version.
> My brace program is not low-level programming. It uses my library libb and is
> shorter and more high-level than the other programs shown. It uses
> medium-level generic data structures: resizable string buffers, and vectors.
> If I avoided these and used low-level C arrays, it could run much faster. But
> it's already twice as fast as GNU wc on the same data. Actually that is
> because it's using UTF-8, in the C locale it's 6 times faster than my brace
> program. If the brace program / my libs were optimized for speed I think it
> could approach that speed for parsing CSV.
> For myself, I prefer to use TSV format with C-style escapes \t \n \\.
> Unlike CSV, that works nicely with awk, cut and other unix tools.
> I'll add the CSV parser to my brace library so I can use it again.
> Sam Watkins
> Here is the main program in brace, as you can see it's about half the size of
> the perl version and 1/4 the size of the scala version. Including the CSV
> parser it's about the same size as the scala version! brace is work in
> progress, so this program could be improved as I improve the library.
> (Accessing objects and vector elements is a bit ugly at the moment.)
> I can provide the stand-alone C translation of the brace version if you'd like
> to test it.
.. that might be neccessary, as I just tried compiling Brace on my
computer, and it fails to build:
.all.c: In function ‘gr__mitshm_fault_h’:
.all.c:16876: error: ‘X_ShmAttach’ undeclared (first use in this function)
.all.c:16876: error: (Each undeclared identifier is reported only once
.all.c:16876: error: for each function it appears in.)
That thing is declared in /usr/include/X11/extensions/shmproto.h so
maybe it's just not being included in your code somewhere?
More information about the Melbourne-pm