westerman at purdue.edu
Thu May 22 13:30:22 PDT 2008
As people who attended the last PM meeting know, I talked about
YAML::Syck and how the YAML data serialization format can be used.
Derrick asked how the performance of YAML compares to just reading
and parsing a plain text file. I tried two different methods to find
this out.
First I generated a random, made-up data set of about 300,000 people
with addresses, names, phone numbers, etc. As one might expect, the
YAML file is larger than a non-YAML file since the YAML file has to
contain tag words in addition to the data. Exactly how much larger
depends on the keyword-size to data-size ratio, but in the worst case
one would probably expect no more than a 2:1 ratio. With the keywords
taking twice the space of the data, the YAML file would be 3 times as
large as the text file.
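The size overhead can be sketched in a few lines. This is only an illustration in Python, not the data set or the code used for the test above; the record, its field names, and the helper functions are all made up.

```python
# Hypothetical record; field names and values are invented for illustration.
record = {
    "Name": "Pat Example",
    "Street": "1 Main Street",
    "City": "West Lafayette",
    "Phone": "(765) 555-0100",
}


def as_delimited(rec):
    """One tab-separated line holding only the values (no keywords)."""
    return "\t".join(rec.values()) + "\n"


def as_tagged(rec):
    """One 'key: value' line per field, every value double-quoted.

    A flat mapping written this way is also valid YAML.
    """
    return "".join(f'{key}: "{value}"\n' for key, value in rec.items())


def size_ratio(rec):
    """Size of the tagged form divided by the size of the delimited form."""
    return len(as_tagged(rec)) / len(as_delimited(rec))


if __name__ == "__main__":
    print(f"tagged form is {size_ratio(record):.2f}x the delimited form")
```

The ratio grows as the keywords get longer relative to the values, which is the keyword-size to data-size effect described above.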
In the worst-case scenario YAML was about 2.5 times slower than a
straight text file 'read and parse'. This is despite the YAML parser
being written in C. In a better-case scenario YAML was still slower by
a factor of 2. We are talking about seconds instead of minutes here;
e.g., 21 seconds for YAML versus 10 seconds for the text file
read-n-parse.
Of course, at 300,000 records a person might just want to use a
database instead. So how does YAML work with small real-life data
sets? As my second test I modified our pipeline routines (which use
text files) to, optionally, read and write via YAML. There the datasets
are much smaller -- we do have a file with 100,000+ records but each
record is fairly small -- and most of the other datasets have far fewer
records, although more data per record.

So what is the conclusion? The read times for the YAML files and the
text files still differ, but since the files are small the difference
may be a second, if that.

What is more troubling is that YAML has problems with data types. I was
having problems reading in data of the form '01234' until I realized
that YAML was converting it to octal (except in cases like '09123',
which has the non-octal digit 9 in it). This can be coded around but,
still, it makes YAML not as friendly as desired.
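The octal surprise can be imitated with a short sketch. This is not YAML::Syck's code, and whether a given parser applies the rule depends on the parser and its settings; the Python functions below are hypothetical and model only the leading-zero case described above.

```python
import re

# Assumption: this pattern imitates the YAML 1.1 style rule that an unquoted
# scalar made of a leading 0 followed only by the digits 0-7 is read as an
# octal integer. Other implicit types (decimal, float, ...) are not modeled.
OCTAL_LIKE = re.compile(r"0[0-7]+")


def resolve_plain_scalar(text):
    """Return what an octal-aware loader would hand back for an unquoted scalar."""
    if OCTAL_LIKE.fullmatch(text):
        return int(text, 8)  # '01234' is read as a base-8 number
    return text  # anything else stays a string in this sketch


def quote_scalar(text):
    """Workaround: emit the value double-quoted so it is loaded as a string."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


if __name__ == "__main__":
    print(resolve_plain_scalar("01234"))  # 668, not the string '01234'
    print(resolve_plain_scalar("09123"))  # unchanged: 9 is not an octal digit
    print(quote_scalar("01234"))          # "01234"
```

Quoting values such as ID numbers or ZIP codes when writing the file is one way to code around the conversion.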
YAML does have the advantage of producing easier-to-read files.
For example, a YAML file could look like:
Phone: "(765) 494-0505"
Instead of a more cryptic line in a text file of:
1) YAML is slower than text read-n-parse, but not significantly so with
sub-100,000 record files.
2) YAML's data conversion can be troubling.
3) YAML does produce a more friendly-to-edit file.
I am still up in the air about YAML's usefulness. I will probably
continue to use it for some data files. I might slowly convert our
pipeline to using it, but carefully, because of the data conversion
problems.
Rick Westerman
westerman at purdue.edu
Bioinformatics specialist at the Genomics Facility
Phone: (765) 494-0505  FAX: (765) 496-7255
Department of Horticulture and Landscape Architecture
625 Agriculture Mall Drive
West Lafayette, IN 47907-2010
Physically located in room S049, WSLR