[Purdue-pm] YAML

Thu May 22 13:30:22 PDT 2008

   As people who attended the last PM meeting know, I talked about 
YAML::Syck and how the YAML data serialization format can be used.  
Derrick asked how the performance of YAML compares to just reading a 
file.

   I tried two different methods to find this out.  

   First I generated a random, made-up data set of about 300,000 people 
with addresses, names, phone numbers, etc.  As one might expect, the 
YAML file is larger than a non-YAML file since the YAML file has to 
contain tag words in addition to the data.  Exactly how much larger 
depends on the ratio of keyword size to data size, but in the worst case 
one would probably expect no more than a 2:1 ratio.  At that ratio the 
YAML file would be 3 times as large as the text file.  
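
   To spell out the arithmetic: if the text file holds only the data and 
the YAML file holds the data plus the tag words, the size ratio follows 
directly from the tag-to-data ratio.  A minimal sketch in Python (the 
function name is mine, and it ignores delimiters, indentation and 
newlines):

```python
def yaml_to_text_ratio(tag_bytes, data_bytes):
    """Size of a tagged (YAML-style) file relative to a data-only file.

    Assumes the text file is just the data and the YAML file is the
    data plus the tag words; separators and whitespace are ignored.
    """
    return (tag_bytes + data_bytes) / data_bytes


# Tag words twice as large as the data (the 2:1 worst case above).
print(yaml_to_text_ratio(2, 1))  # 3.0
```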

   In the worst-case scenario YAML was about 2.5 times slower than a 
straight text file 'read and parse'.  This is despite the YAML parser 
being written in C.  In a better-case scenario YAML was still slower by 
a factor of 2.  We are talking about seconds instead of minutes here; 
e.g., 21 seconds for YAML, 10 seconds for the text file read-n-parse.
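
   For anyone who wants to repeat this kind of comparison, here is a 
rough timing harness in Python.  It is only a sketch and not the code I 
used: the helper names and the synthetic records are made up for 
illustration, and only the text-file side is filled in.  The YAML side 
would be timed the same way by passing in whatever loader you use.

```python
import time


def best_time(func, *args, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best


def parse_delimited(lines):
    """Plain 'read and parse': split each colon-delimited record."""
    return [line.rstrip("\n").split(":") for line in lines]


# Synthetic, made-up records purely for timing purposes.
lines = ["%d:Last%d:First%d:(765) 555-%04d:\n" % (i, i, i, i % 10000)
         for i in range(10000)]

text_seconds = best_time(parse_delimited, lines)
print("text parse: %.4f s" % text_seconds)
# yaml_seconds = best_time(your_yaml_loader, your_yaml_text)  # hypothetical
```

   Taking the best of a few runs cuts down on noise from disk caching 
and whatever else the machine is doing.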

   Of course, at 300,000 records a person might just want to use a 
database instead.  So how does YAML work with small real-life data 
sets?  As my second test I modified our pipeline routines (which use 
text files) to optionally read and write via YAML.  There the datasets 
are much smaller -- we do have a file with 100,000+ records but each 
record is fairly small -- and most of the other datasets have far fewer 
records, although more data per record.  So what is the conclusion?  
Reading the YAML files is still slower than reading the text files, but 
since the files are small the difference may be a second, if that.  

   What is more troubling is that YAML has problems with data types.  I 
was having problems reading in data of the form '01234' until I realized 
that YAML was converting it to octal (except in cases like '09123', 
where the 9 is not a valid octal digit).  This can be coded around but, 
still, it makes YAML not as friendly as desired.
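
   To make the octal problem concrete, here is a small Python sketch 
that mimics the behaviour I saw.  It is not YAML::Syck's actual code; 
the patterns are modeled on the YAML 1.1 integer rules and the function 
names are my own.  It also shows one way to code around the problem: 
quote anything that would otherwise be read as a number.

```python
import re

# Modeled on YAML 1.1: a leading 0 followed by octal digits is octal.
OCTAL = re.compile(r"^[-+]?0[0-7]+$")
DECIMAL = re.compile(r"^[-+]?(0|[1-9][0-9]*)$")


def resolve_plain_scalar(text):
    """Guess the type of an unquoted value the way the loader appeared to."""
    if OCTAL.match(text):
        return int(text, 8)
    if DECIMAL.match(text):
        return int(text, 10)
    return text  # anything else stays a string


def quote_if_numeric_looking(text):
    """Double-quote a value that would otherwise be read as an integer."""
    if isinstance(resolve_plain_scalar(text), int):
        return '"%s"' % text
    return text


print(resolve_plain_scalar("01234"))       # 668, the octal reading
print(resolve_plain_scalar("09123"))       # 09123, 9 is not an octal digit
print(quote_if_numeric_looking("01234"))   # "01234"
```

   Quoting on the way out keeps identifiers such as '01234' as strings 
on the way back in, at the cost of a slightly noisier file.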

   YAML does have the advantage of producing easier-to-read files.  
For example, a YAML file could look like:

1:
  LastName: Westerman
  FirstName: Rick
  Phone: "(765) 494-0505"

   instead of a more cryptic line in a text file such as:

1:Westerman:Rick:(765) 494-0505:
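
   Going from one layout to the other is mechanical.  A minimal Python 
sketch, assuming the field order shown above and that no field contains 
a colon, a double quote or a backslash (the function names are made up 
for illustration, and this is not our pipeline code):

```python
# Field names taken from the example record; the order is an assumption.
FIELDS = ("LastName", "FirstName", "Phone")


def parse_record(line):
    """Split '1:Westerman:Rick:(765) 494-0505:' into a key and a dict."""
    parts = line.rstrip("\n").split(":")
    key, values = parts[0], parts[1:1 + len(FIELDS)]
    return key, dict(zip(FIELDS, values))


def to_tagged_text(key, record):
    """Write the record in the tagged, indented layout.

    Every value is double-quoted so nothing gets converted to a number.
    """
    lines = ["%s:" % key]
    for name in FIELDS:
        lines.append('  %s: "%s"' % (name, record[name]))
    return "\n".join(lines) + "\n"


key, record = parse_record("1:Westerman:Rick:(765) 494-0505:")
print(to_tagged_text(key, record))
```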

   Bottom line:

   1) YAML is slower than text read-n-parse, but not significantly so 
for files with fewer than 100,000 records.

   2) YAML's data conversion can be troubling.

   3) YAML does produce a more friendly-to-edit file.

   I am still up in the air about YAML's usefulness.  I will probably 
continue to use it for some data files.  I might slowly convert our 
pipeline to use it, but carefully, because of the data conversion 
problems.

-- 
Rick Westerman
westerman at purdue.edu
Bioinformatics specialist at the Genomics Facility.
Phone: (765) 494-0505  FAX: (765) 496-7255
Department of Horticulture and Landscape Architecture
625 Agriculture Mall Drive
West Lafayette, IN 47907-2010
Physically located in room S049, WSLR building