[Melbourne-pm] Designing modules to handle large data files

Sam Watkins sam at nipl.net
Tue Aug 24 21:23:31 PDT 2010


On Tue, Aug 24, 2010 at 04:05:38PM +1000, Daniel Pittman wrote:
> > The format I'm suggesting is like YAML-lite, without the kitchen sink, as
> > used in email and http headers.
> 
> Ah.  So, it is entirely insensitive to linear whitespace inline, are not
> LWS-preserving, have a limit of 998 and 78 characters total and per-line,
> possibly including or excluding LWS, in an implementation defined fashion,
> have case-insensitive and ASCII-only keys, and contains only ASCII characters
> without encoding in one of URL or RFC2047 MIME word format, then.
> 
> Right?

No.  I assume you're being sarcastic and attempting to demonstrate how unsimple
the header formats are.  I am impressed by your knowledge anyway!  I use
something simpler than that.  If a particular application wants to reject long
lines or specify an encoding, that's not my concern.

> > The only addition over those is the blank-line as record separator.  It's
> > the same as debian package files.
> 
> Once you add that it becomes clearer.  So, do you support the 'single period'
> syntax for whitespace inside a line-folded record, and the optional non-folded
> headers that Debian package control files do, or not?

I think it's useful to support multi-line values.  The single period thing
sounds reasonable, but I would probably go with simplicity over readability and
just use a lone tab or indent to indicate a blank line in the middle of a
value, like this (a bad example as addresses seldom contain blank lines!):

address: Spry Street,
	Coburg North
	
	3058

Given that any more value lines after such a blank line must be indented, and
headers must not be indented, it's not really a visual problem to omit the
period.  The difficulty might be that some editors are reluctant to indent
blank lines, but that is no big problem, I think.
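
A minimal sketch of a parser for the format as described so far, written in
Python purely for illustration.  It rests on assumptions this thread does not
settle: records are separated by a completely empty line, a header line is
"key: value", and a continuation line starts with exactly one tab, which is
stripped.  The name parse_records and the sample keys are hypothetical.

```python
def parse_records(text):
    """Parse blank-line-separated records made of 'key: value' lines.

    Assumptions (not a specification): a continuation line starts with
    one tab, and a line holding only a tab is a blank line in the value.
    """
    records, current, last_key = [], {}, None
    for line in text.split("\n"):
        if line == "":
            # A truly empty line ends the current record.
            if current:
                records.append(current)
            current, last_key = {}, None
        elif line.startswith("\t"):
            # Continuation: drop the single leading tab, keep the rest.
            if last_key is None:
                raise ValueError("continuation line before any key")
            current[last_key] += "\n" + line[1:]
        else:
            key, sep, value = line.partition(":")
            if not sep:
                raise ValueError("expected 'key: value', got %r" % line)
            # Drop exactly one space after the colon, if present.
            if value.startswith(" "):
                value = value[1:]
            current[key] = value
            last_key = key
    if current:
        records.append(current)
    return records


sample = "note: first line\n\tsecond line\n\t\n\tafter the blank\n\nname: Sam\n"
records = parse_records(sample)
```

Here records[0]["note"] comes back as four lines, the third of them empty,
and the empty line after the note starts a second record.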

> > Other formats like XML and even YAML and JSON are unnecessarily
> > over-complicated in my opinion.  Simplicity, Clarity, Generality!!
> 
> Sadly, without defining what you mean that very vague description doesn't
> actually *specify* anything, just give a vague (and English/ASCII oriented)
> hint in the general direction of what you were thinking.

Sure, this conversation is not a specification.  The format I have in mind is
crystal clear, simple and unambiguous, and I can supply parsers and formatters
for it in Perl if you like.

> Much as I hate, loath and detest much of the hype around it, the one thing
> that XML got right (which, naturally, it inherited from SGML) is that it
> actually specifies the details of how you process arbitrary data in that
> format.

I do like plain simple XML for markup; that's what it's for.  I do not like it
as a hierarchical file format for storing records; that is a misuse of XML.

The format I'm describing can hold values with arbitrary binary data (or text
in any chosen encoding) without the need for any escaping or encoding.  This is
simple and comprehensive.  It would normally be used with utf-8 encoded keys
and data I suppose, but it would be acceptable to insert binary or
differently-encoded data for certain particular keys.  The application can
interpret the values however it wishes.

> Most of the "simple" things either don't scale to cover the world, or don't
> actually specify enough that you end up with crazy, crazy things.  (STOMP,
> I am lookin' right at you, here.)

So what are you saying, that I'm crazy, crazy?
Which of my things are 'crazy, crazy'?
Don't tell me they've got you maintaining some of my perl code?
I don't understand your apparent hostility; the 'maintainer' conjecture is the
only explanation that comes to mind.

I think a data format that can be produced and parsed in, say, 10 lines of
code, and that is simple, clear and general, is a lot less crazy than the crock
of complexity and featuritis which is full-blown XML.
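
To give a rough idea of the size involved, here is a sketch of a formatter,
in Python purely for illustration.  It assumes a tab marks a continuation
line, that keys contain no colon or newline and do not start with a tab, and
that every record has at least one key.  The name format_records is
hypothetical.

```python
def format_records(records):
    """Write a list of dicts as 'key: value' lines, one blank line per record.

    Assumption: a newline inside a value is followed by a tab on output,
    so a blank line in a value becomes a line holding only a tab.
    """
    out = []
    for record in records:
        for key, value in record.items():
            out.append(key + ": " + value.replace("\n", "\n\t") + "\n")
        out.append("\n")  # empty line closes the record
    return "".join(out)


text = format_records([{"address": "Spry Street,\nCoburg North\n\n3058"}])
```

Apart from the tab inserted after each newline, the value is written through
unchanged, with no escaping or quoting.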


Sam

