[Melbourne-pm] Designing modules to handle large data files

Daniel Pittman daniel at rimspace.net
Wed Aug 25 05:48:25 PDT 2010

Sam Watkins <sam at nipl.net> writes:
> On Tue, Aug 24, 2010 at 04:05:38PM +1000, Daniel Pittman wrote:
>> > The format I'm suggesting is like YAML-lite, without the kitchen sink, as
>> > used in email and http headers.
>> Ah.  So, it is entirely insensitive to linear whitespace inline, is not
>> LWS-preserving, has a limit of 998 and 78 characters total and per-line,
>> possibly including or excluding LWS, in an implementation defined fashion,
>> has case-insensitive and ASCII-only keys, and contains only ASCII characters
>> without encoding in one of URL or RFC2047 MIME word format, then.
>> Right?
> No.  I assume you're being sarcastic and attempting to demonstrate how
> unsimple the header formats are.

I think mostly bitter, because "simple" formats usually don't turn out to be,
and like CSV this is one of my least favorite. :)

> I am impressed by your knowledge anyway!  I use something simpler than that.
> If a particular application wants to reject long lines or specify an
> encoding, that's not my concern.

*nod*  My point was, in part, that it isn't as simple as it sounds, because
HTTP headers and Email headers have a whole lot of really weird properties
as a result of their history.
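
To make a few of those properties concrete, here is a minimal sketch using
Python's standard `email` package (Python only because it is handy for a
short runnable example; the sample message and its `X-Note` header are made
up for illustration).  It shows case-insensitive key lookup, a folded
continuation line being unfolded, and duplicate keys being kept rather than
overwritten:

```python
import email
from email import policy

# Hypothetical sample message: a folded Subject line and mixed-case keys.
RAW = (
    "Subject: a fairly long\n"
    " subject line\n"
    "X-Note: first\n"
    "x-note: second\n"
    "\n"
    "body\n"
)

msg = email.message_from_string(RAW, policy=policy.default)

# Keys are matched case-insensitively.
same_header = msg["SUBJECT"] == msg["subject"]

# The continuation line is unfolded into the header value.
subject = str(msg["subject"])

# Duplicate keys are all kept, in order of appearance.
notes = [str(value) for value in msg.get_all("X-Note")]

print(same_header, repr(subject), notes)
```

None of that behaviour is obvious from "Key: value, one per line", which is
rather the point.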

So, yeah: for your own use, not a problem.  Any problem is easy when you don't
have to interoperate.  It gets tricky when you add other people, because you
never know which out of those we both might think were in or out unless we
actually discussed it. :)


>> > Other formats like XML and even YAML and JSON are unnecessarily
>> > over-complicated in my opinion.  Simplicity, Clarity, Generality!!
>> Sadly, without defining what you mean that very vague description doesn't
>> actually *specify* anything, just give a vague (and English/ASCII oriented)
>> hint in the general direction of what you were thinking.
> sure, this conversation is not a specification.  The format I have in mind
> is crystal clear, simple and unambiguous, and I can supply parsers and
> formatters for it in perl if you like.

Nah: just make sure that, if you are documenting it, you do supply a strict
specification with it — because it is harder than it sounds.

>> Much as I hate, loathe and detest much of the hype around it, the one thing
>> that XML got right (which, naturally, it inherited from SGML) is that it
>> actually specifies the details of how you process arbitrary data in that
>> format.
> I do like plain simple XML for markup, that's what it's for.  I do not like it
> as a hierarchical file format for storing records, that is a misuse of XML.

*nod*  SGML is terrible for structuring data.  It is wonderful for doing basic
markup, though, which coincidentally is what it was designed for initially.
Who would have thought?


>> Most of the "simple" things either don't scale to cover the world, or don't
>> actually specify enough that you end up with crazy, crazy things.  (STOMP,
>> I am lookin' right at you, here.)
> So what are you saying, that I'm crazy, crazy?
> Which of my things are 'crazy, crazy'?

Ah, no.  Sorry.  I was absolutely not calling you crazy, and I am sorry that
I wasn't clear about that.

No, I was calling the situation that grew up around STOMP crazy: because the
specification was so loose, and poor, you end up with a whole lot of versions
that don't work together, and all sorts of conventions you need to understand
to make it work that are not in the "spec", but are in most real-world
implementations.

At that point you no longer have a *simple* messaging protocol, but a
crazy mess full of work-arounds and other nasty stuff.


> I think a data format which can be produced and parsed in say 10 lines of
> code, and is simple, clear and general, such a format is a lot less crazy
> than the crock of complexity and featuritis which is full-blown XML.

Almost certainly.  The trick is getting everyone who works with that data to
agree on the *same* ten lines of code, and their interpretation. ;)
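
As a rough illustration of how that goes wrong, here are two hypothetical
parsers (`parse_a` and `parse_b` are invented names, and neither is anyone's
actual format) that each read "Key: value" lines in about ten lines of
Python.  They differ only in choices the informal description never made:
key case, duplicate keys, and indented continuation lines.

```python
def parse_a(text):
    """Keys kept as written, last duplicate wins, continuation lines ignored."""
    record = {}
    for line in text.splitlines():
        if ":" in line and not line[0].isspace():
            key, value = line.split(":", 1)
            record[key] = value.strip()
    return record


def parse_b(text):
    """Keys lowercased, first duplicate wins, continuation lines joined."""
    record, last = {}, None
    for line in text.splitlines():
        if line[:1].isspace() and last is not None:
            # Indented line: treat it as a continuation of the previous value.
            record[last] += " " + line.strip()
        elif ":" in line:
            key, value = line.split(":", 1)
            key = key.strip().lower()
            if key in record:
                last = None  # ignore a repeated key and its continuations
            else:
                record[key] = value.strip()
                last = key
    return record


# Made-up input that both parsers accept without complaint.
SAMPLE = "Name: Sam\nname: Daniel\nNote: first line\n second line\n"

print(parse_a(SAMPLE))
print(parse_b(SAMPLE))
```

Both accept the same input and neither reports an error, yet they return
different records.  That is the interoperability problem in miniature.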

✣ Daniel Pittman            ✉ daniel at rimspace.net            ☎ +61 401 155 707
               ♽ made with 100 percent post-consumer electrons
