Expway's Position Paper on Binary Infosets

Robin Berjon, Expway <robin.berjon@expway.fr>
date: 2003-08-07

This document summarizes part of Expway's several years' work on binary infosets, covering requirements from customers and industrial consortia, a technical overview of the solutions that were provided to address them, and general information on the results that were obtained. The intent is less to describe a specific technology than to report on generic solutions in this domain and to provide leads on how a single binary infosets framework can address different needs expressed by a wide variety of unrelated users.

As such, this document attaches itself more to a description of the landscape and how its pieces fit together than to the results obtained using Expway or Expway-related products, be they BiM, BinXML, or other experimental formats we may have tested. However, benchmarks of Expway's latest tools on a large variety of grammars will be made available to the workshop as an annex to this document.

Considered Requirements

One of the more frequent questions that comes up when discussing binary infosets is that of whether a single solution can operate efficiently on a vast and uneven set of requirements, from GUI applications using document-oriented vocabularies such as SVG on unidirectional networks, to high-performance data intensive SOAP messages sent over broadband LANs (to take slightly caricatural examples). Indeed, the domains in which XML is used cover so much ground, and the situations into which its adopters wish to bring as much of its value as possible are so diverse, that the variety of the requirements being expressed with regards to binary infosets is larger than most expect.

Rather than trying to eliminate some of these requirements, it is our belief that they can all be addressed within a single framework. Moreover, addressing only a smaller set of requirements would lead the parties left with too many requests unanswered to invent yet more ad hoc solutions, at high cost and low interoperability. The following list of requirements is a direct translation of what, as a company specializing in binary infosets, we have collected as items found desirable in a binary infosets format. It is likely that most uses, taken individually, will only need a subset of these features. But looking at how the XML landscape is shaping up, it seems foolish to consider vocabularies in isolation; in fact, in the same way that XML blurs the frontiers between domains, binary infosets should provide sufficient genericity that users keep the freedom and flexibility they find in XML while using a different format.

"Doctor, it hurts when I do this"

There used to be a time when people would complain about trying to do this or that in XML, and finding that it was too big, not fast enough, not fragmentable, not amenable to binary payloads, etc. (pick any subset) and the natural answer would be "well, simply don't use XML then".

Things have changed since then. XML benefits from an omnipresence that carries its own benefits in quality tools, easy integration, or developer availability. For every single requirement presented here there is the "but why would you use XML" answer. For every single one of them, there are motivated decisions based on the fact that the benefits that XML brings to most workflows are great enough that if XML breaks down for a small part of the system, the cost of producing a solution to address the problems of those few parts is worth considering.

Size

Size is probably the most cited reason for work on different serialisations of the Infoset. Often enough, this is just the gut reaction of a developer confronted with XML for the first time, and it is shown to be of little concern in many cases. However, there are a number of situations in which the concerns are real, based on genuine experience.

In such situations, the first solutions to come forth are generic compression approaches such as gzip or bzip2. When size is the only problem — a rare situation — those methods usually address it easily and efficiently. However, generic compression normally gets in the way of speed (since an extra step is added) or fragmentability. Sometimes (though rarely), one also wants to squeeze out every single bit even if other features are not needed.

Expway has found that a variety of methods, such as schema-based encodings (sometimes used in conjunction with vocabulary-specific codecs, sometimes coupled with generic compression), perform much better than generic compression where size is concerned. The compression levels obtained on the structure of an XML document depend on the predictability of its structure, but obtaining more than 85% compression is far from rare. In some extreme cases (mostly of SOAP or SOAP-style messages) much higher compression ratios have been seen. Type-specific codecs have also been shown to produce excellent results.
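To make the size argument concrete, here is a deliberately naive Python sketch (not an Expway codec): when the grammar of a document is known to both sides, a schema-based encoder only needs to transmit the unpredictable character data, and the result can itself still be fed through generic compression.

```python
import gzip

# A toy document with fully predictable structure but varied character data.
records = "".join(f"<r><id>{i}</id><v>item{i}</v></r>" for i in range(200))
xml = ("<records>" + records + "</records>").encode("utf-8")

# Generic compression of the full text, structure included.
gzipped = gzip.compress(xml)

# A hypothetical schema-based encoder: the element structure is entirely
# predictable from the shared grammar, so only character data is sent.
char_data = "".join(f"{i}item{i}" for i in range(200)).encode("utf-8")

# The two approaches can also be combined, as noted above.
combined = gzip.compress(char_data)

print(len(xml), len(gzipped), len(char_data), len(combined))
```

The numbers are only illustrative, but they show the principle: the structural bytes vanish entirely once both sides share the grammar.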

Speed

The truth is that most of the requests for size reduction are really indirect requests for greater speed. Apart from pay-per-byte networks, most people wouldn't care if the data sent to them weighed several gigabytes if it displayed in under a second on their cell phone, or gave excellent performance to their EAI system.

Size reduction is an important part of increasing speed as it will reduce bandwidth issues and time spent in I/O. However it is not the only one, and in fact a host of elements can contribute to speed-ups. These include format complexity (never knowing what the next byte will be), conversions between internal binary types (such as integers) and their lexical representations, character encoding transcoding, or the ability to skip parts of a document not used by the application.
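The point about lexical-to-native conversions can be shown in a few lines of Python: an integer stored in a fixed-width native form is both smaller than its decimal text form and free of the parse step that text requires (endianness must still be fixed by the format, hence the explicit big-endian marker).

```python
import struct

value = 123456789
lexical = str(value).encode("ascii")   # "123456789": 9 bytes that must be parsed
native = struct.pack(">i", value)      # 4 bytes, big-endian, no parse step

assert len(native) < len(lexical)
assert struct.unpack(">i", native)[0] == value
```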

While it is difficult to iron out all of the potential friction in any format (notably with regards to endianness or text transcoding), it has been Expway's (and many others') experience that a binary format can provide for much faster encoding and decoding than XML.

Also worthy of note is the fact that on some devices, when considering speed issues it is important not to simply look at CPU power alone. Some battery-powered devices frequently change their CPU speed to match the amount of work they need to perform. A CPU may thus be able to perform a given parsing/decoding task at acceptable speeds when used at its maximum power, but it will burn more battery. Since it would seem that batteries aren't subject to Moore's law, it could well be that while handheld devices may keep getting more powerful, it will still be a requirement to use as little CPU power as possible in order to maintain battery life.

Random Access & Dynamic Update

Parts of what this section covers are often filed under "streaming", a convenient term that means altogether too many things and has thus become a touch confusing when left unqualified. We tend to prefer using more descriptive terms as they are more likely to be understood correctly.

Random access describes the ability to jump to any part of a document, possibly using typical XML access methods such as XPath or the DOM, without having to read the rest of it (or, depending on the solution, at least with the guarantee that only minimal reading of the other parts will be needed to skip them).

Random access has many advantages since it makes it possible to ignore unwanted data in an XML document. The applications are many, including indexing, SOAP intermediaries reading only some headers, lazy DOMs, XML databases, and more.
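One common way to obtain this skippability (sketched below in Python as a hypothetical wire layout, not any specific format) is to length-prefix every fragment, so a reader can jump over subtrees it does not care about without parsing them:

```python
import struct

def encode_fragment(name: bytes, payload: bytes) -> bytes:
    # Hypothetical layout: [name-len][name][payload-len][payload].
    return (struct.pack(">H", len(name)) + name
            + struct.pack(">I", len(payload)) + payload)

def skip_to(stream: bytes, wanted: bytes) -> bytes:
    # Walk the fragments, skipping the payloads of unwanted ones unread.
    pos = 0
    while pos < len(stream):
        (nlen,) = struct.unpack_from(">H", stream, pos); pos += 2
        name = stream[pos:pos + nlen]; pos += nlen
        (plen,) = struct.unpack_from(">I", stream, pos); pos += 4
        if name == wanted:
            return stream[pos:pos + plen]
        pos += plen  # jump over the payload: this is the random-access step
    raise KeyError(wanted)

doc = encode_fragment(b"header", b"h" * 1000) + encode_fragment(b"body", b"hello")
print(skip_to(doc, b"body"))  # → b'hello'
```

A SOAP intermediary reading only some headers, or a lazy DOM, does essentially this at scale.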

Dynamic update is the ability to update part of a document without needing to send more than the parts that changed. Such a feature is of vital importance in unidirectional broadcasting systems, which do not receive requests from clients but instead carousel data continuously so that the required parts of the document are available regardless of when the client started listening to the channel, while still making it possible for more important information to be prioritised over less important parts such as eye candy. XML streams that need to be synchronised with time-based media benefit greatly from this feature, which effectively allows clients to start reading a document at arbitrary points, as happens with AV streams. Known uses include Electronic Program Guides, video metadata, SVG slides, and timed text.

Dynamic update can also be used for XML client-side applications in low-bandwidth situations so as to refresh only part of an application's interface when an action occurs. This allows for a richer user experience at much smaller a cost.

The reason these two requirements have been bundled together is that they both rely on the document being readily fragmentable, in extreme cases down to the single-node level. In turn, fragmentability imposes strong constraints on the immediate availability of the context in which a node exists inside a document. Some of these constraints should normally be internal to the format — the namespace of each element and attribute should be present without having to resolve it through in-scope declarations — while others may be better left to an indexing system, typically those corresponding to the optimization of generic queries (e.g. full ancestry context for XPath queries).

Extensibility

While structure and fairly simple text content can easily be described by a variety of schema languages, there are cases in which increased compression and speed can be derived from encodings that are specific to certain non-XML textual structures found in some vocabularies. The archetypal example of this is SVG path data, which is structured information that can't efficiently be described by existing schema languages (though were they more powerful it could be).

Per-vocabulary extensions are frequently desirable for a variety of reasons. Expway has implemented a system that allows codecs to encode specific types in manners more appropriate than the defaults. In the case of SVG, we have tested lossy compression applied to paths where the precision of the data was much higher than what could possibly be rendered by the target device.
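As a hypothetical illustration of such a lossy path codec (not Expway's actual implementation), coordinates can be snapped to a grid matched to the target device's resolution, after which the small grid indices compress far better than full-precision decimals:

```python
def quantise(coords, step=0.5):
    # Snap each coordinate to the nearest multiple of `step`; below the
    # device's rendering resolution, the loss is invisible.
    return [round(c / step) * step for c in coords]

path = [10.03, 20.49, 30.77]
print(quantise(path))  # [10.0, 20.5, 31.0]

# The corresponding grid indices are small integers, cheap to encode.
indices = [round(c / 0.5) for c in path]
assert indices == [20, 41, 62]
```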

In some situations, it can be a good idea to gzip-compress the textual content of a document in addition to the structural compression provided by the format to obtain excellent total compression. This of course needs to be weighed against other needs such as speed.

While the greater freedom provided by pluggable codecs is often listed as a top feature requested by users, it is a feature to be used wisely as it brings forth potential interoperability issues. However, we firmly believe that there are several ways in which these interoperability issues can be addressed so that vocabularies in which codecs are desirable may use them and remain interoperable. Expway is currently investigating multiple options, covering HTTP content negotiation (which unfortunately may require modification to HTTP content codings so that they support parameters the way that content types do), extended content negotiation using media queries derived from the ones used in CSS, CC/PP, and more powerful facilities to describe textual micro-structures from within schema languages. Given correctly defined interoperable behaviours of encoders and decoders, constructing a solution that is both simple and interoperable should be straightforward.

Packaging & Binary Payloads

One domain of XML activities that continuously resurfaces is that of packaging, recently more often in the guise of a simpler subset of general XML packaging: the ability to include binary payloads in XML documents, or in the XML Infoset. Work on MTOM is one example, and the many uses of base64-encoded binary files another (for instance, SVG mandates support of the data: URI scheme, used to inline base64 data in a URI).

The ways in which binary infosets can help integrate XML with binary payloads are rather obvious. Since there need be no constraint on the content, binary data can be inserted directly into a binary infoset stream (this may be supported in an ad hoc manner by the format, or through its understanding of XML Schema types). Expway has experimented with replacing data: URIs with their binary content, and making that binary content directly available through its low-level API. The gains in compression and speed are naturally quite noticeable.
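The data: URI replacement we experimented with amounts to the following (sketched here in Python with a made-up payload): strip the base64 wrapper and carry the raw bytes, undoing the roughly 4/3 inflation that base64 imposes.

```python
import base64

payload = b"\x89PNG fake image bytes"
uri = "data:image/png;base64," + base64.b64encode(payload).decode("ascii")

# The encoder detects the data: URI and stores the decoded bytes instead.
header, b64 = uri.split(",", 1)
raw = base64.b64decode(b64)

assert raw == payload
assert len(raw) < len(b64)  # base64 inflates binary by roughly one third
```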

In addition to that is a different, more complex problem: that of including XML documents in another packaging format (which may itself be an XML packaging format). Typical packaging will require features such as streaming (in the sense that the content needs to be usable before it is downloaded completely), compression, and interleaving/multiplexing (the ability to send fragments of the packaged documents in any order within the same stream). A typical expression of those requirements can be found in this post to the SVG WG from a few months back [member only]. Compression is addressed by the compaction of binary infoset formats, streamability by dynamic update, and interleaving by the ability of binary infosets to be easily fragmented and to carry arbitrary binary payloads. This allows one to create a single feed integrating multiple media and XML fragments such that they are guaranteed to be delivered in a synchronised manner. Many have stated that it would be tremendously useful for SMIL, SVG, printing, and various multimedia documents that would be simplified by this approach.

Schema change resilience

In this section I focus more closely on BiM, since it covers features that I have not seen available in other solutions, but that I consider to be of high importance.

The BiM format originates in the MPEG world, a setting that imposed some heavy constraints on its inception. One of those requirements was to be resilient to changes in the schema that originated the format of the binary infoset, as it is unrealistic to upgrade or physically replace some types of terminal.

To address this concern, a BiM decoder uses its skipping feature to process only the parts of a stream that it understands, ignoring newer or older schemata. Naturally, this feature comes at the cost of a larger bitstream, and it is thus not advisable to change one's schema continuously (if it ever is). This feature is one that, unsurprisingly, has been considered essential in large-scale production systems that put terminals in consumers' hands.

In addition to that, BiM has the ability to transmit a lightweight version of the schema it uses for encoding. It is thus possible to transmit new versions of a schema (or even radically different ones) to such generic decoders that will thus become upgradable at low cost. This facility opens the door to self-describing binary infosets, when such a feature is deemed necessary.

Accessibility & I18N

While it may seem somewhat daring to lump two considerations as large as these into the same section, it is done because, where binary infosets are concerned, they have the same problems and the same solutions. Looking at a binary infoset from the "text editor" point of view, it will be an opaque bunch of bytes. The reason for this is evident: in order to gain its new features, a binary infoset format sacrifices human readability. Seen from that angle, it makes little sense to speculate on the accessibility or internationalisation of an opaque sequence of bytes. There is no way to assess whether it follows good I18N practices since what text may be discernible carries no expression of its encoding, and it can't possibly comply with, for instance, the XAG.

However, the restoration of such features is no more than one level of indirection away: read into an Infoset, the content of the document maintains all the I18N features of XML, and will be just as accessible as the original vocabulary was. In truth, these issues are no more present in the case of binary infosets than in that of the usage of a content coding that is not available to the end user (as for instance bzip might not be). If however the software required for the decoding is freely available to all, then given that binary infosets can be converted to XML no new issues are introduced.

It needs to be stressed, however, that some systems would benefit greatly from the ability to use XML-related technology and its greater accessibility and internationalisation, but currently can't due to limitations of size or issues with dynamic updates that preclude streaming XML alongside audio or video feeds. The University of Innsbruck, Austria, has been investigating ways in which to stream subtitles and XHTML or SVG slides alongside audio or video lectures. Their system needs to be fully internationalised and must be as accessible as possible, since one of its goals is to bring e-learning classes to people who may not be able to leave home, in as many countries as possible. Given the existing constraints on bandwidth and the fact that it must be possible to jump from place to place in the audio or video stream, they are considering the use of binary infosets.

Technical Considerations

This section aims to provide some considerations derived from research and implementation experience which we believe may be of interest to the binary infosets community in general, and to workshop participants in particular.

Schema-based vs. Token/Dictionary-based

Dictionary-based encodings use a table of common tokens (usually element and attribute names) in order to encode the infoset using shorter integers in place of the names. Schema-based encodings use the grammar described in a schema to encode the infoset as a series of grammatical items, skipping the ones that are predictable.

There has been a lot of discussion as to which is the best, with arguments frequently pitting simplicity against size, or issues regarding namespaces or streamability. Within the schema-based encodings, there have been the expected (and occasionally heated) arguments about which schema language is best (of XML Schema, RelaxNG, or ASN.1 — DTDs are rarely considered anymore, and to the best of my knowledge no one has yet attempted to use Schematron for these purposes).

We believe however that these disputes need not occur. To begin with, dictionary-based encodings can be seen as a special case of schema-based encodings. To use XML Schema terminology, they are equivalent to a schema in which all elements are global, may contain any number of themselves, where all attributes are available on all elements, and where character data may appear anywhere. That is, a dictionary defines a loose schema.
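Seen this way, a dictionary-based encoder is just a schema-based encoder driven by a maximally permissive grammar. A minimal Python sketch (one-byte tokens, assuming fewer than 256 distinct names):

```python
# The dictionary *is* the loose schema: every name in the table is a
# "global element" that may appear anywhere, in any order.
names = ["svg", "g", "path", "rect"]
token = {n: i for i, n in enumerate(names)}

def encode(events):
    # events: element names in document order
    return bytes(token[n] for n in events)

def decode(data):
    return [names[b] for b in data]

stream = encode(["svg", "g", "path", "path"])
assert len(stream) == 4  # one byte per name instead of the full text
assert decode(stream) == ["svg", "g", "path", "path"]
```

A stricter grammar merely adds constraints on which tokens may follow which, letting predictable ones be omitted entirely.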

In addition to that, most schema languages define features that are not used by the binary infoset encodings that rely on them. This is an issue because it imposes greater complexity on implementations than ought to be necessary. A common model expressing the expected structure of a vocabulary, onto which a variety of schema languages could be mapped, should not be difficult to describe — something along the lines of what remains of RelaxNG after the simplification pass should be very close to the actual needs of binary infosets.

Given that, it should be possible to map XML Schema, ASN.1, dictionaries, and RelaxNG onto such a generic grammar in order to generate encoders and decoders. Having a well-defined and consistent structure also means that a format can easily be devised to transmit the generic grammar or store it alongside the encoded content so that it is self-describing, as well as to allow decoder updates in the event of schema evolution.

Such an approach would satisfy the needs of a larger community with a single specification, and simplify the implementations in many cases.

Extensibility

As explained in the section on requirements, there is a call for extensibility in order to better address the needs of specific vocabularies. It is our belief that this is the single most difficult problem to solve in the binary infosets space, but that solutions can be reached.

We see several different approaches to this issue, the goal of which is to maintain interoperability while still allowing more powerful encoding of textual types.

The first and most obvious one is based on negotiation and capability mechanisms. These naturally include HTTP content negotiation, but may in fact be better solved by other tools such as CC/PP, or perhaps something resembling CSS Media Queries.

The second one relies on the W3C designing an extensibility framework to replace the unpleasant recourse to plugins that currently dominates the Web. Such a framework would be able to deal with the intermixing of generic and specific encodings in an interoperable way.

The third and final one (of those considered so far) involves ways to reproduce structure in a generic manner at the character data level. Expway has done research in this field, using XML Schema simple types composited with choice and sequence to generate parsers that can extract structure from things ranging from simple CSS values to SVG path data. The same method could be used to define matrices based on simple lists, as has been requested for instance by MPEG-7 and X3D. While still experimental, this approach has already been shown to be workable in many cases, and to be less complex than expected. It has the advantage of allowing one to describe elaborate micro-structures and thus to encode them efficiently.
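As a rough sketch of the idea (using a hypothetical "M x y / L x y" subset of SVG path data, not Expway's actual codec), a micro-structure grammar lets the encoder parse the text into typed fields and store each one natively:

```python
import re
import struct

def parse_moveto_lineto(text):
    # Tokenise a tiny subset of SVG path data: commands and numbers.
    tokens = re.findall(r"[MLml]|-?\d+(?:\.\d+)?", text)
    out = []
    while tokens:
        cmd, x, y = tokens[:3]
        tokens = tokens[3:]
        out.append((cmd, float(x), float(y)))
    return out

def encode(segments):
    # One command byte plus two 4-byte floats per segment, instead of text.
    return b"".join(cmd.encode() + struct.pack(">ff", x, y)
                    for cmd, x, y in segments)

segs = parse_moveto_lineto("M 10 20 L 30.5 40")
assert segs == [("M", 10.0, 20.0), ("L", 30.5, 40.0)]
assert len(encode(segs)) == 18  # 2 segments × (1 + 4 + 4) bytes
```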

Arbitrary XML

In most cases, users know in advance the full grammar that they will be using within their XML application. However, since XML is extremely composable, even in cases where the schema is fully known one feels naturally more at ease knowing that XML's flexibility is respected in a binary infoset.

And for a smaller, but nonetheless important and certainly substantial, set of cases, using arbitrary namespaces in a document is customary, and even recommended. Vocabularies typical of this situation include, for instance, NewsML and SVG.

As a result, based both on user requests and on adherence to "XMLness", we consider it unacceptable not to support arbitrary XML where allowed by a schema (using wildcards), with as many as possible of the features available when the schema is known. To this effect, we have devised a straightforward way to encode arbitrary XML that reuses the format used when a schema is available. Naturally, it provides lesser — but still acceptable — compression and stores all character data as text, while maintaining all the other features such as speed and fragmentability. It also keeps track of declared namespace prefixes in order not to corrupt QNames that may appear in schema-less parts of the document.
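The schema-less fallback can be sketched as a self-priming dictionary (a simplification of the actual mechanism): the first occurrence of an unknown name travels inline and defines a token, and later occurrences use the token alone, so arbitrary XML stays compact without a pre-agreed grammar.

```python
def encode(events):
    # events: element names in document order; unknown names are written
    # inline ("def") and assigned the next free token, reused thereafter.
    table, out = {}, []
    for name in events:
        if name in table:
            out.append(("ref", table[name]))
        else:
            table[name] = len(table)
            out.append(("def", name))
    return out

stream = encode(["x:note", "x:body", "x:note"])
assert stream == [("def", "x:note"), ("def", "x:body"), ("ref", 0)]
```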

APIs for Varying Needs

A large part of the speed and size gains is obtained through the use of various datatypes stored in their native (or closer-to-native) representations. On the other hand, much of the interoperability (with XML applications) found in binary infosets comes from using the same APIs (mostly SAX and DOM). This creates a mismatch: translating the native representations to text only to have them translated back to native form again is obviously slower than the original text-to-native translation.

To solve this quandary Expway provides both the usual XML APIs and a specific low-level API that does not translate native representations to text. Whether there is a need to standardize on a common API, or whether it will surface if it is needed the way that SAX did (and probably based on SAX, or on a pull API, if it does) is an open question. I believe that question shouldn't be addressed before a format is specified and more experience has been gathered.

It is probably worth noting that specifications such as MTOM that put binary data in the Infoset may, too, require some reflection in this area.

Domains and Vocabularies

Over the past years, Expway has tested and refined its solutions on a varied range of domains and vocabularies. This section intends to give a short overview of what has been implemented and researched.

Multimedia & Related Technologies

Since BiM originates in MPEG-7, it is only natural that multimedia technologies have been given close consideration. Multimedia is an area that will typically expand to use as much of the available resources as possible, stopping only when it meets a barrier, often either in processing power or in bandwidth.

As such, any solution which reduces the resource consumption of any part of the system is much welcome as it directly increases the amount of things that may be done. We have consequently explored the application of our technology to such vocabularies as SVG, SMIL, or XHTML. This has led us to investigate domains relating to mobile applications, as well as to broadcasting.

And since in multimedia, content is at least as important as presentation, binary infosets have also been applied to such things as EPGs (Electronic Program Guides) within the realms of the TV-Anytime Forum and DVB, so as to make large amounts of data describing their many channels available to end-users, and thus to significantly enhance their experience.

Web Services

The performance of SOAP being a frequent complaint, it was only natural that Web Services were also studied closely. To this effect we have tested prototypes (relying on WSDL to provide the encoding information) to show that binary infosets work unsurprisingly well on such vocabularies.

This eventually led to participation in TV-Anytime's bi-directional services specification, which relies on BiM to efficiently encode the SOAP messages that terminals exchange with the EPG server.

Metadata & RDF

Since MPEG-7 is a metadata oriented specification, we have looked into applying our technology to RDF. However, we have found that working at the Infoset level is of little direct interest to RDF, where working at the RDF model level would be much more interesting.

This avenue of research is currently on stand-by, but we are keeping an eye on that area as we have received positive feedback from RDF users asking for a common binary format that is fast, easily fragmentable, and easy to index (preferably in ways that are harder with such storage methods as RDBMSs). Current ongoing work from various parties on addressing the mismatch between trees and graphs may unlock possibilities for integration in that domain, and we will be glad to explore solutions in that space again.

Security

While we haven't yet built a prototype in this area, we are considering doing so in the near future. The way our format works, a BiM document can be considered a canonicalisation of the Infoset it encodes (taking into account XML Schema types in addition to vanilla XML C14N). It can thus be digitally signed as is, in a way that is guaranteed to be consistent. Since it is readily fragmentable, any subtree can itself be signed, down to the granularity of a single node.
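In Python terms the idea reduces to the following (with HMAC-SHA256 standing in for a real XML-DSig profile, and the fragment bytes being made up): because the encoded subtree is already a canonical byte sequence, signing and verification need no separate canonicalisation pass.

```python
import hashlib
import hmac

key = b"shared-secret"
fragment = b"\x02\x00\x01hello"  # hypothetical encoded subtree

signature = hmac.new(key, fragment, hashlib.sha256).hexdigest()

# A recipient re-encoding the same infoset fragment obtains the same
# bytes, so the signature verifies without touching the text form.
assert hmac.new(key, fragment, hashlib.sha256).hexdigest() == signature
```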

Miscellaneous

It is impossible, within the bounds of this document, to provide a complete overview of what has been explored, but a number of important vocabularies, such as GML, were tested with positive results. Some vocabularies tend to exercise some parts of the format more than others, but we believe we have tested a sufficient number of applications for a large part of our findings to be conclusive.

Position and Conclusions

It is Expway's position that the W3C should form a Working Group to pursue work in this area. The motivations for this position are many, a few of which are summarized here:

The Web as a whole
The W3C is the only consortium that considers the Web as a whole and addresses problems concerning all of its facets, as opposed to producing work on fragmented parts alone that is later less applicable to the larger picture.
As such, it is more likely to consider requirements such as accessibility or internationalisation, or the needs of domains, such as the mobile or broadcast industries, that have only started their integration into the Web but will soon be pulled completely into it.
Integration with Web technology
In addition to Web requirements, the body of Web technologies is large and growing. Producing a format that integrates cleanly with as many of them as possible (and preferably all) requires input from those that know them and those that create them. When producing a format that applies equally well to such a wide range of technologies, spanning from Web Services to SVG through XHTML and video metadata, it is altogether too tempting to consider only a subset of the whole and declare victory. If various vocabularies define their own binary formats in isolation from one another, they will lose the ability they currently have to be used together, an ability that matters more and more as integration increases.
Even a generic encoding of the XML Infoset, without input from the great variety of those that use it in practice, is more than likely to turn a blind eye to the more specific, yet addressable, needs of a given domain.
Finally, the benefits that binary infosets may provide extend beyond their simple "application" to a set of vocabularies, as they facilitate such long-standing issues as fragmenting and packaging. It seems to us that those benefits will be much more easily reaped by the W3C.
Openness
A technology covering such a broad range of applications ought to be guaranteed to be as open as possible. This includes both freedom from encumbrance and openness of the process so that interested parties may recognise their needs during the development of the specification and join in, and so that the Open Source community may get to work on implementations as early as possible, thus providing valuable feedback.
Balkanization
The time to produce a global standard in this domain is now. Much fragmentation has already occurred, with vertical or quasi-vertical industry consortia and companies developing local solutions that solve problems predominantly in their own space. Given the increasing webization of some domains, such as mobile devices, if no steps are taken now to ensure interoperability it is increasingly likely that one solution will impose itself de facto. If it doesn't integrate well with other Web technologies, and work is only started then, it will be much harder to compete with it.