DAP4: Encoding for the Data Response

From OPeNDAP Documentation

Revision as of 18:58, 6 June 2012


Background

There are two different approaches to deserializing the data received by a DAP client: the client may process the data as it is received (i.e., eager evaluation) or it may write those data to a store and process them after the fact (lazy evaluation). A variant of these techniques is to process the data and also write it to a store, presumably because the initial processing steps are useful while having the data stored for later processing enables still other uses. However, in this document I'm not going to look at the latter case because experience so far with DAP2 has not provided any indication that it would present performance benefits. We do have example clients that use both eager and lazy evaluation.

HTTP/1.1 defines a chunked transfer coding. In the past we have spent a fair amount of time on the notion of chunking as a way to achieve reliable transmission of errors when those errors are encountered during response formulation; that topic won't be addressed in this document. Instead, this document will assume that the entire response described here is chunked in a way that enables reliable transmission of errors. The details of that transfer encoding will be described elsewhere.

Problem addressed

There is a need to move information from the server to a client. The way this is done should facilitate many different designs for both server and client components.

Assumptions:

  1. Since DAP is so closely tied to the web and HTTP, its design is dominated by that protocol's characteristics.
  2. Processing on either the client or the server is an order of magnitude faster than network transmission.
  3. Server memory should be conserved with favor given to a design that does not require storage of large parts of a response before it is transmitted (but large is a relative term).
  4. Clients are hard to write and the existence of a plentiful supply of high-quality clients is important (of course, servers are hard to write, too, but there are between 5 and 10 times as many DAP2 clients as servers).
  5. DAP4 is designed to work over transports that are inherently serial in nature.
  6. The response does not explicitly support a real-time stream of data (e.g., a temperature sensor which is a data structure of essentially infinite size). It may, however, be the case that the response can encode that kind of information.

Broad issues:

  1. It should be fast
  2. It should be simple
  3. It should be part of the web

Proposed solution

The DataDDX response document will use the multipart-mime standard. A DataDDX response is the server's answer to a request for data from a client. Each such request must either include a Constraint Expression enumerating the variables requested or a null CE that is taken to mean 'return the entire dataset.' A response will consist of two parts:

  1. A DDX that has no attribute information and contains (only) the variables requested; and
  2. A binary part that contains the data for those variables

Thus, while the response uses the multipart-mime standard, there are only two parts - the DDX containing variable names and types and the binary BLOB containing data.
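As an illustration of that two-part layout, here is a minimal sketch in Python. The boundary string and the Content-Id values are placeholders of my own, not normative; only the overall shape (headers, DDX part, binary part, closing boundary) follows the description above.

```python
# Sketch of assembling the two-part DataDDX response described above.
# Boundary and Content-Id values are illustrative placeholders.
def build_dataddx(ddx_xml: bytes, blob: bytes,
                  boundary: str = "dataddx-boundary",
                  start_id: str = "<ddx@example>") -> bytes:
    crlf = b"\r\n"
    b = boundary.encode()
    parts = [
        b'Content-Type: multipart/related; type="text/xml"; '
        b'start="' + start_id.encode() + b'"; boundary="' + b + b'"',
        b"",
        b"--" + b,
        b"Content-Type: text/xml; charset=UTF-8",
        b"Content-Transfer-Encoding: binary",
        b"Content-Description: ddx",
        b"Content-Id: " + start_id.encode(),
        b"",
        ddx_xml,                                  # the DDX part
        b"--" + b,
        b"Content-Type: application/octet-stream",
        b"Content-Transfer-Encoding: binary",
        b"Content-Description: data",
        b"Content-Id: <data@example>",
        b"Content-Length: " + str(len(blob)).encode(),
        b"",
        blob,                                     # the binary BLOB part
        b"--" + b + b"--",                        # closing boundary
    ]
    return crlf.join(parts)

msg = build_dataddx(b'<Dataset name="foo"/>', b"\x00\x00\x00\x07\r\n")
```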

Structure of the binary part

Data in the 'binary part' will be serialized in the order of the variables listed in the DDX part. Essentially this is the serialization form of DAP2.

Serialization of varying-sized variables

There are several kinds of varying data:

  1. Strings
    • String s;
  2. Array variables that vary in size
    • Int32 i[*];
    • Float64 j[10][*];
  3. Structure variables with varying dimensions and Sequence variables
    • Structure { int32 i; int32 j[10]; } thing[*];
    • Sequence { int32 i; int32 j[10]; } thing2;
  4. Structure variables that have a varying dimension and one or more fields that vary
    • Structure { int32 i[*]; int32 j[10][*]; } thing[*];

Note that there is no practical difference between a (character) String and an integer or floating point array with varying size except that the type of elements differ. Thus, the issues associated with encoding Int32 i[*] are really no different than encoding the String type. This same logic can be extended to a varying array of Structure (with fixed size fields). A varying array of Structure that contains one or more varying fields requires that the fields and the enclosing Structure be specially encoded. Note that a Sequence and a Structure with a varying dimension present essentially the same encoding issues.

Assumptions:

  1. String: it is assumed that the server will know (or can determine without undue cost) the length of the String at the time serialization begins.
  2. Sequence and Structure { ... } [*]: it is assumed that the total size may be considerable and not known at the time serialization begins.

General serialization rules

Narrative form:

  1. Fixed size types, including arrays: Serialized by their data, then a CRLF pair
  2. Strings (and equivalent types): Serialized by writing their size as an N-bit integer, then the character data, then a CRLF pair
  3. Scalar Structures (which may have String/varying fields): Each field is iteratively serialized, then a CRLF pair.
  4. Structures with fixed size dimensions (which may have String/varying fields): Each array element is serialized as if it were a Scalar Structure. A CRLF pair terminates the entire Structure (not each instance, although from the recursive nature of this process, each field will have an ending CRLF pair).
  5. Structures with varying dimensions (which may ...): For each instance along a varying dimension, a Start of Instance marker is written, then the fields are serialized (as is the case for a Scalar Structure or Array of Structures). A CRLF pair terminates the entire Structure.
  6. Sequences are serialized row by row: a Start of Instance marker is written, then each field of the row is serialized; this repeats until the last row of the Sequence has been serialized, then a CRLF pair terminates the Sequence
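Rules 1, 2, and 6 can be sketched in Python as follows. This is only an illustration under assumptions the rules leave open: the length prefix is taken to be 32 bits, the Start of Instance marker is taken to be a single byte of my own choosing, and big-endian order is used throughout.

```python
import struct

CRLF = b"\r\n"
SOI = b"\x5a"   # hypothetical Start of Instance byte; the real marker is unspecified here

def serialize_fixed_array(values):
    """Rule 1: a fixed-size Int32 array is its raw (big-endian) data, then a CRLF pair."""
    return b"".join(struct.pack(">i", v) for v in values) + CRLF

def serialize_string(s: bytes):
    """Rule 2: size as an N-bit integer (32 bits assumed), the data, then a CRLF pair."""
    return struct.pack(">I", len(s)) + s + CRLF

def serialize_sequence(rows, row_fn):
    """Rule 6: an SOI marker before each row's fields; one CRLF ends the whole Sequence."""
    return b"".join(SOI + row_fn(row) for row in rows) + CRLF
```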

Notes:

  1. How to encode the checksum information? - By field for Structures but for an entire Sequence
  2. The CRLF pairs simplify ensuring that implementations are correctly encoding the binary data
  3. This will use receiver-makes-right and thus needs a header (big-endian - reference the IETF std) to convey that
  4. Sequences cannot contain child Sequences (i.e., we are not allowing 'nested sequences') in DAP4
  5. Both an array [20][10] and [2][10][10] will be serialized as 200 sequential elements, but array[*][*] and array[*][*][*] must include markers that indicate where the different elements lie.
  6. This set of serialization rules can be modified slightly to support the case where fixed and varying size data are separated into different parts of a multipart-mime document.
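Note 3's receiver-makes-right scheme might look like the following sketch, in which the sender records its native byte order in a header and never swaps, and the receiver converts only when the orders differ. The one-byte flag and its values are my assumptions, not the header the note refers to.

```python
import struct
import sys

# Hypothetical receiver-makes-right byte-order flag (not the normative header).
BIG, LITTLE = b"\x00", b"\x01"

def write_header():
    """Sender: record native byte order; the sender itself never byte-swaps."""
    return BIG if sys.byteorder == "big" else LITTLE

def read_int32(flag: bytes, raw: bytes) -> int:
    """Receiver: interpret the sender's data in the sender's declared order."""
    fmt = ">i" if flag == BIG else "<i"
    return struct.unpack(fmt, raw)[0]
```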

Examples

A single scalar

Dataset {
    Int32 x;
} foo;

(NB: Some poetic license used in the following)

Content-Type: multipart/related; type="text/xml"; start="<<start id>>";  boundary="<<boundary>>"
 
--<<boundary>>
Content-Type: text/xml; charset=UTF-8
Content-Transfer-Encoding: binary
Content-Description: ddx
Content-Id: <<start-id>>

    <<DDX here>>
--<<boundary>>
Content-Type: application/octet-stream
Content-Transfer-Encoding: binary
Content-Description: data
Content-Id: <<next-id>>
Content-Length: <<-1 or the size in bytes of the binary data>>

x <<CRLF>>
--<<boundary>>

A single array

Dataset {
    Int32 x[4][4];
} foo;
Content-Type: multipart/related; type="text/xml"; start="<<start id>>";  boundary="<<boundary>>"
 
--<<boundary>>
Content-Type: text/xml; charset=UTF-8
Content-Transfer-Encoding: binary
Content-Description: ddx
Content-Id: <<start-id>>

    <<DDX here>>
--<<boundary>>
Content-Type: application/octet-stream
Content-Transfer-Encoding: binary
Content-Description: data
Content-Id: <<next-id>>
Content-Length: <<-1 or the size in bytes of the binary data>>

x00 x01 x02 x03 x10 x11 x12 x13 <<CRLF>>
--<<boundary>>

A single structure

Dataset {
    Structure {
        Int32 x[4][4];
        Float64 y;
    } s;
} foo;
Content-Type: multipart/related; type="text/xml"; start="<<start id>>";  boundary="<<boundary>>"
 
--<<boundary>>
Content-Type: text/xml; charset=UTF-8
Content-Transfer-Encoding: binary
Content-Description: ddx
Content-Id: <<start-id>>

    <<DDX here>>
--<<boundary>>
Content-Type: application/octet-stream
Content-Transfer-Encoding: binary
Content-Description: data
Content-Id: <<next-id>>
Content-Length: <<-1 or the size in bytes of the binary data>>

x00 x01 x02 x03 x10 x11 x12 x13 <<CRLF>>
y <<CRLF>>
<<CRLF>>
--<<boundary>>


An array of structures

Dataset {
    Structure {
        Int32 x[4][4];
        Float64 y;
    } s[3];
} foo;
Content-Type: multipart/related; type="text/xml"; start="<<start id>>";  boundary="<<boundary>>"
 
--<<boundary>>
Content-Type: text/xml; charset=UTF-8
Content-Transfer-Encoding: binary
Content-Description: ddx
Content-Id: <<start-id>>

    <<DDX here>>
--<<boundary>>
Content-Type: application/octet-stream
Content-Transfer-Encoding: binary
Content-Description: data
Content-Id: <<next-id>>
Content-Length: <<-1 or the size in bytes of the binary data>>

x00 x01 x02 x03 x10 x11 x12 x13 <<CRLF>>
y <<CRLF>>
x00 x01 x02 x03 x10 x11 x12 x13 <<CRLF>>
y <<CRLF>>
x00 x01 x02 x03 x10 x11 x12 x13 <<CRLF>>
y <<CRLF>>
<<CRLF>>
--<<boundary>>

A single varying array (one varying dimension)

Dataset {
    Int32 x[4][*];
} foo;
Content-Type: multipart/related; type="text/xml"; start="<<start id>>";  boundary="<<boundary>>"
 
--<<boundary>>
Content-Type: text/xml; charset=UTF-8
Content-Transfer-Encoding: binary
Content-Description: ddx
Content-Id: <<start-id>>

    <<DDX here>>
--<<boundary>>
Content-Type: application/octet-stream
Content-Transfer-Encoding: binary
Content-Description: data
Content-Id: <<next-id>>
Content-Length: <<-1 or the size in bytes of the binary data>>

<<SOI>>x00 x01 x02 x03
<<SOI>>x00 x01 x02 x03 
<<SOI>>x00 x01 x02 x03 
<<CRLF>>
--<<boundary>>

NB: There's no way to know that there are only three 'rows.'
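That NB implies a reader of Int32 x[4][*] must keep consuming rows until it sees the terminating CRLF rather than another SOI marker. A sketch of that loop, assuming (as above) a hypothetical one-byte SOI marker and 32-bit big-endian integers:

```python
import struct

SOI = b"\x5a"   # hypothetical Start of Instance byte
CRLF = b"\r\n"

def read_varying_rows(buf: bytes, fixed_len: int = 4):
    """Read rows of `fixed_len` Int32s; the row count is discovered, not declared."""
    rows, pos = [], 0
    while buf[pos:pos + 1] == SOI:            # another row follows
        pos += 1
        row = list(struct.unpack(">%di" % fixed_len, buf[pos:pos + 4 * fixed_len]))
        pos += 4 * fixed_len
        rows.append(row)
    assert buf[pos:pos + 2] == CRLF           # terminating CRLF ends the array
    return rows

data = (SOI + struct.pack(">4i", 0, 1, 2, 3)
        + SOI + struct.pack(">4i", 10, 11, 12, 13)
        + CRLF)
```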

A single varying array (two varying dimensions)

Dataset {
    Int32 x[*][*];
} foo;
Content-Type: multipart/related; type="text/xml"; start="<<start id>>";  boundary="<<boundary>>"
 
--<<boundary>>
Content-Type: text/xml; charset=UTF-8
Content-Transfer-Encoding: binary
Content-Description: ddx
Content-Id: <<start-id>>

    <<DDX here>>
--<<boundary>>
Content-Type: application/octet-stream
Content-Transfer-Encoding: binary
Content-Description: data
Content-Id: <<next-id>>
Content-Length: <<-1 or the size in bytes of the binary data>>

<<SOI>> 
<<SOI>> x <<CRLF>>
<<SOI>> x <<CRLF>>
<<SOI>> x <<CRLF>>

<<SOI>> 
<<SOI>> x <<CRLF>>
<<SOI>> x <<CRLF>>
<<SOI>> x <<CRLF>>

<<SOI>> 
<<SOI>> x <<CRLF>>
<<SOI>> x <<CRLF>>
<<SOI>> x <<CRLF>>

<<CRLF>>
--<<boundary>>

NB: Three rows and three columns; spaces added to make things easy to see

A varying array of structures

Dataset {
    Structure {
        Int32 x[4][4];
        Float64 y;
    } s[*];
} foo;
Content-Type: multipart/related; type="text/xml"; start="<<start id>>";  boundary="<<boundary>>"
 
--<<boundary>>
Content-Type: text/xml; charset=UTF-8
Content-Transfer-Encoding: binary
Content-Description: ddx
Content-Id: <<start-id>>

    <<DDX here>>
--<<boundary>>
Content-Type: application/octet-stream
Content-Transfer-Encoding: binary
Content-Description: data
Content-Id: <<next-id>>
Content-Length: <<-1 or the size in bytes of the binary data>>

<<SOI>>x00 x01 x02 x03 x10 x11 x12 x13 <<CRLF>>
y <<CRLF>>
<<SOI>>x00 x01 x02 x03 x10 x11 x12 x13 <<CRLF>>
y <<CRLF>>
<<CRLF>>
--<<boundary>>

NB: this time only two rows...