DAP4: DAP4 On the Wire Format

From OPeNDAP Documentation
Revision as of 19:14, 14 March 2012 by Ndp (talk | contribs)

Background

The current DAP2 clients use two different approaches to managing the packet of data that is sent by the server.

The C++ libdap library uses what I will call an "eager" evaluation method. By this I mean that the whole packet is processed when it is received: it is decomposed into its constituent parts (data arrays, sequence records, etc.) and those parts are used to annotate the parsed DDS.

In contrast, the oc library uses a "lazy" evaluation method. That is, the incoming packet is written immediately to a file or to a chunk of heap memory. Almost no preprocessing occurs. Data extraction occurs only when requested by the user code through the API.

Problem addressed

The relative merits and demerits of lazy versus eager are well known and will not be repeated here.

Lazy evaluation of the DAP2 packet is hampered by the inlining of variable length data: sequences and strings specifically. If it were not for those, the lazy evaluator could compute directly the location of the desired subset of data as requested by the user and without having to read any intermediate information. When, for example, Strings are inlined, then it is necessary to walk the packet piece by piece to step over the strings.
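
The difference can be illustrated with a minimal Python sketch. It assumes XDR-style strings (a 4-byte big-endian length followed by the bytes, padded to a multiple of four), which is how DAP2 encodes strings on the wire; the function names are hypothetical and are not part of any existing library. Reaching the n-th inlined string requires a walk over all the preceding ones, whereas the n-th fixed-size element is located by a single multiplication.

```python
import struct


def xdr_string(data):
    """Encode bytes as an XDR-style string: 4-byte length, data, zero padding."""
    padding = (-len(data)) % 4
    return struct.pack(">I", len(data)) + data + b"\0" * padding


def nth_inlined_string(packet, start, n):
    """Walk the inlined strings one by one to reach the n-th (0-based)."""
    pos = start
    for _ in range(n):
        (length,) = struct.unpack_from(">I", packet, pos)
        pos += 4 + (length + 3) // 4 * 4  # skip the count and the padded bytes
    (length,) = struct.unpack_from(">I", packet, pos)
    return packet[pos + 4:pos + 4 + length]


def nth_fixed_offset(start, n, element_size):
    """With fixed-size elements, the location is computed directly."""
    return start + n * element_size


packet = xdr_string(b"a") + xdr_string(b"bcde") + xdr_string(b"fg")
print(nth_inlined_string(packet, 0, 2))
print(nth_fixed_offset(0, 2, 24))
```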

I plan to use lazy evaluation for DAP4, and propose here a format for the on-the-wire data packet that makes lazy operation fast and simple without, I believe, interfering with eager evaluation.

Proposed solution

Since we are agreed on the use of multipart-mime, the incoming data is presumed to be a sequence of variable-length packets, each with a known length and a unique id.

Under these assumptions, I propose the following format.

  1. The initial packet is of known, computable length ("fixed length" for short). That is, its size can be computed knowing only the DXD for the incoming data. This means that strings and sequences are not represented inline, but instead are represented by fixed-size "pointers" into other, following packets that contain the sequence and/or string data.
  2. Each element in a string array in the initial packet is represented by three pieces of fixed size info:
    • the unique id of the packet containing the contents of the string.
    • the offset of the string within that packet.
    • the length of the string in bytes (assuming utf-8 encoding).
    As an optimization, the string packet can be directly appended to the fixed size initial packet, in which case, the first item is not strictly necessary.
  3. Given a sequence object, either as a scalar or as an array of sequences, each sequence is replaced by the following fixed-size item:
    • The unique id of the packet containing the sequence records
    Further, each record of the sequence packet is assumed to be "fixed length" by applying the rules above. This means that knowing the total size of the packet containing the sequence records, it is possible to know the exact number of records in the packet without actually having to walk the sequence packet to count them.
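
The rules above can be sketched in Python as follows. This is only an illustration under stated assumptions: the proposal says the three pieces of a string reference are of fixed size but does not give their widths, so the sketch assumes three unsigned 64-bit big-endian integers, and `packets` is a hypothetical mapping from packet id to packet contents.

```python
import struct

# Assumption: each string reference is three unsigned 64-bit big-endian
# integers. The proposal only says the three pieces are of fixed size.
STRING_REF = struct.Struct(">QQQ")  # (packet id, offset, length in bytes)


def read_string(initial_packet, ref_offset, packets):
    """Resolve the string reference stored at ref_offset in the initial packet.

    packets is a hypothetical dict mapping packet id -> packet bytes.
    """
    packet_id, offset, length = STRING_REF.unpack_from(initial_packet, ref_offset)
    return packets[packet_id][offset:offset + length].decode("utf-8")


def record_count(packet_size, record_size):
    """Number of fixed-length records in a sequence packet, without walking it."""
    if record_size <= 0 or packet_size % record_size != 0:
        raise ValueError("packet size is not a whole number of records")
    return packet_size // record_size


# Two strings stored back to back in string packet 7.
first = "caf\u00e9".encode("utf-8")  # 5 bytes in UTF-8, 4 characters
second = b"world"
packets = {7: first + second}
initial = (STRING_REF.pack(7, 0, len(first))
           + STRING_REF.pack(7, len(first), len(second)))

print(read_string(initial, STRING_REF.size, packets))
print(record_count(120, 24))
```

Because every reference has the same size, the i-th element of a string array sits at a position that depends only on i and on the declared layout, never on the contents of the strings themselves.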

Rationale for the solution

The above representation makes lazy evaluation very simple: a given item in a packet can be reached in O(1) time. Even in the case of nested sequences, the proper item can be reached in O(log n) time, where n is the depth of the nesting.
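
As a rough illustration of navigating nested sequences, the following Python sketch follows one packet reference per nesting level. It assumes that a sequence reference is an 8-byte big-endian packet id (the proposal does not fix the width), that `packets` is a hypothetical mapping from packet id to packet bytes, and that the record sizes and field offsets have already been derived from the DXD. The function name `locate` is also hypothetical.

```python
import struct

SEQ_REF = struct.Struct(">Q")  # assumption: a sequence reference is an 8-byte packet id


def locate(packets, packet_id, path):
    """Return (packet id, byte position) of the item selected by path.

    path holds one (record_index, record_size, field_offset) triple per
    nesting level. At every level except the last, the selected field is
    a SEQ_REF naming the packet that holds the nested sequence.
    """
    for depth, (record_index, record_size, field_offset) in enumerate(path):
        pos = record_index * record_size + field_offset
        if depth == len(path) - 1:
            return packet_id, pos
        (packet_id,) = SEQ_REF.unpack_from(packets[packet_id], pos)
    raise ValueError("path must contain at least one level")


# Outer records: an int32 followed by a reference to an inner sequence packet.
outer = struct.pack(">iQ", 10, 2) + struct.pack(">iQ", 20, 3)
packets = {
    1: outer,
    2: struct.pack(">i", 1),
    3: struct.pack(">iii", 100, 200, 300),
}

# Third int32 of the inner sequence attached to the second outer record.
pid, pos = locate(packets, 1, [(1, 12, 4), (2, 4, 0)])
(value,) = struct.unpack_from(">i", packets[pid], pos)
print(pid, pos, value)
```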

The cost is that a hash map is needed to map unique ids to offsets in the file or heap memory.
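
A minimal sketch of such a map in Python, assuming the multipart parts arrive as (packet id, payload) pairs and are appended to a single in-memory buffer; the names `build_index` and `packet_view` are hypothetical:

```python
def build_index(parts):
    """Append each packet to one buffer and record where it landed.

    parts is an iterable of (packet_id, payload_bytes) pairs.
    Returns (buffer, index), where index maps packet_id -> (offset, length).
    """
    buffer = bytearray()
    index = {}
    for packet_id, payload in parts:
        index[packet_id] = (len(buffer), len(payload))
        buffer += payload
    return bytes(buffer), index


def packet_view(buffer, index, packet_id):
    """Return a zero-copy view of one packet inside the buffer."""
    offset, length = index[packet_id]
    return memoryview(buffer)[offset:offset + length]


buffer, index = build_index([(1, b"abcd"), (2, b"xyz")])
print(index)
print(bytes(packet_view(buffer, index, 2)))
```

The same index works when the packets are written to a file instead: the stored offsets then become file positions to seek to.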

The lazy versus eager cases also apply on the server side. Currently, for example, the opendap code on the thredds server takes the underlying data source (a .nc file, for example), converts it to DAP2 and annotates the DDS with the data. Then, as a second pass, the annotations are converted as needed and sent out over the wire.

A lazy version would associate elements of the underlying source with the DDS. Transfer of the data would then occur directly from the original source to the wire format as needed.

My hypothesis is that the proposed encoding will also simplify the use of lazy evaluation on the server side.

Addendum 2012-02-20: The above encoding has as one consequence that all embedded counts that currently exist in DAP2 are superfluous. Ditto for the sequence record markers. It may still be desirable to include the counts for purposes of error checking, but they are not strictly necessary.

-Dennis Heimbigner

Discussion

Ideas about the on-wire-data format