DAP4: DAP4 Multipart Mime Format: Difference between revisions

From OPeNDAP Documentation
(Created page with "* Use a multipart MIME document to hold the DDX and one ''data blob'' per DDX. Adopt the same use of Multipart MIME as WCS uses. * consider the following design idea for embeddi...")
 
 
(5 intermediate revisions by 2 users not shown)
Line 1: Line 1:
* Use a multipart MIME document to hold the DDX and one ''data blob'' per DDX. Adopt the same use of Multipart MIME as WCS uses.
[[Category:Development|Development]][[Category:DAP4|DAP4]]
[[OPULS_Development| << Back to OPULS Development]]


* consider the following design idea for embedding type information within the data stream:
Note: This document is a revision of material from
<blockquote>In DAP2, the DDS is used as a descriptive header for data. For gridded data that works OK but for point data it doesn't work so well. When the data are read as from a stream, it's OK, but when the data are read and stored, then used, they need to be in a linked structure. Because the DDS contains the data type definition and not the value, it does not hold a structure suitable for holding values. Look at how the d_values field in Sequence works (see Sequence::print_val(), deserialize()). A better approach would be to encode the type of a variable in the data stream itself and have the reader (client most of the time) build instances as needed. For arrays, structures and simple types this is mostly a wash, but for Sequences it would be a major plus because the protocol could support sequences of complex objects more easily. It would also get rid of the odd situation where a DDS holds a type definition for a nested sequence while the top-most sequence holds the tree of objects which hold values. (I.E. the child sequences in the DDS don't hold data at all).</blockquote>
[http://docs.opendap.org/index.php/DAP_4.0_Design DAP4 Design].


<blockquote>
== Introduction ==
As an alternative, suppose we build a data response using the DDX in a multipart document and then encode type information in the data stream as well? <s>This would provide a way to bundle attributes with variables in the data response and locate type information with the data values (for sequences mostly).</s> <font color="red">It's not possible to return the Attributes with a constrained DDX because the constraint can alter the attributes in ways that cannot be computed without semantic knowledge about the attribute.</font>
A DataDDX response is the way DAP4 returns data to a client. Each DataDDX response is returned over the wire as a multipart MIME document where the first part contains the DDX describing the data requested and the second and later parts contain a binary encoding of the requested data or error information.
</blockquote>


[[DAP_4.0_Design#DataDDX|See also the Dap 4.0 Design for the DDX]] Note that there are some contradictions between the two documents.
See the [[DAP4: DDX Grammar | DAP4 DDX Grammar]] document and the
[[DAP4: DDX Lexical Elements | DAP4 Lexical Elements]] document
for the DDX syntax and lexical structure respectively. See the [[DAP4: DAP4 On the Wire Format | on-the-wire format]] document for the format of transmitted data.


== Questions to consider ==
For references to the Multipart MIME specification, see
[http://www.ietf.org/rfc/rfc2387.txt The MIME Multipart/Related Content-type (rfc 2387)] and [http://www.ietf.org/rfc/rfc1521.txt MIME part one].


# Will we be able to implement it efficiently?
==== Organization of the multipart MIME document ====
# How can we get those pesky reliable error messages/objects in there?
# What will trigger our server to send this response?


== Normative References ==
Here's what the shell of the document looks like:


[http://tools.ietf.org/html/rfc2112 Multipart MIME]
[http://www.w3.org/TR/xlink/ Xlink]
== Using Multipart MIME for the DataDDX response ==
The DataDDX response's network representation will be as a Multipart MIME document with two parts: One part that contains a DDX that contains zero or more variables; and one part that contains zero or more bytes of XDR-encoded data which corresponds to the variables declared in the DDX. The ''Data'' element in the DDX holds an xlink reference to this second part.
The DataDDX may be empty to account for cases where a dataset contains only type definitions, something that never happens now but which is an emerging feature of both HDF5 and NetCDF4.
The DataDDX will always have two parts, even if the second 'data' part is empty so that processing software can always assume that a DataDDX will occupy two parts of a multipart MIME document.
The ''Data'' element is used to link the DDX, in one part, to the data values, in another part, so that other protocols (e.g., DAP-SOAP) can package several responses in one document easily. That is, while this design does not provide for that capability, it is easily extensible to one that does.
=== Adding the ''Data'' element ===
In addition to the multipart MIME document that holds the two parts of the data response, the ''Data'' element holds a reference to the 'data part' of the response. Here's a sample ''Data'' element:
<pre>
<Data xmlns:xlink="http://www.w3.org/XML/1999/xlink"
      xlink:href="cid:6efa6ea4:98eda872192:-1ed1" xlink:type="simple"/>
</pre>
=== Example DataDDX response Sent via HTTP 1.1 ===
Note that this example shows the DataDDX being returned using HTTP/1.1. In past versions of DAP, important information was encoded in the HTTP response headers. In this example the key information, that this response conforms to DAP version 3.2, is encoded both in the response header and in the DDX response element using the ''dap-version'' attribute. This makes the DataDDX more friendly toward applications which use non-HTTP transport protocols.
<font size="2">
<source lang="xml">
<source lang="xml">
HTTP/1.1 200 OK
Content-Type: multipart/related; type="text/xml"; start="<<start id>>"; boundary="<<boundary>>"
Server: Apache-Coyote/1.1
Content-Type: multipart/related; type="text/xml"; start="<080B6DC4AC8AF0C43041C57CE8DE9646>"; boundary="--mimepart_7_9651610.1145395859678"
Date: Tue, 18 Apr 2006 21:30:59 GMT
XDAP: 3.2
Connection: close
   
   
--mimepart_7_9651610.1145395859678
--<<boundary>>
Content-Type: text/xml; charset=UTF-8
Content-Type: text/xml; charset=UTF-8
Content-Transfer-Encoding: binary
Content-Transfer-Encoding: binary
Content-Id: <080B6DC4AC8AF0C43041C57CE8DE9646>
Content-Description: ddx
Content-Id: <<start-id>>
<?xml version="1.0" encoding="UTF-8"?>
    <<DDX here>>
--<<boundary>>
    <Dataset
Content-Type: application/x-dap-little-endian
              xmlns:xml="http://www.w3.org/XML/1998/namespace" 
Content-Transfer-Encoding: binary
              xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
Content-Description: data
              xsi:schemaLocation="http://xml.opendap.org/ns/DAP/3.2#  http://xml.opendap.org/dap/dap/3.2.xsd"
Content-Id: <<unique id for this piece of binary data>>
              xmlns="http://xml.opendap.org/ns/DAP/3.2#"
Content-Length: <<-1 or the size in bytes of the binary data>>
              xmlns:dap="http://xml.opendap.org/ns/DAP/3.2#"
    <<XDR encoded binary data, part 1>>
              xml:base="http://test.opendap.org/dap/data/nc/fnoc1.nc.ddx"
--<<boundary>>
              dap_version="3.2"
Content-Type: application/x-dap-little-endian
              name="fnoc1.nc">
Content-Transfer-Encoding: binary
 
Content-Description: data
        <Data xmlns:xlink="http://www.w3.org/XML/1999/xlink"
Content-Id: <<unique id for this piece of binary data>>
              xlink:href="cid:6efa6ea4:98eda872192:-1ed1" xlink:type="simple"/>
Content-Length: <<-1 or the size in bytes of the binary data>>
 
  <<XDR encoded binary data, part 2>>
        <Attribute name="NC_GLOBAL" type="Container">
...
    <Attribute name="base_time" type="String">
--<<boundary>>
        <value>&quot;88- 10-00:00:00&quot;</value>
Content-Type: application/x-dap-little-endian
    </Attribute>
Content-Transfer-Encoding: binary
    <Attribute name="title" type="String">
Content-Description: data
        <value>&quot; FNOC UV wind components from 1988- 10 to 1988- 13.&quot;</value>
Content-Id: <<unique id for this piece of binary data>>
    </Attribute>
Content-Length: <<-1 or the size in bytes of the binary data>>
</Attribute>
    <<XDR encoded binary data, part n>>
<Attribute name="DODS_EXTRA" type="Container">
--<<boundary>>--
    <Attribute name="Unlimited_Dimension" type="String">
        <value>&quot;time_a&quot;</value>
    </Attribute>
</Attribute>
<Array name="v">
    <Attribute name="units" type="String">
<value>&quot;meter per second&quot;</value>
    </Attribute>
            <Attribute name="long_name" type="String">
<value>&quot;Vector wind northward component&quot;</value>
    </Attribute>
    <Attribute name="missing_value" type="String">
<value>&quot;-32767&quot;</value>
    </Attribute>
    <Attribute name="scale_factor" type="String">
<value>&quot;0.005&quot;</value>
    </Attribute>
    <Int16/>
    <dimension name="time_a" size="16"/>
    <dimension name="lat" size="17"/>
    <dimension name="lon" size="21"/>
</Array>
    </Dataset>
</xml>
 
--mimepart_7_9651610.1145395859678
Content-Type: application/octet-stream
Content-Transfer-Encoding: binary
Content-Id: 6efa6ea4:98eda872192:-1ed1
 
  Here be the XDR encoded binary stuff that is the data from the GetDATA request
 
--mimepart_7_9651610.1145395859678--
</source>
</source>
</font>
== Encoding the binary data ==
How we encode the binary data will determine if a client can reasonably be expected to recognize when the data serialization code has stumbled and been forced to send an error message in place of the data. Another issue we might look at is the current cumbersome way of encoding Sequence data - it places a heavy burden on the server and has always had significant issues (i.e., it's broken).
=== Reliable Errors in the Hyrax Response ===
The BES and OLFS communicate using data transmissions that are ''chunked''. That means that the BES first sends a seven-byte ASCII-HEX ''byte count'' and ''control character'' of ''d'' for data (for a total of eight bytes) followed by that number of bytes of data, zero or more times followed by a ''byte count/control character'' sequence of 0x0000000d (seven ASCII digits plus a ''d''). If the BES encounters an error it sends a ''byte count'' and ''control character'' that is out of band control information and then the error in a following data block. The chunking scheme is described on the Trac site in [http://scm.opendap.org:8090/trac/wiki/BES_Chunking BES Chunking].
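
The chunk framing described above can be sketched as a small reader. This is only an illustration under stated assumptions, not the BES or OLFS implementation: the function name <code>read_chunks</code> is hypothetical, and because the text does not name the out-of-band control character, the sketch simply reports whatever control character it finds and leaves the interpretation of anything other than ''d'' to the caller.

```python
import io


def read_chunks(stream):
    """Yield (control_char, payload) pairs from a chunked byte stream.

    Each chunk header is eight bytes: seven ASCII hex digits giving the
    payload size, then one control character ('d' means data).
    The header 0000000d marks the end of the transmission.
    """
    while True:
        header = stream.read(8)
        if len(header) != 8:
            raise EOFError("stream ended inside a chunk header")
        size = int(header[:7].decode("ascii"), 16)
        control = chr(header[7])
        if control == "d" and size == 0:
            return  # terminator chunk: nothing more to read
        payload = stream.read(size)
        if len(payload) != size:
            raise EOFError("stream ended inside a chunk payload")
        # Assumption: any control character other than 'd' is treated as
        # out-of-band information and passed through for the caller to handle.
        yield control, payload


if __name__ == "__main__":
    wire = io.BytesIO(b"0000005dhello" + b"0000003dabc" + b"0000000d")
    for control, payload in read_chunks(wire):
        print(control, payload)
```

A client that received these byte counts in the data part could use the same loop to notice a non-data chunk instead of misreading an error message as data values.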


What currently happens in the OLFS to support DAP2 responses is that the ''byte count'' is read, then the following block of data is read and passed on to the client, then the next ''byte count'' is read, ..., until the 0x0000000d is read, signaling the end of the document. That is, the chunked nature of the data transmission is stripped, so that the data Hyrax now (DAP 2.0 to 3.2) sends is not chunked.
The example shows multiple sets of MIME headers separated by ''--<<boundary>>'' lines; the final boundary line terminates the document. The first group of headers (a real response would include others here, such as Date and XDAP) provides the information needed to recognize the boundary separators. The payload of the first part (the DDX) contains references to the related parts using the values of their Content-Id headers (See [[DAP4: DAP4 On the Wire Format|here]]).


I would propose that for DAP 3.2 (or 3.3?) we make it so that those byte counts are passed on to the client in the data part of the DataDDX response. This will allow Hyrax to send errors and other out-of-band information to clients in a way that clients can actually use. See [[Hyrax_-_BES_PPT#PPT_Chunking | Chunking]] for a description; it's fairly simple.
==== Choosing values for the DataDDX Content-Ids and Boundaries ====


=== Encoding Sequence Data ===
We would like the software that builds these DataDDX responses to be compatible with as many transport protocols as possible, provided the cost to the implementations we know we must support stays low. Some transport protocols may combine several DataDDX responses into a single document. The specifics will vary between protocols, but one choice we can make now to facilitate this is to ensure that the values of the Content-Ids and <<boundary>>s are unique within and across systems. This frees software that combines DataDDX responses from having to process the DDX and Content-Id headers to ensure that no name collisions are present. While using UUIDs, for example, makes the resulting values 'ugly', it adds virtually nothing to the time needed to build or process the responses. Other schemes that combine a URI with some system-generated token could also be employed. The important point is that these symbols be unique not only within a system, but across systems.


Adopt the suggestion that type information be included in the data stream. This will duplicate some (small) amount of information, but it will make the software that decodes the data easier to write, particularly for nested sequences.
[[User:Ndp|ndp]] 12:42, 30 March 2010 (PDT)
[[User:dmh|Dennis Heimbigner]] Modified 5/7/2012.


[[Category:Development|DataDDX]][[Category:DAP4|DataDDX]]
===== Regarding Content-Type =====
For the data part of the response, the value of the Content-Type header will be x-dap-<big|little>-endian. This makes it easy to determine the byte order of the data BLOB that follows without reading any of that BLOB. Note that the BLOB will have the byte order encoded in it as well, making one or the other redundant, but that adds essentially no cost to the server and simplifies clients (because they can use either to determine the response byte order).

Latest revision as of 20:36, 20 August 2012


Note: This document is a revision of material from DAP4 Design.

Introduction

A DataDDX response is the way DAP4 returns data to a client. Each DataDDX response is returned over the wire as a multipart MIME document where the first part contains the DDX describing the data requested and the second and later parts contain a binary encoding of the requested data or error information.

See the DAP4 DDX Grammar document and the DAP4 Lexical Elements document for the DDX syntax and lexical structure respectively. See the on-the-wire format document for the format of transmitted data.

For references to the Multipart MIME specification, see The MIME Multipart/Related Content-type (rfc 2387) and MIME part one.

Organization of the multipart MIME document

Here's what the shell of the document looks like:

Content-Type: multipart/related; type="text/xml"; start="<<start id>>";  boundary="<<boundary>>"
 
--<<boundary>>
Content-Type: text/xml; charset=UTF-8
Content-Transfer-Encoding: binary
Content-Description: ddx
Content-Id: <<start-id>>
    <<DDX here>>
--<<boundary>>
Content-Type: application/x-dap-little-endian
Content-Transfer-Encoding: binary
Content-Description: data
Content-Id: <<unique id for this piece of binary data>>
Content-Length: <<-1 or the size in bytes of the binary data>>
    <<XDR encoded binary data, part 1>>
--<<boundary>>
Content-Type: application/x-dap-little-endian
Content-Transfer-Encoding: binary
Content-Description: data
Content-Id: <<unique id for this piece of binary data>>
Content-Length: <<-1 or the size in bytes of the binary data>>
   <<XDR encoded binary data, part 2>>
...
--<<boundary>>
Content-Type: application/x-dap-little-endian
Content-Transfer-Encoding: binary
Content-Description: data
Content-Id: <<unique id for this piece of binary data>>
Content-Length: <<-1 or the size in bytes of the binary data>>
    <<XDR encoded binary data, part n>>
--<<boundary>>--

The example shows multiple sets of MIME headers separated by --<<boundary>> lines; the final boundary line terminates the document. The first group of headers (a real response would include others here, such as Date and XDAP) provides the information needed to recognize the boundary separators. The payload of the first part (the DDX) contains references to the related parts using the values of their Content-Id headers (See here).
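
As a rough illustration of the layout above, the following sketch assembles such a document from a DDX string and a list of binary blobs. It is a minimal sketch under stated assumptions, not the server's actual code: the names build_data_ddx and _headers are hypothetical, the blobs stand in for the encoded data parts, Content-Length is always set to the blob size rather than -1, and the document is ended with the closing delimiter form (the boundary followed by two hyphens) that multipart MIME uses.

```python
import uuid

CRLF = b"\r\n"


def _headers(pairs):
    """Encode (name, value) pairs as MIME headers followed by a blank line."""
    lines = [f"{name}: {value}".encode("ascii") for name, value in pairs]
    return CRLF.join(lines) + CRLF + CRLF


def build_data_ddx(ddx_xml, data_parts, byte_order="little"):
    """Return (top-level Content-Type value, body bytes) for a DataDDX shell.

    Hypothetical helper: ddx_xml is the DDX text, data_parts is a list of
    already-encoded binary blobs, one per data part.
    """
    boundary = uuid.uuid4().hex
    start_id = f"<{uuid.uuid4().hex}>"
    delimiter = b"--" + boundary.encode("ascii")

    body = bytearray()
    # First part: the DDX, identified by the start id.
    body += delimiter + CRLF
    body += _headers([
        ("Content-Type", "text/xml; charset=UTF-8"),
        ("Content-Transfer-Encoding", "binary"),
        ("Content-Description", "ddx"),
        ("Content-Id", start_id),
    ])
    body += ddx_xml.encode("utf-8") + CRLF

    # Second and later parts: one binary blob each, with its own unique id.
    for blob in data_parts:
        body += delimiter + CRLF
        body += _headers([
            ("Content-Type", f"application/x-dap-{byte_order}-endian"),
            ("Content-Transfer-Encoding", "binary"),
            ("Content-Description", "data"),
            ("Content-Id", f"<{uuid.uuid4().hex}>"),
            ("Content-Length", str(len(blob))),
        ])
        body += blob + CRLF

    # Closing delimiter terminates the multipart document.
    body += delimiter + b"--" + CRLF

    content_type = (
        f'multipart/related; type="text/xml"; '
        f'start="{start_id}"; boundary="{boundary}"'
    )
    return content_type, bytes(body)


if __name__ == "__main__":
    ctype, doc = build_data_ddx("<Dataset/>", [b"\x00\x01", b"\x02\x03\x04"])
    print("Content-Type:", ctype)
    print(doc.decode("latin-1"))
```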

Choosing values for the DataDDX Content-Ids and Boundaries

We would like the software that builds these DataDDX responses to be compatible with as many transport protocols as possible, provided the cost to the implementations we know we must support stays low. Some transport protocols may combine several DataDDX responses into a single document. The specifics will vary between protocols, but one choice we can make now to facilitate this is to ensure that the values of the Content-Ids and <<boundary>>s are unique within and across systems. This frees software that combines DataDDX responses from having to process the DDX and Content-Id headers to ensure that no name collisions are present. While using UUIDs, for example, makes the resulting values 'ugly', it adds virtually nothing to the time needed to build or process the responses. Other schemes that combine a URI with some system-generated token could also be employed. The important point is that these symbols be unique not only within a system, but across systems.

ndp 12:42, 30 March 2010 (PDT) Dennis Heimbigner Modified 5/7/2012.
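
One way the uniqueness requirement for Content-Ids and boundaries might be met is with random UUIDs, as the section suggests. The sketch below is illustrative only: the function names, the "dataddx_" prefix, and the host label "example.invalid" are assumptions made for the example, not part of the design.

```python
import uuid


def new_content_id(host="example.invalid"):
    """Return a Content-Id value built from a random UUID plus a host label.

    The host label is a hypothetical extra token; the UUID alone already
    makes collisions across systems extremely unlikely.
    """
    return f"<{uuid.uuid4()}@{host}>"


def new_boundary():
    """Return a boundary string built from a random UUID.

    The result is 40 characters of letters, digits and an underscore,
    which fits within the 70-character limit for MIME boundaries.
    """
    return f"dataddx_{uuid.uuid4().hex}"


if __name__ == "__main__":
    print(new_content_id("server.example"))
    print(new_boundary())
```

Because each value is generated independently of the DDX content, software that merges several responses can copy the parts through without rewriting any identifiers.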

Regarding Content-Type

For the data part of the response, the value of the Content-Type header will be x-dap-<big|little>-endian. This makes it easy to determine the byte order of the data BLOB that follows without reading any of that BLOB. Note that the BLOB will have the byte order encoded in it as well, making one or the other redundant, but that adds essentially no cost to the server and simplifies clients (because they can use either to determine the response byte order).
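
A client could read the byte order from the header along these lines. This is a hedged sketch: the function names are hypothetical, the parser accepts the value with or without the "application/" prefix used in the example above, and unpack_int32s only demonstrates why the byte order matters; it is not the actual on-the-wire decoding, which is defined in the on-the-wire format document.

```python
import struct


def byte_order_from_content_type(value):
    """Return 'little' or 'big' for an x-dap-<big|little>-endian Content-Type."""
    media_type = value.split(";", 1)[0].strip().lower()
    subtype = media_type.split("/", 1)[-1]  # tolerate a missing "application/"
    if subtype == "x-dap-little-endian":
        return "little"
    if subtype == "x-dap-big-endian":
        return "big"
    raise ValueError(f"not a DAP data Content-Type: {value!r}")


def unpack_int32s(blob, order):
    """Illustration only: decode a blob of 32-bit integers in the given order."""
    prefix = "<" if order == "little" else ">"
    count = len(blob) // 4
    return struct.unpack(f"{prefix}{count}i", blob[: count * 4])


if __name__ == "__main__":
    order = byte_order_from_content_type("application/x-dap-little-endian")
    print(order, unpack_int32s(b"\x01\x00\x00\x00", order))
```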