DataDDX

  • Use a multipart MIME document to hold the DDX and one data blob per DDX. Adopt the same use of multipart MIME as WCS uses.
  • Consider the following design idea for embedding type information within the data stream:

In DAP2, the DDS is used as a descriptive header for data. For gridded data that works well enough, but for point data it does not. When the data are read from a stream it's OK, but when the data are read, stored and then used, they need to be in a linked structure. Because the DDS contains the data type definition and not the values, it does not provide a structure suitable for holding values. Look at how the d_values field in Sequence works (see Sequence::print_val() and deserialize()). A better approach would be to encode the type of a variable in the data stream itself and have the reader (the client, most of the time) build instances as needed. For arrays, structures and simple types this is mostly a wash, but for Sequences it would be a major plus because the protocol could support sequences of complex objects more easily. It would also get rid of the odd situation where a DDS holds a type definition for a nested sequence while the top-most sequence holds the tree of objects that hold values (i.e., the child sequences in the DDS don't hold data at all).

As an alternative, suppose we build a data response using the DDX in a multipart document and then encode type information in the data stream as well. This would provide a way to bundle attributes with variables in the data response and to locate type information with the data values (mostly for Sequences).
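
To make the idea concrete, here is a minimal sketch in Python of what "type information in the data stream" could look like for a Sequence of simple values. The type codes, markers and layout are hypothetical illustrations for this page, not the DAP wire format.

 import struct
 
 # Hypothetical one-byte type codes and sequence markers; a real encoding
 # would be fixed by the spec, not chosen here.
 TYPE_INT16, TYPE_FLOAT64, TYPE_STRING = 0x01, 0x02, 0x03
 START_OF_SEQUENCE, END_OF_SEQUENCE = 0x5A, 0xA5
 
 def encode_value(buf, type_code, value):
     """Append one self-describing value: <type code><payload>."""
     buf.append(type_code)
     if type_code == TYPE_INT16:
         buf.extend(struct.pack('>h', value))
     elif type_code == TYPE_FLOAT64:
         buf.extend(struct.pack('>d', value))
     elif type_code == TYPE_STRING:
         data = value.encode('utf-8')
         buf.extend(struct.pack('>I', len(data)))
         buf.extend(data)
 
 def encode_sequence(rows):
     """Encode rows of (type_code, value) pairs; a reader can rebuild the
     instances from the stream without consulting the DDS."""
     buf = bytearray([START_OF_SEQUENCE])
     for row in rows:
         for type_code, value in row:
             encode_value(buf, type_code, value)
     buf.append(END_OF_SEQUENCE)
     return bytes(buf)
 
 # One sequence with two rows of (Int16, Float64) values.
 blob = encode_sequence([[(TYPE_INT16, 7), (TYPE_FLOAT64, 2.5)],
                         [(TYPE_INT16, 9), (TYPE_FLOAT64, 3.5)]])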

Questions to consider

  1. Is it stupid?
  2. Will we be able to implement it efficiently?
  3. How can we get those pesky reliable error messages/objects in there?
  4. What will trigger our server to send this response?

Normative References

Multipart MIME

XLink

Using Multipart MIME for the DataDDX response

The DataDDX response's network representation will be a multipart MIME document with two parts: one part that contains a DDX with zero or more variables, and one part that contains zero or more bytes of XDR-encoded data corresponding to the variables declared in the DDX. The Data element in the DDX holds an xlink reference to this second part.

The DataDDX may carry no data values at all, to account for cases where a dataset contains only type definitions, something that never happens now but which is an emerging feature of both HDF5 and NetCDF4.

The DataDDX will always have two parts, even if the second 'data' part is empty, so that processing software can always assume that a DataDDX occupies two parts of a multipart MIME document.

The Data element is used to link the DDX, in one part, to the data values, in another part, so that other protocols (e.g., DAP-SOAP) can package several responses in one document easily. That is, while this design does not provide for that capability, it is easily extensible to one that does.
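
As a rough illustration of the two-part layout and of the cid: link between the parts, here is a sketch that uses Python's standard email package. The Content-Id values and the abbreviated DDX are placeholders borrowed from the example response below; a real server would stream the parts and use a binary transfer encoding rather than letting the email package buffer and re-encode them.

 from email.mime.application import MIMEApplication
 from email.mime.multipart import MIMEMultipart
 from email.mime.text import MIMEText
 
 DDX_CID = "080B6DC4AC8AF0C43041C57CE8DE9646"    # placeholder ids, matching
 DATA_CID = "6efa6ea4:98eda872192:-1ed1"         # the example response below
 
 # Part 1: the DDX; its Data element points at the data part via a cid: URL.
 ddx = MIMEText('<Dataset xmlns="http://xml.opendap.org/ns/DAP/3.2#" name="fnoc1.nc">\n'
                '  <Data xmlns:xlink="http://www.w3.org/XML/1999/xlink"\n'
                f'        xlink:href="cid:{DATA_CID}" xlink:type="simple"/>\n'
                '</Dataset>\n', 'xml', 'utf-8')
 ddx.add_header('Content-Id', f'<{DDX_CID}>')
 
 # Part 2: the (possibly empty) XDR-encoded data blob.
 data = MIMEApplication(b'', 'octet-stream')
 data.add_header('Content-Id', f'<{DATA_CID}>')
 
 # The multipart/related wrapper; 'start' names the DDX part as the root.
 doc = MIMEMultipart('related', type='text/xml', start=f'<{DDX_CID}>')
 doc.attach(ddx)
 doc.attach(data)
 print(doc.as_string())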

Adding the Data element

In addition to the multipart MIME packaging that holds the two parts of the data response, the Data element (within the DDX) holds a reference to the 'data part' of the response. Here's a sample Data element:

<Data xmlns:xlink="http://www.w3.org/XML/1999/xlink" 
      xlink:href="cid:6efa6ea4:98eda872192:-1ed1" xlink:type="simple"/>

Example DataDDX response sent via HTTP/1.1

Note that this example shows the DataDDX being returned using HTTP/1.1. In past versions of DAP, important information was encoded in the HTTP response headers. In this example the key information, that this response conforms to DAP version 3.2, is encoded both in the response header (XDAP: 3.2) and in the DDX's Dataset element using the dap_version attribute. This makes the DataDDX more friendly toward applications that use non-HTTP transport protocols.

 HTTP/1.1 200 OK
 Server: Apache-Coyote/1.1
 Content-Type: multipart/related; type="text/xml"; start="<080B6DC4AC8AF0C43041C57CE8DE9646>"; boundary="--mimepart_7_9651610.1145395859678"
 Date: Tue, 18 Apr 2006 21:30:59 GMT
 XDAP: 3.2
 Connection: close
 
 --mimepart_7_9651610.1145395859678
 Content-Type: text/xml; charset=UTF-8
 Content-Transfer-Encoding: binary
 Content-Id: <080B6DC4AC8AF0C43041C57CE8DE9646>
 
 <?xml version="1.0" encoding="UTF-8"?>
 
     <Dataset 
              xmlns:xml="http://www.w3.org/XML/1998/namespace"  
              xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
              xsi:schemaLocation="http://xml.opendap.org/ns/DAP/3.2#  http://xml.opendap.org/dap/dap/3.2.xsd"
              xmlns="http://xml.opendap.org/ns/DAP/3.2#"
              xmlns:dap="http://xml.opendap.org/ns/DAP/3.2#"
              xml:base="http://test.opendap.org/dap/data/nc/fnoc1.nc.ddx"
              dap_version="3.2"
              name="fnoc1.nc">

        <Data xmlns:xlink="http://www.w3.org/XML/1999/xlink" 
              xlink:href="cid:6efa6ea4:98eda872192:-1ed1" xlink:type="simple"/>

        <Attribute name="NC_GLOBAL" type="Container">
	     <Attribute name="base_time" type="String">
	         <value>"88- 10-00:00:00"</value>
	     </Attribute>
	     <Attribute name="title" type="String">
	         <value>" FNOC UV wind components from 1988- 10 to 1988- 13."</value>
	     </Attribute>
	 </Attribute>
	 <Attribute name="DODS_EXTRA" type="Container">
	     <Attribute name="Unlimited_Dimension" type="String">
	         <value>"time_a"</value>
	     </Attribute>
	 </Attribute>
	 <Array name="v">
	     <Attribute name="units" type="String">
		 <value>"meter per second"</value>
	     </Attribute>
             <Attribute name="long_name" type="String">
		 <value>"Vector wind northward component"</value>
	     </Attribute>
	     <Attribute name="missing_value" type="String">
		 <value>"-32767"</value>
	     </Attribute>
	     <Attribute name="scale_factor" type="String">
		 <value>"0.005"</value>
	     </Attribute>
	     <Int16/>
	     <dimension name="time_a" size="16"/>
	     <dimension name="lat" size="17"/>
	     <dimension name="lon" size="21"/>
	 </Array>
     </Dataset>

 --mimepart_7_9651610.1145395859678
 Content-Type: application/octet-stream
 Content-Transfer-Encoding: binary
 Content-Id: <6efa6ea4:98eda872192:-1ed1>
   
   Here be the XDR encoded binary stuff that is the data from the GetDATA request
   
 --mimepart_7_9651610.1145395859678--
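
On the receiving side, a client has to split the two MIME parts and follow the Data element's xlink:href to the part with the matching Content-Id. Here is a sketch of that lookup in Python using the standard email and xml.etree modules; it assumes raw_bytes holds the MIME headers and body, i.e. everything after the HTTP status line.

 import email
 from email import policy
 from xml.etree import ElementTree
 
 DAP_NS = 'http://xml.opendap.org/ns/DAP/3.2#'
 XLINK_NS = 'http://www.w3.org/XML/1999/xlink'   # as used in the DDX above
 
 def split_dataddx(raw_bytes):
     """Split a DataDDX body into (DDX Element, data bytes) by following the
     Data element's cid: reference to the part with that Content-Id."""
     msg = email.message_from_bytes(raw_bytes, policy=policy.default)
     parts = list(msg.iter_parts())                  # expect exactly two parts
     blobs = {(p.get('Content-Id') or '').strip('<> '): p.get_payload(decode=True)
              for p in parts[1:]}
 
     ddx = ElementTree.fromstring(parts[0].get_payload(decode=True))
     data_el = ddx.find(f'.//{{{DAP_NS}}}Data')
     href = data_el.get(f'{{{XLINK_NS}}}href')       # e.g. "cid:6efa6ea4:..."
     return ddx, blobs[href[len('cid:'):]]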

Encoding the binary data

How we encode the binary data will determine whether a client can reasonably be expected to recognize when the data serialization code has stumbled and been forced to send an error message in place of the data. Another issue we might look at is the current cumbersome way of encoding Sequence data: it places a heavy burden on the server and has always had significant issues (i.e., it's broken).

Reliable Errors in the Hyrax Response

The BES and OLFS communicate using chunked data transmissions. That is, the BES first sends a seven-byte ASCII-hex byte count and a control character of 'd' for data (eight bytes in total), followed by that number of bytes of data; this repeats zero or more times and ends with the byte count/control character sequence 0000000d (seven ASCII zeros plus a 'd'). If the BES encounters an error, it sends a byte count and a control character that marks out-of-band control information, followed by the error in the next data block. The chunking scheme is described on the Trac site in BES Chunking.

What happens in the OLFS is that the byte count is read, then the following block of data is read and passed on to the client, then the next byte count is read, and so on, until the zero byte count (0000000) is read, signaling the end of the document. That is, the chunked nature of the data transmission is stripped, so the data Hyrax now sends (DAP 2.0 to 3.2) is not chunked.
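
For reference, here is a sketch in Python of the chunk framing just described: a seven-digit ASCII-hex byte count plus one control character, with a zero count marking the end of the document. The authoritative definition is the BES Chunking page, so treat this as an approximation.

 import io
 
 def read_chunks(stream):
     """Yield (control_char, payload) pairs from a BES-style chunked stream:
     each chunk is a 7-digit ASCII-hex byte count, one control character
     ('d' for data), then that many bytes; a zero count ends the stream."""
     while True:
         header = stream.read(8)
         if len(header) < 8:
             raise IOError('truncated chunk header')
         count, control = int(header[:7], 16), chr(header[7])
         if count == 0 and control == 'd':
             return                                  # 0000000d: end of document
         yield control, stream.read(count)
 
 # De-chunking as the OLFS does it today: concatenate the data payloads.
 sample = io.BytesIO(b'0000005dhello0000006d world0000000d')
 print(b''.join(p for c, p in read_chunks(sample) if c == 'd'))   # b'hello world'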

I would propose that for DAP 3.2 (or 3.3?) we make it so that those byte counts are passed on to the client in the data part of the DataDDX response. This will make it easier to have responses other than data and error; for example, we might indicate that the data will be delivered using some form of asynchrony.

ndp 13:50 12 December 2008
Is the proposal that we chunk the data? What is the suggestion here? How did this feature creep of combining an asynchronous protocol into what should be simple chunking start? If we want asynchronous protocol then let's adopt AMQP and quit reinventing the wheel.

Alternative: If an error is part of the BES's transmission, Hyrax should deviate a bit from the protocol used by the BES and OLFS, in that there is only one way to indicate an error: with an error object. The byte count should be zero (0000000), the control character should be 'x' or 'e' or something other than 'd' (which is used for data), and the remaining bytes should be the error object. The error response would be encoded using XML.
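
A sketch of what the sending side of this alternative could look like in Python; the 'e' control character and the shape of the XML error object are illustrative assumptions, not settled choices.

 import io
 
 def send_chunk(out, payload):
     """Normal data chunk: 7-hex-digit byte count, 'd', then the bytes."""
     out.write(b'%07xd' % len(payload) + payload)
 
 def send_error(out, error_xml):
     """Alternative error form: zero byte count, a non-'d' control character
     ('e' here, as one possibility), then the XML error object, unchunked."""
     out.write(b'0000000e' + error_xml)
 
 buf = io.BytesIO()
 send_chunk(buf, b'some data bytes')
 send_error(buf, b'<Error><Message>read failed</Message></Error>')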

The alternative approach is probably easier for clients to implement, while the first approach is probably easier for us to code and has the advantage that we might be able to put the chunking system to other uses.

Second alternative: If anything other than data is to be sent, use 0000000<char>, where <char> is 'x' for an error, 'a' for an asynchronous response, etc., and where the error, URL, etc. immediately follows the byte count and control character. The rationale is that these responses are all much, much smaller than a data transmission and thus don't need chunking. At the same time, the client needs reading these to be as easy as we can make it. Finally, if we ever need to add a non-data response that does need chunking, we can still do that. I don't want client implementers to build a big machine for something that's not needed.
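
A client-side dispatch for this second alternative might look like the following sketch; the 'x' and 'a' control characters come from the paragraph above, while treating the unchunked payload as the rest of the stream is an assumption.

 def read_response(stream):
     """Return (kind, payload) by dispatching on the first chunk header.
     A zero byte count signals an out-of-band response named by the control
     character; anything else is an ordinary data chunk."""
     header = stream.read(8)
     count, control = int(header[:7], 16), chr(header[7])
     if count != 0:
         return 'data', stream.read(count)
     if control == 'x':
         return 'error', stream.read()    # XML error object, not chunked
     if control == 'a':
         return 'async', stream.read()    # e.g. where/when to fetch the data
     return 'end', b''                    # 0000000d: normal end of data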

ndp 13:40 12 December 2008 (PST)
I don't see why either of these modifications is useful.

  1. If someone is going to develop using our libraries then error handling is already managed.
  2. If someone is going to go out and write a client from the ground up they are going to have to write a chunk stream reader for our data responses, so they already have the thing they need to read the chunked error body.

I think it is MUCH more confusing to have the protocol turning chunking on and off based on some interpretation of content - I think that would drive a client implementer crazy. Basically I see both of the "alternatives" listed as making things more complex and harder to understand.

Encoding Sequence Data