DAP4: DAP4 Checksum Changes

From OPeNDAP Documentation

Revision as of 23:57, 1 March 2013

Background

Currently, the specification says that checksums are computed on the contents of the top-level variables in the serialized response. I propose that we change when and where the checksum is computed, so that it is computed per data chunk instead of per variable.

Problem Addressed

Checksumming currently requires knowledge of the DMR (and potentially any constraints) so that the contents of "top-level" variables can be identified in the data serialization. Further, checksum errors are not detected until the whole variable has been read.

Proposed Solution

General Format

Each chunk, including the DMR metadata chunk, has a 16-byte MD5 checksum that precedes the contents of the chunk but follows the length count for the chunk. Thus, the general form of a chunk would be as follows:

---------------------------
| length + tags [4 bytes] |
---------------------------
| checksum [16 bytes]     |
---------------------------
| data [length-16 bytes]  |
---------------------------

The checksum is computed over the contents of the chunk, where the content is treated as uninterpreted bytes. This would also apply to the DMR chunk. The length field for each chunk includes the checksum as part of its length.
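
As a rough illustration of the layout above, the following Python sketch writes and reads one chunk. The proposal does not say how the tags and the length share the 4-byte header, so the packing used here (tags in the top byte, a 24-bit length in the low three bytes, big-endian) is an assumption, as are the function names `encode_chunk` and `decode_chunk`.

```python
import hashlib
import struct

CHECKSUM_SIZE = 16  # an MD5 digest is 16 bytes


def encode_chunk(data: bytes, tags: int = 0) -> bytes:
    """Build header + checksum + data for one chunk."""
    # Per the proposal, the length field counts the checksum plus the data.
    length = CHECKSUM_SIZE + len(data)
    if length >= 1 << 24:
        raise ValueError("chunk too large for the assumed 24-bit length")
    # ASSUMPTION: tags in the top byte, length in the low 24 bits, big-endian.
    header = struct.pack(">I", ((tags & 0xFF) << 24) | length)
    # The checksum covers the data only, treated as uninterpreted bytes.
    return header + hashlib.md5(data).digest() + data


def decode_chunk(buf: bytes):
    """Return (tags, data) or raise OSError if the chunk is damaged."""
    (word,) = struct.unpack(">I", buf[:4])
    tags, length = word >> 24, word & 0xFFFFFF
    body = buf[4:4 + length]
    if length < CHECKSUM_SIZE or len(body) != length:
        raise OSError("truncated chunk")
    checksum, data = body[:CHECKSUM_SIZE], body[CHECKSUM_SIZE:]
    if hashlib.md5(data).digest() != checksum:
        raise OSError("chunk checksum mismatch")
    return tags, data
```

Note that neither function needs the DMR or the constraint expression; both work on raw bytes only.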

DMR Format

In keeping with our current format for the DMR chunk, all parts of this chunk are encoded as UTF-8 characters. So the general DMR chunk format would be as follows:

------------------------------------------
| 0xHHHHHHHH CRLF                         |
------------------------------------------
| 0xHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH CRLF |
------------------------------------------
| DMR in XML format CRLF                  |
------------------------------------------

Each 'H' stands for a hex digit. As in the general case, the length includes the checksum and the chunk content (i.e. the DMR).
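
A minimal sketch of how a writer might produce the DMR chunk in this textual form is shown below. Two points are not settled by the text and are assumptions here: the length is taken to count the bytes of the checksum line and the DMR line as they appear on the wire (including their CRLFs), and the checksum is taken to cover the DMR text without its trailing CRLF. The function name `encode_dmr_chunk` is hypothetical.

```python
import hashlib

CRLF = b"\r\n"


def encode_dmr_chunk(dmr_xml: str) -> bytes:
    """Encode a DMR as: length line, checksum line, DMR line."""
    dmr = dmr_xml.encode("utf-8")
    # 32 hex digits of MD5, prefixed with 0x.
    # ASSUMPTION: the checksum covers the DMR bytes, not the trailing CRLF.
    digest_hex = hashlib.md5(dmr).hexdigest().upper().encode("ascii")
    checksum_line = b"0x" + digest_hex + CRLF
    dmr_line = dmr + CRLF
    # ASSUMPTION: the length counts everything after the length line.
    length = len(checksum_line) + len(dmr_line)
    length_line = (b"0x%08X" % length) + CRLF
    return length_line + checksum_line + dmr_line
```

A reader would parse the first line to learn how many bytes follow, then verify the checksum line against the DMR text before handing the XML to a parser.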

Discussion

Computing the checksum at the chunk level simplifies processing because no knowledge of the DMR is required. The computation can be carried out completely in the chunk reading and writing code.

Additionally, computing the checksum at the chunk level can potentially detect errors more quickly than at the variable level. To see this, suppose that a variable spans, say, three chunks, and there is an error in the first chunk. If the checksum is computed per variable, all three chunks will be read and processed before the checksum error is detected. If it is computed per chunk, the error will be detected as soon as the first chunk is read; the following two chunks need not be processed.

This proposal also introduces no new error-processing complexity, because a checksum error at the chunk level will be reported to the layers above as an IOException. These layers must already be prepared to receive an IOException when, for example, the client-server link is unexpectedly closed.
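
The early-detection and error-propagation points can be sketched in Python, where `OSError` plays the role that IOException plays in Java. The header packing (a 24-bit big-endian length in the low three bytes of the 4-byte header) and the names `read_chunks` and `make_chunk` are assumptions made for this illustration, not part of the proposal.

```python
import hashlib
import io
import struct

CHECKSUM_SIZE = 16  # an MD5 digest is 16 bytes


def make_chunk(data: bytes) -> bytes:
    """Helper for the demo: header (assumed layout) + MD5 + data."""
    header = struct.pack(">I", CHECKSUM_SIZE + len(data))
    return header + hashlib.md5(data).digest() + data


def read_chunks(stream):
    """Yield the data of each chunk; raise OSError on the first bad one."""
    while True:
        header = stream.read(4)
        if not header:
            return  # clean end of stream
        if len(header) < 4:
            raise OSError("truncated chunk header")
        (word,) = struct.unpack(">I", header)
        length = word & 0xFFFFFF  # ASSUMPTION: low 24 bits hold the length
        body = stream.read(length)
        if length < CHECKSUM_SIZE or len(body) != length:
            raise OSError("truncated chunk")
        checksum, data = body[:CHECKSUM_SIZE], body[CHECKSUM_SIZE:]
        if hashlib.md5(data).digest() != checksum:
            # Same error type a dropped connection would produce.
            raise OSError("chunk checksum mismatch")
        yield data
```

If the first of three chunks is corrupted, the reader raises before it touches the second and third chunks, and the caller handles the failure through the same path it already uses for a broken connection.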

Dennis

It was my understanding that checksums were introduced into the specification not as a mechanism for ensuring error-free data transmittal (although that is a side benefit) but as a mechanism through which clients can detect changes (or the lack thereof) in data content. For example, a data provider might reprocess their data holdings because more refined algorithms have become available, but not all variables in the dataset may be affected (thus the last-modified time of the holding is not necessarily the best indicator, at the variable level, of what's going on). If my understanding is accurate, I'm thinking that this proposal fails to meet the requirements regarding change tracking at the variable level. If I am wrong, and checksums are all about error-free transport, then at first read I think that this is a reasonable change.

ndp 15:44, 1 March 2013 (PST)