DAP4: DAP4 Checksum Changes

From OPeNDAP Documentation


[Updated: 3/5/2013 to reflect conversation with Ethan. See discussion section.]

1 Background

Currently, the specification says that checksums are computed on the contents of the top-level variables in the serialized response. I propose that we change when and where the checksum is computed so that it is computed at the grain of the data chunks instead of at the grain of the variable.

2 Problem Addressed

Checksumming currently requires knowledge of the DMR (and potentially any constraints) so that the contents of "top-level" variables can be identified in the data serialization. Further, checksum errors are not detected until the whole variable has been read.

3 Proposed Solution

3.1 General Format

Each chunk, including the DMR meta-data chunk, has a 16-byte MD5 checksum preceding the contents of the chunk, but following the length count for the chunk. Thus, the general form of a chunk would be as follows:

--------------------------
| length + tags [4bytes] |
--------------------------
| checksum [16 bytes]    |
--------------------------
| data [length-16 bytes] |
--------------------------

The checksum is computed over the contents of the chunk, where the content is treated as uninterpreted bytes. This would also apply to the DMR chunk. The length field for each chunk includes the checksum as part of its length.
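As a sketch of this layout in Python: the writer below computes MD5 over the raw chunk contents and counts the 16-byte checksum in the length field, as described above. The bit layout of the length + tags word is not fixed by this proposal, so packing the tags into the high byte is an assumption made here for illustration only.

```python
import hashlib
import struct

CHECKSUM_LEN = 16  # size of an MD5 digest in bytes

def write_chunk(data: bytes, tags: int = 0) -> bytes:
    """Serialize one chunk: 4-byte length+tags word, 16-byte MD5, then data.
    The length field counts the checksum plus the data, per the proposal.
    Packing tags into the high byte is an assumed layout, not specified."""
    checksum = hashlib.md5(data).digest()
    length = CHECKSUM_LEN + len(data)
    header = struct.pack(">I", (tags << 24) | length)
    return header + checksum + data

def read_chunk(raw: bytes) -> bytes:
    """Parse one chunk and verify its checksum; raise IOError on mismatch."""
    (word,) = struct.unpack(">I", raw[:4])
    length = word & 0x00FFFFFF
    checksum = raw[4:4 + CHECKSUM_LEN]
    data = raw[4 + CHECKSUM_LEN:4 + length]
    if hashlib.md5(data).digest() != checksum:
        raise IOError("chunk checksum mismatch")
    return data
```

Note that the reader needs no knowledge of the DMR or of variable boundaries: it verifies each chunk purely as uninterpreted bytes.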

3.2 DMR Format

In keeping with our current format for the DMR chunk, all parts of this chunk are encoded as UTF-8 characters. So the general DMR chunk format would be as follows:

------------------------------------------
| 0xHHHHHHHH CRLF                         |
------------------------------------------
| 0xHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH CRLF |
------------------------------------------
| DMR in XML format CRLF                  |
------------------------------------------

Each 'H' stands for a hex digit. As in the general case, the length includes the checksum and the chunk content (i.e. the DMR).
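A minimal encoder for this text form might look as follows. The text above leaves open whether the length counts the raw 16 checksum bytes or the 32 hex characters that encode them; this sketch assumes the hex characters are counted, which is an assumption, not part of the proposal.

```python
import hashlib

CRLF = "\r\n"

def write_dmr_chunk(dmr_xml: str) -> str:
    """Encode the DMR chunk as UTF-8 text: a hex length line, a hex MD5
    line, then the DMR itself, each line terminated by CRLF.
    Assumption: the length counts the 32 hex checksum characters plus
    the UTF-8 bytes of the DMR text."""
    body = dmr_xml.encode("utf-8")
    checksum_hex = hashlib.md5(body).hexdigest().upper()  # 32 hex digits
    length = len(checksum_hex) + len(body)
    return f"0x{length:08X}{CRLF}0x{checksum_hex}{CRLF}{dmr_xml}{CRLF}"
```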

4 Discussion

Computing the checksum at the chunk level simplifies processing because no knowledge of the DMR is required. The computation can be carried out completely in the chunk reading and writing code.

Additionally, computing the checksum at the chunk level can potentially detect errors more quickly than at the variable level. To see this, suppose that a variable spans, say, three chunks and there is an error in the first chunk. If variable-level grain is used, all three chunks must be read and processed before the checksum error is detected. If chunk-level grain is used, the error is detected as soon as the first chunk is read; the following two chunks need not be processed.

This proposal also introduces no new error-processing complexity because a checksum error at the chunk level will surface as an IOException to the layers above. Those layers must already be prepared to handle IOExceptions, for example when the client-server link is unexpectedly closed.
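The two points above can be sketched together: a verifier that checks each (checksum, data) pair in stream order stops at the first bad chunk and signals the failure as an ordinary IOError, so a variable spanning three chunks with an error in the first costs only one chunk read. The function and its pair representation are illustrative, not part of the proposal.

```python
import hashlib

def verify_stream(chunks):
    """Verify (md5_digest, data) pairs in arrival order.
    Stops at the first bad chunk, raising IOError, so later chunks
    are never processed; returns the chunk count on success."""
    for i, (checksum, data) in enumerate(chunks, 1):
        if hashlib.md5(data).digest() != checksum:
            raise IOError(f"checksum error in chunk {i}")
    return len(chunks)
```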

Dennis

It was my understanding that checksums were introduced into the specification not as a mechanism for insuring error free data transmittal (although that is a side benefit) but as a mechanism through which clients can detect changes (or the lack thereof) in data content. For example a data provider might reprocess their data holdings because more refined algorithms have become available, but not all variables in the dataset may be affected (thus the last-modified time of the holding is not necessarily the best indicator, at the variable level, of what's going on). If my understanding is accurate I'm thinking that this proposal fails to meet the requirements regarding change tracking at the variable level. If I am wrong, and checksums are all about error free transport, then at first read I think that this is a reasonable change.
ndp 15:44, 1 March 2013 (PST)

Update: 5/3/2013 Dennis
Nathan, you are correct. Ethan made a similar comment. However, in order to make this usable, we need to provide a standard mechanism by which the client can access the checksum. I propose that we add a specially named attribute in the DMR that holds the checksum value for a variable. I propose that the attribute be named "_<checksum algorithm>". So if we are using MD5, the attribute is called "_MD5" [see next update].

The question now becomes: do we need to add any kind of error-checking checksum as well, or do we rely on the network for reliability? Note that using a checksum only for change detection does not require the client to recompute the checksum; it need only expose the value to the user's code (via, e.g., an attribute).
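To illustrate the change-detection use: a client can compare the advertised checksum attribute across two DMR snapshots without ever recomputing it over the data. The function and the attribute-dictionary shape below are hypothetical, sketched only to show that comparison alone suffices.

```python
def variable_changed(old_attrs: dict, new_attrs: dict,
                     algo: str = "_MD5") -> bool:
    """Compare a variable's advertised checksum attribute from two DMR
    snapshots; no recomputation over the data is needed."""
    return old_attrs.get(algo) != new_attrs.get(algo)
```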

Update: 5/3/2013 Dennis
It is not obvious why we should be defaulting to MD5 as the checksum algorithm. If we were doing this to avoid man-in-the-middle spoofing, we should be using SHA1. If our goal is change detection (or even error detection), then we can get by with, say, CRC32, which is both smaller and faster than MD5. The corresponding attribute (see previous update) would be "_CRC32".