DAP4: Overview

From OPeNDAP Documentation

Revision as of 04:56, 26 March 2014

Following two decades of stability and increasing use, DAP2 is being superseded by DAP4, the first substantive revision in the history of the Data Access Protocol (DAP), an open-source endeavor led by OPeNDAP, Inc. The primary and continuing purpose of DAP is to realize remote, selective, data-retrieval as a widely-accepted and well-crafted Web service. This document outlines the fundamental concepts of DAP4, and (targeting those who have already programmed DAP-compatible clients and servers) it highlights how DAP4 differs from DAP2. In the following, DAP refers to DAP4 unless indicated otherwise.

Data Retrieval as a Web Service

The premise underlying DAP4 remains, as in DAP2, that values from data sources—or, notably, from proper subsets—along with pertinent metadata may be acquired remotely and effectively through an appropriately defined Web service, operated near the source data. To a surprising degree, DAP services shield users from idiosyncrasies in source-data formats and storage, so DAP functions as middleware with a further advantage: source-data and users may reside anyplace that has Internet connectivity. OPeNDAP's commitment to open source has fostered several DAP-compatible servers and an even larger number of DAP-compatible client environments, several of which (i.e., servers, clients and client-server libraries) are available at no cost.

DAP is designed for selectively retrieving (but not for storing) data organized as variables or groups of variables. It is well suited to cases where client computers retrieve data stored on remote computers (i.e., servers) networked to the client, especially where source data sets are large but clients typically need only small subsets of them. The protocol is fundamentally stateless (some might say “RESTful”), and it governs how clients pose requests and how servers issue corresponding responses.

DAP’s effectiveness is keyed on the underlying data model. This embraces a rich variety of data types (including multidimensional arrays) and spells out the (type-specific) retrieval operations that clients may request. The simplicity, flexibility and domain-neutrality of the DAP data model (which bears much similarity to that of DAP2) make it effective—as middleware, per the above—across a wide variety of data types and domains. More specifically, a wide variety of data sources, with a wide variety of data schemas, can be mapped onto the DAP model for retrieval and use by client computers and software.

The DAP Data Model

Each DAP server makes accessible a collection of data sources, each identified by a unique (unadorned) URL. As discussed below, clients pose requests by modifying this URL with DAP-specific suffixes and query strings. The following subsections only summarize the formal specification, which takes precedence over anything stated here.

Elements of a DAP Data Source

A DAP data source is fundamentally a collection of variables, which have names, types, dimensions, attributes, and values; attributes and dimensions also may be named. The allowable types are outlined in the ensuing subsection. A variable with several dimensions is a natural and intuitive way to represent multidimensional arrays, and the DAP repertoire of client requests (see the subsection below on that topic) includes ways to retrieve user-specified subarrays.

An attribute is much like a variable except that its purpose is to enable interpretation of the variable to which it is assigned; in contrast, variables contain the primary content of a data source. The scope of an attribute is limited by the variable to which it is assigned; thus, for example, an attribute named Units may be assigned to variables T and V, carrying the distinct values "K" and "m/s" respectively. In contrast, dimensions are essentially named constants, so their values (always integers) are completely independent of the variables with which they are associated.

Variables and their attributes may be collected into named groups (which can be nested to yield hierarchies), and variable names may be reused in multiple groups without generating conflicts. For example, a variable named V appearing in a group named G1 is understood to be distinct from and unrelated to a variable named V appearing in a second group named G2. Dimensions may not be assigned to groups, as their scope is always global, as indicated above.
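The relationships described above can be sketched with a few Python data classes. This is only an illustrative model, not the representation used by any DAP library: the class names, the type names ("Float32", "Int32") and the dimension names and sizes are assumptions chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Variable:
    name: str
    type_name: str                      # illustrative type label
    dimensions: List[str] = field(default_factory=list)       # dimension names
    attributes: Dict[str, object] = field(default_factory=dict)  # scoped to this variable


@dataclass
class Group:
    name: str
    variables: Dict[str, Variable] = field(default_factory=dict)
    groups: Dict[str, "Group"] = field(default_factory=dict)   # nested groups


@dataclass
class DataSource:
    dimensions: Dict[str, int]          # named integer constants, global in scope
    root: Group

    def shape(self, variable: Variable) -> Tuple[int, ...]:
        # A variable's shape follows from the dimensions it refers to.
        return tuple(self.dimensions[d] for d in variable.dimensions)


source = DataSource(
    dimensions={"lat": 180, "lon": 360},
    root=Group(
        name="/",
        variables={"T": Variable("T", "Float32", ["lat", "lon"], {"Units": "K"})},
        groups={
            # The same variable name may appear in two groups without conflict.
            "G1": Group("G1", {"V": Variable("V", "Float32", ["lat", "lon"], {"Units": "m/s"})}),
            "G2": Group("G2", {"V": Variable("V", "Int32", ["lon"])}),
        },
    ),
)
```

Here the attribute Units carries a different value on T than on G1's V, while both variables share the same globally scoped dimensions.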

[?insert a table or tables showing how the above elements (i.e., groups, variables, types, dimensions, shapes) relate to one another?]

Atomic Types, Container Types and Enumeration Types

The specified type of any DAP variable, whether or not it is an array, must be an atomic type, a container type, or an enumerated type as described in the following paragraphs.

Atomic types...

Container types...

Enumeration types...

Requests that May Be Invoked on DAP Data Sources

[more to come here, laying out the DAP concept in a manner that's accessible to those completely unfamiliar...]

Note: Though adoption to-date has been most pronounced in Earth sciences, DAP’s data types and structures are not at all specific to these disciplines, so we believe DAP4 is positioned for effective use in many domains, scientific and otherwise.

The Formal DAP Specification

The DAP4 specification spans two volumes: one describes the Data Model and DAP’s Request/Response objects; the other volume describes how DAP clients and servers communicate via HTTP and the modern Web. New volumes about DAP Extensions will be added as they emerge.

Partitioning the specification into two primary documents reflects the independence of DAP’s data-retrieval functionality from the underlying network transfer protocol. Indeed, DAP could (via extensions) be used with other transports. However, utilizing HTTP eases the building of DAP servers because they can take full advantage of widely used Web-server frameworks such as Apache. Use of Extensions documents will enable evolution of the protocol without the expense and complexity of another major protocol-development project. Anticipated extensions include a JSON encoding for DAP data/metadata and the provision of server functions (beyond DAP’s core subsetting and filtering operations).

[?should we insert here a partial table of contents (with active links) for volume I?]

[?should we insert here a partial table of contents (with active links) for volume II?]

How DAP4 Differs from DAP2

Though the protocol, per se, is maintained primarily by OPeNDAP, many others have engaged in DAP2 realization. One implementation—by Unidata, in the University Corp. for Atmospheric Research—includes the popular THREDDS Data Server (TDS). A key motivation for DAP4, developed jointly by OPeNDAP and Unidata (see "Acknowledgments," below), was to reduce differences that have arisen, and impede interoperability, among DAP2 realizations. Our hope is that a modernized, clearer and more comprehensive specification will facilitate building clients and servers with greater interoperability, making such ventures more rewarding and less risky.

This section covers changes to the data model, response formats, and serialization, giving developers a roadmap to migration from DAP2 to DAP4. E.g., the “Grid” type now supports a notion of discrete functions similar to an OGC (or ISO) Coverage and to the Scientific Data Type found in Unidata’s Common Data Model (CDM). Also from this section, users may learn of functionalities to seek in clients. E.g., DAP4 servers return checksums with each data response, but clients may utilize these in varying degrees.

DAP4 is largely an extension of DAP2 concepts, including ideas that emerged as DAP gained prominence across the Earth sciences. Therefore DAP2-compatible software, in clients or servers, should be easy to adapt to DAP4, and this has been affirmed in the OPeNDAP-Unidata realization and testing work. Furthermore, DAP4 exhibits backward compatibility sufficient to enable gradual transitioning. Substantive changes include support for Groups, yielding greater compatibility with HDF and NetCDF4.

[most or all of the (as yet unedited) material below will be folded into subsections here, probably including:]

Data Model

Responses

Response Encoding

Acknowledgments

DAP4 is the result of a joint, multiyear development effort by OPeNDAP and Unidata, funded by a generous grant from NOAA and guided by an advisory committee comprising Mike Folk (THG), Jim Frew (UCSB), Steve Hankin (NOAA), Eric Kihn (NOAA), Chris Lynnes (NASA) and Rich Signell (USGS).


____unedited material____

DAP4 and Data Access

Data Model

Summary: DAP4 supports generalized coverages and Groups

The DAP4 data model is fundamentally the same as with DAP2. Data are characterized as a collection of variables, each of which has a type, a name and one or more values. As with many programming languages and with DAP2, the types include Bytes, Integers (now including 64-bit integers), Floating point values, Strings, URLs, Structures and Sequences. We have added some new types in DAP4: Enumeration; Opaque; and Group. In addition, we have added Shared Dimensions that serve to indicate relations between different arrays which can be used to build/represent Coverages. In DAP4, Coverages provide a more comprehensive replacement for Grids, with the latter removed from DAP4.

In addition to variables, each data set can contain an arbitrary number of attributes and an arbitrary number of Groups. Attributes are a binding of name, type and value like a variable but are intended to hold metadata about the dataset and about each variable it contains. Groups provide a way to organize collections of variables and to encode these kinds of relationships when they are present in the underlying data store.

Migrating from DAP2 to DAP4

For servers: A DAP2 DDS/DAS (or DDX) is very close to a DAP4 DMR. The set of datatypes supported by DAP4 is almost a proper superset of those in DAP2, the exception being that DAP2's Grid type has been removed and replaced by the Coverage. A Coverage is not a type per se; instead, it is a binding of two or more arrays using Shared Dimensions. Thus, to transform a DAP2 Grid into a Coverage for DAP4, the dimensions from the Grid's Maps will have to be extracted and used to make Shared Dimensions in the DMR. However, the DAP4 Coverage model completely subsumes DAP2 Grids, so it will be easy to represent Grids in DAP4.
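The Grid-to-Coverage transformation can be sketched as follows. The dictionary layout used for the Grid and for the result is a hypothetical in-memory representation invented for this example; it is not the DDS or DMR syntax, and the function name is likewise an assumption.

```python
def grid_to_coverage(grid):
    """Turn a Grid-like dict into (shared_dimensions, variables).

    Assumed input layout (hypothetical):
      {"array": {"name", "type", "shape"},
       "maps":  [{"name", "type", "size"}, ...]}   # one map per array dimension
    """
    shared_dims = {}
    variables = []

    # Each Map becomes a Shared Dimension plus a 1-D variable that uses it.
    for m in grid["maps"]:
        shared_dims[m["name"]] = m["size"]
        variables.append({"name": m["name"], "type": m["type"], "dims": [m["name"]]})

    # The Grid's array refers to the same Shared Dimensions, in Map order.
    array = grid["array"]
    dim_names = [m["name"] for m in grid["maps"]]
    if [shared_dims[n] for n in dim_names] != list(array["shape"]):
        raise ValueError("map sizes do not match the array shape")
    variables.append({"name": array["name"], "type": array["type"], "dims": dim_names})

    return shared_dims, variables


sst_grid = {
    "array": {"name": "sst", "type": "Float32", "shape": [3, 4]},
    "maps": [
        {"name": "lat", "type": "Float64", "size": 3},
        {"name": "lon", "type": "Float64", "size": 4},
    ],
}
dims, coverage_vars = grid_to_coverage(sst_grid)
```

In the result, the relation between sst, lat and lon is expressed only through the dimension names they share, which is the sense in which a Coverage is a binding rather than a type.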

For clients: Some of the new data types are more challenging to implement than the types included with DAP2. Of particular note are Enumerations and Coverages.

Responses

Summary:

  • DAP4 includes only one dataset metadata response, not two;
  • Several Sequences may be individually constrained in one access;
  • Predictable behavior for URLs
  • Asynchronous responses

In DAP4 there is a single XML document that encodes the metadata for a data source. This response is conceptually similar to, and in some ways identical to, the DDX response that is supported by many DAP2 servers, so its organization will already be familiar to many people. As with DAP2, there is one data response that can be modified (constrained) using an expression to limit the information it includes. The basic concepts of slicing an array are present, using the same essential notation. We have taken care to allow servers to extend this, which is covered in a bit more detail below under web services. We have replaced the selection part of the DAP2 constraint expression with a filter sub-expression that is applied to a specific variable. This enables two or more Sequences to have different filtering operations applied, which was not possible before. Our expanded constraint language also provides a way to subset coverages, and a proposed extension to the filtering sub-expression provides a way to subset arrays/coverages by value.
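As a reminder of the array-slicing notation carried over from DAP2, the sketch below parses bracketed index ranges of the form [start:stride:stop], [start:stop] or [index], in which the stop index is inclusive, and converts them to Python slices. It assumes only that familiar bracket form; the function name is hypothetical, and the full DAP4 constraint syntax (filters, coverage subsetting) is defined in the specification and is not modeled here.

```python
import re

# One bracketed range: [index], [start:stop] or [start:stride:stop]
_RANGE = re.compile(r"\[(\d+)(?::(\d+))?(?::(\d+))?\]")


def parse_hyperslab(text):
    """Return one Python slice per bracketed range, e.g. '[0:2:8][3]'."""
    slices = []
    pos = 0
    for match in _RANGE.finditer(text):
        if match.start() != pos:
            raise ValueError(f"unexpected text at position {pos}")
        pos = match.end()
        first, second, third = match.groups()
        start = int(first)
        if second is None:            # [index]
            stride, stop = 1, start
        elif third is None:           # [start:stop]
            stride, stop = 1, int(second)
        else:                         # [start:stride:stop]
            stride, stop = int(second), int(third)
        if stride < 1 or stop < start:
            raise ValueError(f"invalid range {match.group(0)}")
        # The DAP stop index is inclusive; Python's is exclusive.
        slices.append(slice(start, stop + 1, stride))
    if pos != len(text) or not slices:
        raise ValueError("not a valid index expression")
    return slices


values = list(range(10))
every_other = values[parse_hyperslab("[0:2:8]")[0]]
```

Because each bracket maps to one dimension, a multidimensional request such as "[0:1][2:4]" yields one slice per dimension, in order.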

We wanted DAP4 to fully embrace REST. DAP2, even though it predates the term, included many, but not all, of the REST architecture's features. One change from DAP2 was to explicitly define what happens when a client dereferences a 'bare URL' (one without an extension used to ask for a specific DAP4 response). When a DAP4 server is asked to return information at a bare URL, the result is a Dataset Services Response (DSR), which contains links to all of the other responses for that dataset. In addition, the DSR may contain other information, such as server operations that can be used with the dataset (and perhaps only with that particular dataset). The DSR is an XML document but can contain a stylesheet that transforms it to HTML for a web browser.

DAP4 servers also support asynchronous access to data, which enables access to data in near-line devices and can be used for some server processing operations (e.g., operations that take a long time to perform). Asynchronous access is accomplished by combining two things: a switch in the request, which tells the server that the client accepts that the response may not be immediate, and a response that, instead of the requested data, contains a URL at which the data will become available in the future.
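The client side of this exchange can be sketched as a small polling loop. Everything concrete here is an assumption made for illustration: the reply dictionaries, the accept_async flag, the "/pending/1" URL and the FakeServer class stand in for whatever request switch and response document the specification actually defines.

```python
def fetch_with_async(url, send, accept_async=True, max_polls=5, wait=lambda seconds: None):
    """Request url via send(); follow 'come back later' replies until data arrives."""
    reply = send(url, accept_async)
    polls = 0
    while reply["kind"] == "async":
        if polls >= max_polls:
            raise TimeoutError("response still not ready")
        wait(reply.get("retry_after", 0))          # pause before asking again
        reply = send(reply["url"], accept_async)   # ask at the URL the server gave us
        polls += 1
    return reply["data"]


class FakeServer:
    """Stand-in for a server whose response is ready only after a few calls."""

    def __init__(self, ready_after):
        self.ready_after = ready_after
        self.calls = 0

    def send(self, url, accept_async):
        self.calls += 1
        if url.endswith("/pending/1"):
            if self.calls > self.ready_after:
                return {"kind": "data", "data": b"values"}
            return {"kind": "async", "url": url}
        if not accept_async:
            raise RuntimeError("response is not immediately available")
        return {"kind": "async", "url": url + "/pending/1"}


server = FakeServer(ready_after=2)
result = fetch_with_async("http://example.org/data/big.nc", server.send)
```

A client that does not set the switch would, in this sketch, receive an error rather than a deferred response, which is why existing clients need changes to reach data available only asynchronously.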

Migrating from DAP2 to DAP4

  • If your server or client already reads DAP2 DDX responses (which were never part of the official protocol but are widely used) then adapting to the DMR will be very easy since they are very close in structure.
  • Support for the new constraints may take a bit more work since the Constraint Expression and Server Functions have now been separated.
  • Clients will benefit from asynchronous response support, but this is a new behavior and may take some serious thought, particularly for clients that relied on the simpler semantics borrowed from file system accesses.

Response Encoding

Summary:

  • Checksums for data values;
  • Reliable delivery of error messages to clients;
  • Encode data using the server's native word order.

We have made three changes to the encoding of returned data values. First, all top-level variables in a data response now include a CRC32 checksum of their values. This lets clients check whether the same request returns the same data values (for example, to detect that the data have changed). The checksum values are encoded in Attributes bound to the returned variables. Second, we have added an encoding scheme for data values that preserves compactness yet allows clients to easily detect when a server has encountered an error while sending a response. Third, we have adopted a Reader Make Right encoding scheme instead of the network byte order scheme used by DAP2. This last change has become more and more important as the predominance of little-endian processors has increased.
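The checksum idea can be illustrated with Python's standard zlib.crc32. This is a sketch only: exactly which bytes of a variable's serialized values are covered by the checksum is defined by the specification, and the packing format used here (little-endian 32-bit floats) and the function name are assumptions for the example.

```python
import struct
import zlib


def values_checksum(values, fmt="<f"):
    """CRC32 over the packed bytes of a sequence of numbers."""
    payload = b"".join(struct.pack(fmt, v) for v in values)
    return zlib.crc32(payload) & 0xFFFFFFFF   # unsigned 32-bit result


first = values_checksum([271.5, 272.0, 273.25])
repeat = values_checksum([271.5, 272.0, 273.25])
changed = values_checksum([271.5, 272.0, 274.0])
```

A client that stores the checksum from an earlier response can compare it with the one in a later response to the same request; equal checksums suggest the values are unchanged, while a different checksum shows that they are not.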

Migrating from DAP2 to DAP4

In many ways the encoding scheme is simpler for servers because the data response uses the server's native byte order. Clients must detect the byte order and swap bytes as needed. However, the server must correctly implement the chunking protocol used by the data response and must correctly compute CRC32 checksums for each of the top-level variables.
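On the client side, "reader makes right" amounts to decoding with the sender's byte order rather than a fixed network order. The sketch below assumes the client has already learned the sender's byte order from the response (how that is signaled is defined by the specification and not shown here); the function name and the choice of 32-bit integers are illustrative.

```python
import struct


def decode_int32s(payload, sender_is_little_endian):
    """Decode packed 32-bit signed integers written in the sender's byte order."""
    if len(payload) % 4 != 0:
        raise ValueError("payload length is not a multiple of 4 bytes")
    prefix = "<" if sender_is_little_endian else ">"
    count = len(payload) // 4
    # struct converts from the stated order to native integers, swapping if needed.
    return list(struct.unpack(f"{prefix}{count}i", payload))


from_little = decode_int32s(struct.pack("<3i", 1, 2, 3), sender_is_little_endian=True)
from_big = decode_int32s(struct.pack(">3i", 1, 2, 3), sender_is_little_endian=False)
```

When client and server share the same byte order, which is the common case on little-endian hardware, no swapping occurs at either end.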

How DAP4 Works with HTTP

Summary: DAP4 comes closer to the REST (Representational State Transfer) architecture and uses HATEOAS (hypermedia as the engine of application state), making all of the server's responses explicit via links in a document.

While DAP2 interwove DAP and HTTP, using, for example, some of the HTTP headers as the only source of information that was critical to DAP itself, DAP4 does not. Instead, DAP4 is completely isolated from HTTP, enabling it to work with other protocols without change. This does not mean that DAP4 does not use HTTP, only that it does not rely on it, making it simple to implement DAP4 servers that use a different protocol for transport (AMQP, etc.). However, inasmuch as HTTP is a ubiquitous network transport protocol, the DAP4 specification includes a volume devoted solely to how a server should implement DAP4 using HTTP.

The REST interface for the protocol is described in Volume 2, Web Services, of the specification. DAP4 requires that a server implement at least three responses for each dataset: the DSR, the DMR, and the Data response. The DSR is an XML document that provides a capabilities response for the dataset. This document provides links to all of the other responses available for the dataset, along with other information. The DSR provides information about alternative encodings for the different responses in addition to enumerating the basic responses themselves. The DSR may also list server functions that may be used with/on the dataset.
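The way a client might derive request URLs for the three basic responses from a dataset's bare URL can be sketched as below. The bare URL yielding the DSR follows from the text above; the ".dmr" and ".dap" suffixes, the example URL and the function name are assumptions for illustration, and a well-behaved client would take the actual links from the DSR itself.

```python
# Assumed suffixes; in practice the DSR lists the authoritative links.
SUFFIXES = {
    "dsr": "",       # the bare URL returns the Dataset Services Response
    "dmr": ".dmr",   # dataset metadata (assumed suffix)
    "data": ".dap",  # data response (assumed suffix)
}


def response_url(bare_url, response):
    """Build the URL for one of the three basic responses of a dataset."""
    try:
        return bare_url + SUFFIXES[response]
    except KeyError:
        raise ValueError(f"unknown response kind: {response!r}") from None


bare = "http://example.org/opendap/sst.nc"
urls = {kind: response_url(bare, kind) for kind in SUFFIXES}
```

Starting from the DSR and following its links, rather than hard-coding suffixes as this sketch does, is what the HATEOAS approach described in the summary is meant to enable.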

DAP4 servers are encouraged to support HTTP content negotiation, providing the standard DSR, DMR and Data responses in a variety of forms.

Migrating from DAP2 to DAP4

The web service for DAP4 will likely need to be written from scratch, but the good news is that such services are easy to write. For clients, the behavioral differences between DAP2 and DAP4 servers are small, with two exceptions. Since DAP4 supports asynchronous responses, clients will need to be modified to access data available only using this new feature. DAP4 also supports content negotiation, and that means a larger number of ways to get the different responses (even though each protocol has three basic responses).