DAP Design: shared dimensions, groups and types: Difference between revisions

From OPeNDAP Documentation

Revision as of 03:19, 15 April 2009


Definitions

Type definition
In the DDX all items are declarations which describe things actually defined (assigned or otherwise associated with values) elsewhere. The term Type Definitions refers to the representation of something in a data set which defines a data type; the types are defined in the data set and the DDX merely holds a representation of that definition - a declaration.
Dimension
A name bound to a size, e.g., "lon" has a size of 1024.
Coordinate variable
A name bound to both a size and a datatype, e.g., "height" is a vector of ten 32-bit floating point numbers, or "latitude" is an array of 1024 by 1024 32-bit floating point numbers.
Grid
One or more N-dimensional arrays of values bound to 1 to N coordinate variables.

DDX Document Organization

DAP and the DDX will be extended to include Groups, Shared dimensions and user-defined types. Groups will be added as a kind of constructor-type with properties similar to Structure and to Java or C++ namespaces. Unlike Structure, Groups cannot be dimensioned.

A rough syntax which describes how these additions will fit into the DAP and the existing DDX Notation is (Replace with XML schema):

Dataset :== Groups
Groups :== null | Group Groups
Group :== Types Dimensions Attributes Variables
Types :== null | Type Types
Dimensions :== null | Dimension Dimensions
Attributes :== null | Attribute Attributes
Variables :== null | Variable Variables

This pseudo-grammar does not capture what can be produced for a Group, et cetera. Instead it shows how these sections of the DDX must be organized. It also does not show that a valid Dataset can have only Types (user-defined types) and does not need to have variables, but it must have one or the other or both.
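The ordering rule can be illustrated with a small Python sketch. It is not part of the design: the flat list of section kinds and the function name check_group_sections are hypothetical, and the Types-or-Variables rule is applied to a single Group standing in for the whole Dataset.

```python
# Canonical section order taken from the pseudo-grammar:
#   Group :== Types Dimensions Attributes Variables
SECTION_ORDER = ["Types", "Dimensions", "Attributes", "Variables"]


def check_group_sections(sections):
    """Return a list of problems found in a Group's section sequence.

    `sections` is a hypothetical flat list of section kinds in document
    order, e.g. ["Dimensions", "Variables", "Variables"].
    """
    problems = []
    last_rank = -1
    for kind in sections:
        if kind not in SECTION_ORDER:
            problems.append(f"unknown section: {kind}")
            continue
        rank = SECTION_ORDER.index(kind)
        if rank < last_rank:
            # A section appeared after one that must follow it.
            problems.append(f"{kind} appears after {SECTION_ORDER[last_rank]}")
        last_rank = max(last_rank, rank)
    # The text requires Types, Variables, or both.
    if "Types" not in sections and "Variables" not in sections:
        problems.append("needs at least one Type or Variable")
    return problems


print(check_group_sections(["Dimensions", "Variables"]))
print(check_group_sections(["Variables", "Dimensions"]))
```

An empty list means the sequence respects the ordering; a real validator would work on the parsed DDX rather than on a list of labels.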

Group

The DDX will be modified so that it contains one or more Groups. If only one Group is present (which describes the case for DAP 3.2 and earlier) then the declaration can be left out, but if there are two or more groups, the declarations must be present.

Group characteristics:

  • Any configuration of Groups other than one (anonymous) Group which holds all the variables in a data set must be declared.
  • If declared, Groups must be named.
  • A Group can contain any object, including a Group
  • Variables and Attributes are named using / <group name> / ... / <variable name> to reflect their hierarchy.
  • Each Group declares a new lexical scope for values.
  • A Group cannot be an Array or a Grid (although the distinction between those two might become blurred or non-existent; Group is fundamentally a scalar container-type).
  • This definition does not completely subsume the HDF5 Group type but is equivalent to the netCDF 4 version of it.

Examples:

This data set contains one Group - the root group - which has by convention the name '/'

<Dataset ... >
    ...
</Dataset>

This data set contains two Groups, one after the other.

<Dataset ... >
<group name="primary">
    ...
</group>
<group name="secondary">
    ...
</group>
</Dataset>

This data set contains more Groups, and shows they can be nested.

<Dataset ... >
<group name="primary">
    ...
    <group name="in_situ">
        ...
    </group>

</group>
<group name="secondary">
    ...
</group>
</Dataset>
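
As a rough illustration of the / <group name> / ... / <variable name> naming rule, the following Python sketch walks a nested layout like the one above and prints the path of each variable. The Int32 variables and their names are invented for the example; only the group nesting comes from the text.

```python
import xml.etree.ElementTree as ET

# Hypothetical DDX fragment: the <Int32> variables are assumptions made
# for illustration; the <group> nesting mirrors the example in the text.
DDX = """
<Dataset name="example">
  <group name="primary">
    <Int32 name="uwnd"/>
    <group name="in_situ">
      <Int32 name="temp"/>
    </group>
  </group>
  <group name="secondary">
    <Int32 name="uwnd"/>
  </group>
</Dataset>
"""


def variable_paths(element, prefix=""):
    """Yield /<group name>/.../<variable name> for every variable."""
    for child in element:
        name = child.get("name")
        if child.tag == "group":
            # Each Group opens a new lexical scope, so extend the prefix.
            yield from variable_paths(child, f"{prefix}/{name}")
        else:
            yield f"{prefix}/{name}"


paths = list(variable_paths(ET.fromstring(DDX)))
print(paths)
```

Because each Group is its own scope, the two variables named uwnd do not collide: they resolve to /primary/uwnd and /secondary/uwnd.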

Discussion

In the past we have often talked about Dataset as a kind of Structure, but implicitly it is not exactly the same, since there cannot be an Array of datasets. The Group type captures this semantic distinction.

In HDF5, the Group object is modeled after a general graph, but here it uses a strict hierarchy, which simplifies both servers and clients while retaining most of the utility of the HDF5 data type.

Shared Dimensions

Background on shared dimensions and coordinate variables

From an email exchange, John Caron wrote:

James:

Is it that an dimension is a formal declaration of an independent parameter?

John:

I know that some people prefer that interpretation. My own opinion is that's it more complicated.

Abstractly, I think its reasonable to say that the number of dimensions of a variable indicates its dimensionality in the topological sense. I think its necessary to allow "independent variables" to have topological dimensionality > 1. eg lat(x,y), lon(x,y). lat and lon can still be considered independent variables, but they are not orthogonal. Neither is associated exclusively with one dimension.

Concretely, dimensions are used for all sorts of reasons, and are not just about topological dimensionality. For instance, they control the grouping of data and the layout of files. So in real files, you see this mixture of uses.

That's why the explicit assignment of coord variables is needed, which makes your Grid attractive, because that's a way of explicitly saying what the independent variables are. One needs shared dimensions between data and coordinate variables, so that one can unambiguously assign coordinate values to a data value.

The downsides of using Grid for this purpose:

  • the name "Grid" connotes gridded data, eg model data, and this shared dimension thing is needed for other types of data, eg point data.
  • If Grid scopes the dimension, then all variables sharing a dimension have to be contained in the grid. So its impossible to have some dimensions globally shared, and others locally shared.

So my preference would be to use Groups to scope shared dimensions, rather than Grids. But still use Grids (or some evolution of Grids) to assign coordinate variables to data variables.

Dimensions

Shared dimensions will be added to DAP in the dimensions section of the Dataset or Group objects. Each dimension will consist of a name and a size.

 
<dimensions>
    <dim name="lat" size="1024"/>
    <dim name="lon" size="1024"/>
</dimensions>

Characteristics of dimensions:

  • Dimensions are not associated with a data type.
  • Dimensions do not have attributes.
  • Dimensions bound to a type define coordinate variables.
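
A minimal Python sketch of how a client might read the dimensions section shown above. Since a dimension carries no type and no attributes, a plain name-to-size mapping is enough to hold it. The function name read_dimensions and the duplicate-name check are assumptions, not something this design specifies.

```python
import xml.etree.ElementTree as ET

# The dimensions section exactly as written in the text.
DIMENSIONS = """
<dimensions>
    <dim name="lat" size="1024"/>
    <dim name="lon" size="1024"/>
</dimensions>
"""


def read_dimensions(xml_text):
    """Map each dimension name to its integer size."""
    sizes = {}
    for dim in ET.fromstring(xml_text).findall("dim"):
        name = dim.get("name")
        # Assumption: a name may be declared only once per scope.
        if name in sizes:
            raise ValueError(f"dimension {name!r} declared twice")
        sizes[name] = int(dim.get("size"))
    return sizes


dims = read_dimensions(DIMENSIONS)
print(dims)
```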

Coordinate variables and Grids

While dimensions are scoped at the Dataset or Group level, coordinate variables are defined at the level of a Grid object. Grid objects in DAP4 are different from those in DAP2 in two ways beyond using (shared) dimensions scoped at the Dataset/Group level:

  1. Each Grid object may hold more than one Array (what is often a dependent variable); and
  2. Each Array within a Grid is not constrained to use all of the Grid's coordinate variables.

N.B: Coordinate variables in a Grid object are called Maps to conform to the old nomenclature.

As with Grids in DAP2, each Grid object defines a lexical scope.

Here is a very simple Grid object:

<grid>
    <map name="lon" dim="lon" type="Float32"/>
    <map name="lat" dim="lat" type="Float32"/>

    <array name="SST">
        <Byte/>
        <map name="lon"/>
        <map name="lat"/>
    </array>
</grid>

Notes:

  1. The map object may have the same name as a dimension object.
  2. Map objects may have attributes.
  3. In an array object, <map...> elements are used to specify the array's dimensions; the word dimension is avoided to cut down on confusion.
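
The relationship between shared dimensions, Maps and arrays can be sketched in Python as follows. This is an illustration under assumptions, not a reference implementation: the DIMS dictionary stands in for dimensions declared at the Dataset/Group level, the function name array_shapes is hypothetical, and the map references inside the array are written as self-closing elements so that the fragment parses as XML.

```python
import xml.etree.ElementTree as ET

# Shared dimensions as they would be declared at the Dataset/Group level.
DIMS = {"lon": 1024, "lat": 1024}

# The simple Grid from the text, with self-closing <map/> references.
GRID = """
<grid>
    <map name="lon" dim="lon" type="Float32"/>
    <map name="lat" dim="lat" type="Float32"/>
    <array name="SST">
        <Byte/>
        <map name="lon"/>
        <map name="lat"/>
    </array>
</grid>
"""


def array_shapes(grid_xml, dims):
    """Return {array name: shape}, resolving each map via its dimension."""
    grid = ET.fromstring(grid_xml)
    # Grid-level maps only (direct children): map name -> dimension name.
    map_dim = {m.get("name"): m.get("dim") for m in grid.findall("map")}
    shapes = {}
    for array in grid.findall("array"):
        shape = []
        for ref in array.findall("map"):
            name = ref.get("name")
            if name not in map_dim:
                raise ValueError(
                    f"{array.get('name')} uses undeclared map {name!r}")
            shape.append(dims[map_dim[name]])
        shapes[array.get("name")] = tuple(shape)
    return shapes


shapes = array_shapes(GRID, DIMS)
print(shapes)
```

An array that references only some of the Grid's maps resolves to a lower-rank shape, which matches the rule that an Array need not use every coordinate variable.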

A more complex Grid object:

<dataset>
    <dim name="pt" size="4096"/>

    <grid>
        <map name="longitude" dim="pt" type="Float32"/>
        <map name="latitude" dim="pt" type="Float32"/>
        <map name="altitude" dim="pt" type="Float32"/>
        <map name="time" dim="pt" type="Float32">
            << attributes >> <!-- The syntax for attributes is in flux -->
        </map>

        <array name="Radioactivity">
            << attributes >> <!-- for example, scale_factor and add_offset -->
            <Byte/>
            <map name="longitude"/>
            <map name="latitude"/>
            <map name="altitude"/>
            <map name="time"/>
        </array>

        <array name="surface_temp">
            << attributes >>
            <Float64/>
            <map name="longitude"/>
            <map name="latitude"/>
            <map name="time"/>
        </array>
    </grid>
</dataset>

Using Grids

This is an alternative representation for dimensions using the Grid data type.

Dimensions can be introduced at the start of a Group (including the start of the implicit Group that exists at the Dataset level for a Dataset that does not declare any explicit Groups). This provides a way for dimensions to be shared between several different data arrays, and these arrays are effectively a DAP 2 Grid variable. An alternative is to use the Grid type, equate its Maps with dimensions, and allow for one or more arrays within the Grid. The change to Grid is to allow more than one array in the Array: section. In addition, to capture the semantics of shared dimensions, the arrays within a Grid would be freed from the restrictions that all declared Maps be used by each array and that each dimension of every array be a Map.

The advantage of this design is that it does not tempt server/handler writers to use Groups to build lexical scopes, since there are some data files (HDF5 only for now, but soon NetCDF4) which use the Group. If we modify DAP to encode those HDF5/NetCDF4 Groups as DAP Groups and use Grid in this way, then on the client side there is a better chance that data sets will have better fidelity: we can expect to save the result of a request as an HDF5 or NetCDF4 file and get closer to the original data set if its storage format was either HDF5 or NetCDF4.

A disadvantage is that the token Grid has one meaning in DAP2, ..., 3.2 and a different meaning in 3.3, ..., 4. This is likely not a serious problem, because the changes we are contemplating are large enough that parsers and processing software will in any case have to either recognize and work with both versions or reject the version they do not recognize.

Examples:

<Grid name="Data">
    <map name="longitude">
        <dimension size="1024"/>
    </map>
    <map name="latitude">
        <dimension size="1024"/>
    </map>
    <map name="height">
        <dimension size="7"/>
    </map>

    <Byte name="AIRT">
        <map name="longitude"/>
        <map name="latitude"/>
        <map name="height"/>
    </Byte>

    <!-- This array doesn't use all the maps -->
    <Byte name="SST">
        <map name="longitude"/>
        <map name="latitude"/>
    </Byte>

    ...
</Grid>

Types