DAP Design: shared dimensions, groups and types: Difference between revisions

Latest revision as of 21:07, 24 April 2009

Definitions

Type definition: In the DDX all items are declarations which describe things actually defined (assigned or otherwise associated with values) elsewhere. The term Type Definitions refers to the representation of something in a data set which defines a data type; the types are defined in the data set and the DDX merely holds a representation of that definition - a declaration.

Dimension: A name bound to a size, e.g., "lon" has a size of 1024

Coordinate variable: A name bound to both a dimension and a datatype, e.g., "height" is a vector, of dimension "height", of 32-bit floating point numbers, and "latitude" is an array, of dimension "x" by dimension "y", of 32-bit floating point numbers.

Grid: One or more N-dimensional arrays of values bound to 1 to N coordinate variables.

DDX Document Organization

DAP and the DDX will be extended to include Groups, Shared dimensions and user-defined types. Groups will be added as a kind of constructor-type with properties similar to Structure and to Java or C++ namespaces. Unlike Structure, Groups cannot be dimensioned.

A rough syntax which describes how these additions will fit into the DAP and the existing DDX Notation is (Replace with XML schema):

Dataset :== Groups
Groups :== null | Group Groups
Group :== Types Dimensions Attributes Variables Groups
Types :== null | Type Types
Dimensions :== null | Dimension Dimensions
Attributes :== null | Attribute Attributes
Variables :== null | Variable Variables

This pseudo-grammar does not capture what can be produced for a Group, et cetera. Instead it shows how these sections of the DDX must be organized. It also does not show that a valid Dataset can have only Types (user-defined types) and does not need to have variables, but it must have one or the other or both.
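The ordering that the pseudo-grammar imposes can be sketched as a small check. This is only an illustrative Python sketch, not part of DAP: the names SECTION_ORDER, sections_in_order and has_types_or_variables are hypothetical, and the contents of a Group are modeled as a flat list of item kinds in document order.

```python
# Order fixed by the pseudo-grammar: Types Dimensions Attributes Variables Groups.
# The lowercase kind names are an assumption made for this sketch.
SECTION_ORDER = ["type", "dimension", "attribute", "variable", "group"]


def sections_in_order(kinds):
    """Return True if the item kinds appear in the order the grammar requires.

    Any section may be empty, so only the relative order is checked.
    An unknown kind raises ValueError.
    """
    ranks = [SECTION_ORDER.index(kind) for kind in kinds]
    return all(a <= b for a, b in zip(ranks, ranks[1:]))


def has_types_or_variables(kinds):
    """Return True if at least one Type or Variable is present."""
    return any(kind in ("type", "variable") for kind in kinds)


well_ordered = ["type", "dimension", "dimension", "variable", "group"]
mis_ordered = ["variable", "dimension"]

print(sections_in_order(well_ordered))
print(sections_in_order(mis_ordered))
print(has_types_or_variables(["dimension", "attribute"]))
```

The check is deliberately shallow: it says nothing about what each item contains, which matches the stated purpose of the pseudo-grammar.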

Group

The DDX will be modified so that it contains one or more Groups. If only one Group is present (which describes the case for DAP 3.2 and earlier) then the declaration can be left out, but if there are two or more groups, the declarations must be present.

Group characteristics:

Any configuration of Groups other than one (anonymous) Group which holds all the variables in a data set must be declared.
If declared, Groups must be named.
A Group can contain any object, including a Group
Variables and Attributes are named using / <group name> / ... / <variable name> to reflect their hierarchy.
Each Group declares a new lexical scope for values.
A Group cannot be an Array or a Grid (although the distinction between those two might become blurred or non-existent; Group is fundamentally a scalar container-type).
This definition does not completely subsume the HDF5 Group type but is equivalent to the netCDF 4 version of it.

Examples:

This data set contains one Group - the root group - which has by convention the name '/'

<Dataset ... >
    ...
</Dataset>

This data set contains two Groups, one after the other.

<Dataset ... >
<group name="primary">
    ...
</group>
<group name="secondary">
    ...
</group>
</Dataset>

This data set contains more Groups, and shows they can be nested.

<Dataset ... >
<group name="primary">
    ...
    <group name="in_situ">
        ...
    </group>

</group>
<group name="secondary">
    ...
</group>
</Dataset>
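
The path-style naming described under Group characteristics can be sketched against the nested example. This is an illustrative Python sketch using the standard library's xml.etree.ElementTree; the variable elements and their names (sst, buoy_temp, wind) are hypothetical fillers for the elided content, and the function name variable_paths is an assumption.

```python
import xml.etree.ElementTree as ET

# The nested-group example, with hypothetical <variable> elements filled in.
DDX = """
<Dataset>
  <group name="primary">
    <variable name="sst"/>
    <group name="in_situ">
      <variable name="buoy_temp"/>
    </group>
  </group>
  <group name="secondary">
    <variable name="wind"/>
  </group>
</Dataset>
"""


def variable_paths(element, prefix=""):
    """Yield the /group/.../variable path of every variable under element."""
    for child in element:
        if child.tag == "group":
            # Each nested Group contributes one path component.
            yield from variable_paths(child, prefix + "/" + child.get("name"))
        elif child.tag == "variable":
            yield prefix + "/" + child.get("name")


paths = list(variable_paths(ET.fromstring(DDX)))
print(paths)
# ['/primary/sst', '/primary/in_situ/buoy_temp', '/secondary/wind']
```

A variable held directly by the Dataset element would come out as "/name", which is consistent with the root group being named '/'.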

Discussion

In the past we have often talked about Dataset as a kind of Structure, but implicitly it is not exactly the same since there cannot be an Array of datasets. The Group type captures this semantic distinction.

In HDF5, the Group object is modeled after a general graph, but here it uses a strict hierarchy, which simplifies both servers and clients while retaining most of the utility of the HDF5 data type.

Shared Dimensions

Background on shared dimensions and coordinate variables

From an email exchange, John Caron wrote:

James:

Is it that an dimension is a formal declaration of an independent parameter?

John:

I know that some people prefer that interpretation. My own opinion is that's it more complicated.
Abstractly, I think its reasonable to say that the number of dimensions of a variable indicates its dimensionality in the topological sense. I think its necessary to allow "independent variables" to have topological dimensionality > 1. eg lat(x,y), lon(x,y). lat and lon can still be considered independent variables, but they are not orthogonal. Neither is associated exclusively with one dimension.

Concretely, dimensions are used for all sorts of reasons, and are not just about topological dimensionality. For instance, they control the grouping of data and the layout of files. So in real files, you see this mixture of uses.

That's why the explicit assignment of coord variables is needed, which makes your Grid attractive, because that's a way of explicitly saying what the independent variables are. One needs shared dimensions between data and coordinate variables, so that one can unambiguously assign coordinate values to a data value.

The downsides of using Grid for this purpose:

the name "Grid" connotes gridded data, eg model data, and this shared dimension thing is needed for other types of data, eg point data.

If Grid scopes the dimension, then all variables sharing a dimension have to be contained in the grid. So its impossible to have some dimensions globally shared, and others locally shared.

So my preference would be to use Groups to scope shared dimensions, rather than Grids. But still use Grids (or some evolution of Grids) to assign coordinate variables to data variables.

Dimensions

Shared dimensions will be added to DAP in the dimensions section of the Dataset or Group objects. Each dimension will consist of a name and a size.

 
<dimension name="lat" size="1024"/>
<dimension name="lon" size="1024"/>

Characteristics of dimensions:

Dimensions are not associated with a data type.
Dimensions do not have attributes.
Dimensions bound to a type define coordinate variables.
Shared dimensions may be used by both Grids and Arrays.

Coordinate variables and Grids

While dimensions are scoped at the Dataset or Group level, coordinate variables are defined at the level of a Grid object. Grid objects in DAP4 are different from those in DAP2 in three ways beyond using (shared) dimensions:

Each Grid object may hold more than one Array (what is often a dependent variable);
Maps (often independent variables) may have more than one dimension; and
Each Array within a Grid is not constrained to use all of the Grid's coordinate variables.

N.B: Coordinate variables in a Grid object are called Maps to conform to the old nomenclature and to avoid (re)using the word dimension.

Features of the DAP4 and DAP2 Grid object:

Each Grid object defines a lexical scope.
There is an explicit relation between the Grid object's maps (coordinate variables) and the indicial extents of the array.

A very simple Grid object

<grid>
    <map name="lon" dim="lon" type="Float32"/>
    <map name="lat" dim="lat" type="Float32"/>

    <array name="SST">
        <Byte/>
        <map name="lon"/>
        <map name="lat"/>
    </array>
</grid>

Notes:

The map object may have the same name as a dimension object.
Map objects may have attributes, even though they are not shown in the example.
In a Grid's array object, <map...> elements are used to specify the array's dimensions; the word dimension is avoided to cut down on confusion.
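
The chain from an array's map references, through the Grid's Maps, to the shared dimensions can be sketched as follows. This is an illustrative Python sketch, not a DAP implementation: it assumes the simple Grid above is wrapped in a dataset element together with the two dimension declarations from the Dimensions section, it handles only Maps that carry a single dim attribute, and the function name array_shapes is hypothetical.

```python
import xml.etree.ElementTree as ET

# The simple Grid, wrapped with the shared dimensions it refers to.
DDX = """
<dataset>
  <dimension name="lat" size="1024"/>
  <dimension name="lon" size="1024"/>
  <grid>
    <map name="lon" dim="lon" type="Float32"/>
    <map name="lat" dim="lat" type="Float32"/>
    <array name="SST">
      <Byte/>
      <map name="lon"/>
      <map name="lat"/>
    </array>
  </grid>
</dataset>
"""


def array_shapes(dataset):
    """Map each Grid array name to its shape, resolved via Maps and dimensions."""
    sizes = {d.get("name"): int(d.get("size")) for d in dataset.findall("dimension")}
    shapes = {}
    for grid in dataset.findall("grid"):
        # Maps declared directly under the grid bind a map name to a dimension.
        map_dim = {m.get("name"): m.get("dim") for m in grid.findall("map")}
        for array in grid.findall("array"):
            # Inside an array, <map> elements only reference Maps by name.
            refs = [m.get("name") for m in array.findall("map")]
            shapes[array.get("name")] = tuple(sizes[map_dim[r]] for r in refs)
    return shapes


shapes = array_shapes(ET.fromstring(DDX))
print(shapes)
# {'SST': (1024, 1024)}
```

Because findall looks only at direct children, the Maps declared by the grid and the map references inside an array are kept apart even though both use the same element name.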

A more complex Grid object

<dataset>
    <dimension name="pt" size="4096"/>

    <grid>
        <map name="longitude" dim="pt" type="Float32"/>
        <map name="latitude" dim="pt" type="Float32"/>
        <map name="altitude" dim="pt" type="Float32"/>
        <map name="time" dim="pt" type="Float32">
            << attributes >> <!-- The syntax for attributes is in flux -->
        </map>

        <array name="Radioactivity">
            << attributes >> <!-- for example, scale_factor and add_offset -->
            <Byte/>
            <map name="longitude"/>
            <map name="latitude"/>
            <map name="altitude"/>
            <map name="time"/>
        </array>

        <array name="surface_temp">
             << attributes >>
             <float64/>
             <map name="longitude"/>
             <map name="latitude"/>
             <map name="time"/>
        </array>
    </grid>
</dataset>

An example Grid with Maps that are not vectors

<dataset>
    <dimension name="x" size="4096"/>
    <dimension name="y" size="4096"/>

    <grid name="SST_Swath">
        <!-- We could list multiple dims in a space-separated list
             but purists will gag. I'm experimenting with different 
             syntaxes -->
        <map name="longitude" type="Float32">
            <dim name="x"/>
            <dim name="y"/>
        </map>
        <map name="latitude" type="Float32">
            <dim name="x"/>
            <dim name="y"/>
        </map>

        <!-- This grid has two maps, each of which is a two-dimensional
             array. It can be used to store satellite 'swath' data. -->
        <array name="SST">
            << attributes >> <!-- for example, scale_factor and add_offset -->
            <Byte/>
            <map name="longitude"/>
            <map name="latitude"/>
        </array>
    </grid>
</dataset>

Note:

The dimensionality of any of the Grid's Maps cannot exceed the dimensionality of the Grid's Array.
When using the [] operator on a Grid in a DAP Constraint expression, the arguments enclosed in the square brackets correspond to the dimensions declared in the Map and not the Maps themselves. Thus a CE like SST_Swath[10:20][40:50] means that the array SST_Swath.SST and the maps SST_Swath.longitude and SST_Swath.latitude will all be returned sub-sampled to elements 10 to 20 in their first dimension and 40 to 50 in their second. In a DAP2 grid where all of the maps are vectors, there is a one-to-one correspondence between the [] operators and Maps, but in a DAP4 Grid there is a one-to-one correspondence between the [] operators and dimensions.
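
The effect of a constraint such as SST_Swath[10:20][40:50] can be sketched with plain nested lists. This is an illustrative Python sketch under stated assumptions: ranges are treated as inclusive (start through stop), the dimensions are shrunk to a hypothetical 64 by 64 instead of 4096 by 4096 to keep the example small, the data values are made up, and the helper names subsample and shape are not part of DAP.

```python
def subsample(values, ranges):
    """Apply one inclusive (start, stop) range per dimension to nested lists."""
    if not ranges:
        return values
    (start, stop), rest = ranges[0], ranges[1:]
    return [subsample(row, rest) for row in values[start:stop + 1]]


def shape(values):
    """Return the shape of a non-empty rectangular nested list."""
    dims = []
    while isinstance(values, list):
        dims.append(len(values))
        values = values[0]
    return tuple(dims)


# Hypothetical, reduced sizes for dimensions x and y.
NX, NY = 64, 64
sst_swath = {
    "SST": [[0 for _ in range(NY)] for _ in range(NX)],
    "longitude": [[float(j) for j in range(NY)] for _ in range(NX)],
    "latitude": [[float(i) for _ in range(NY)] for i in range(NX)],
}

# One range per dimension (x, then y), not one per Map.
ce = [(10, 20), (40, 50)]
result = {name: subsample(values, ce) for name, values in sst_swath.items()}

for name, values in result.items():
    print(name, shape(values))
```

The same pair of ranges is applied to the array and to both Maps because all three are declared over the dimensions x and y; that is the one-to-one correspondence between [] operators and dimensions.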

Types

Add a section for type definitions (technically, these are type equivalents in the sense of C's typedef) and allow those type equivalents to be used interchangeably within the existing DDX notation. This would provide a way for a dataset stored using HDF5 to be presented without losing any type definitions it uses. At the same time, the existing notation, which does not require that types be defined, would remain usable in cases where none were defined (e.g., an HDF4 data file cannot contain type definitions).

The syntax for a type definition

<typedef name="point">
    << attributes >>
    <structure name="point">
        <int32 name="x"/>
        <int32 name="y"/>
    </structure>
</typedef>

Notes:

The contents of the <typedef> element are a DDX type
The enclosed type and the name become synonymous.
The enclosed type can include attributes.
The type definition itself can include (optional) attributes.

Use of a type definition, assuming the typedef above

<variable name="points" type="point">
    <dim size="100"/>
</variable>

or

<variable name="points">
    <type name="point"/>
    <dim size="100"/>
</variable>

Notes:

The XML elements <variable ...> and/or <type ..> are used because the typedef name can be any legal XML attribute value (a requirement because names in HDF5 are essentially not restricted to a particular character set) and thus cannot be an XML element name.
This example uses the new (proposed) syntax for array declarations.
Right now instead of a <variable ...> element in the DDX we have a collection of things like <Grid...>, <Array...>, but we cannot reliably introduce arbitrary element names because of the character restrictions XML places on them. So I'm suggesting that we change the DDX to use the <variable...> element everywhere and adopt the notion that a variable is a scalar or a vector/array depending on whether the element includes one or more <dim ...> elements.
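
How a client might resolve a typedef name and tell a scalar from an array under the proposed <variable> syntax can be sketched as follows. This is an illustrative Python sketch, not a DAP implementation: the wrapping dataset element, the scalar variable named origin and the function name describe_variables are hypothetical, the attributes placeholder is omitted, and each typedef is assumed to hold exactly one type element.

```python
import xml.etree.ElementTree as ET

DDX = """
<dataset>
  <typedef name="point">
    <structure name="point">
      <int32 name="x"/>
      <int32 name="y"/>
    </structure>
  </typedef>
  <variable name="points" type="point">
    <dim size="100"/>
  </variable>
  <variable name="origin">
    <type name="point"/>
  </variable>
</dataset>
"""


def describe_variables(dataset):
    """Return {variable name: (type name, field names, shape)}."""
    # A typedef's single child element is the DDX type that it names.
    typedefs = {t.get("name"): t[0] for t in dataset.findall("typedef")}
    described = {}
    for var in dataset.findall("variable"):
        # Accept both proposed spellings: a type attribute or a <type> child.
        type_name = var.get("type")
        if type_name is None:
            type_name = var.find("type").get("name")
        fields = [field.get("name") for field in typedefs[type_name]]
        # No <dim> elements means a scalar, so the shape is empty.
        shape = tuple(int(d.get("size")) for d in var.findall("dim"))
        described[var.get("name")] = (type_name, fields, shape)
    return described


described = describe_variables(ET.fromstring(DDX))
print(described)
# {'points': ('point', ['x', 'y'], (100,)), 'origin': ('point', ['x', 'y'], ())}
```

Because the typedef name is looked up as an attribute value rather than as an element name, it could contain characters that XML element names do not allow.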


@@ Line 9: / Line 9: @@
 ;Coordinate variable
-: A name bound to both a size and a datatype, e.g.,"height" is a vector of ten 32-bit floating point numbers, or "latitude" is an array of 1024 by 1024 32-bit floating point numbers.
+: A name bound to both a dimension and a datatype, e.g.,"height" is a vector of dimension "height" 32-bit floating point numbers, or "latitude" is an array of dimension "x" by dimension "y" 32-bit floating point numbers.
 ;Grid
@@ Line 23: / Line 23: @@
 Dataset :== Groups
 Groups :== null | Group Groups
-Group :== Types Dimensions Attributes Variables
+Group :== Types Dimensions Attributes Variables Groups
 Types :== null | Type Types
 Dimensions :== null | Dimension Dimensions
@@ Line 90: / Line 90: @@
 == Shared Dimensions ==
-Shared dimensions will be added to DAP so that special variables used to identify the independent parameters in non-scalar variables can be clearly labeled as such. When a dimension is named, each use of that dimension means ''the same values''. Each dimension will have both a name and a size in addition to a type. The moniker ''Shared Dimension'' is really redundant. Any ''Dimension'' in DAP 3.3 can be shared. There is no requirement that a dimension be used; it can be declared and never used.
+=== Background on shared dimensions and coordinate variables ===
-[[#Using Grids|Here's an alternative representation using the Grid datatype]]
+From an email exchange, John Caron wrote:
-=== Dimension examples ===
+James:
+<blockquote>Is it that an dimension is a formal declaration of an independent parameter?</blockquote>
-Declaring dimensions in the DDX:
+John:
+<blockquote><p>I know that some people prefer that interpretation. My own opinion is that's it more complicated.
+Abstractly, I think its reasonable to say that the number of dimensions of a variable indicates its
+dimensionality in the topological sense. I think its necessary to allow "independent variables" to
+have topological dimensionality > 1. eg lat(x,y), lon(x,y). lat and lon can still be considered
+independent variables, but they are not orthogonal. Neither is associated exclusively with one
+dimension.</p>
-<pre>
+<p>Concretely, dimensions are used for all sorts of reasons, and are not just about topological
-<Dataset name="dimension_ex_1" ...>
+dimensionality. For instance, they control the grouping of data and the layout of files. So in real
-    <!-- The 'dimensions' section must come first if present -->
+files, you see this mixture of uses.</p>
-    <dimensions>
-        <!-- Dimensions are declared like an array except that the
+<p>That's why the explicit assignment of coord variables is needed, which makes your Grid attractive,
-             'dimension' element is replaced by an 'extent' element
+because that's a way of explicitly saying what the independent variables are. One needs shared
-             and the extent has no name (since the dimension itself
+dimensions between data and coordinate variables, so that one can unambiguously assign coordinate
-             is named -->
+values to a data value.</p>
-        <Int32 name="latitude">
-            <extent size="1024">
+<p>The downsides of using Grid for this purpose:
-        </Int32>
+<ul>
-        <String name="color">
+<li>the name "Grid" connotes gridded data, eg model data, and this shared dimension thing is needed
-            <extent size="3">
+for other types of data, eg point data.</li>
-        </String>
+<li>If Grid scopes the dimension, then all variables sharing a dimension have to be contained in the
-    </dimensions>
+grid. So its impossible to have some dimensions globally shared, and others locally shared.</li>
+</ul></p>
+<p>So my preference would be to use Groups to scope shared dimensions, rather than Grids. But still use
+Grids (or some evolution of Grids) to assign coordinate variables to data variables.</p>
+</blockquote>
+=== Dimensions ===
-    <!-- remainder of the document -->
+Shared dimensions will be added to DAP in the ''dimensions'' section of the ''Dataset'' or ''Group'' objects. Each dimension will consist of a name and a size.
-</Dataset>
+<pre>
+<dimension name="lat" size="1024"/>
+<dimension name="lon" size="1024"/>
 </pre>
-Using dimensions in the DDX. In DAP, dimensions can only be used in a Grid variable;
+Characteristics of dimensions:
+* Dimensions are not associated with a data type.
+* Dimensions do not have attributes.
+* Dimensions bound to a type define coordinate variables.
+* Shared dimensions may be used by both Grids and Arrays.
+== Coordinate variables and Grids ==
+While dimensions are scoped at the Dataset or Group level, coordinate variables are defined at the level of a Grid object. Grid objects in DAP4 are different from those in DAP2 in three ways beyond using (shared) dimensions:
+# Each Grid object may hold more than one ''Array'' (what is often a dependent variable);
+# Maps (often independent variables) may have more than one dimension; and
+# Each Array within a Grid is not constrained to use all of the Grid's coordinate variables.
+N.B: ''Coordinate variables'' in a Grid object are called ''Maps'' to conform to the old nomenclature and to avoid (re)using the word ''dimension''.
+Features of the DAP4 and DAP2 Grid object:
+# Each Grid object defines a lexical scope.
+# There is an explicit relation between the Grid object's maps (coordinate variables) and the indicial extents of the array.
+=== A very simple Grid object ===
 <pre>
-<Dataset name="dimension_ex_1" ...>
+<grid>
-     <dimensions>
+     <map name="lon" dim="lon" type="Float32"/>
-        <Int32 name="latitude">
+    <map name="lat" dim="lat" type="Float32"/>
-            <dimension size="1024">
-        </Int32>
-        <Int32 name="longitude">
-            <dimension size="1024">
-        </Int32>
-    </dimensions>
-     <!-- The two declarations that follow are effectively Grids,
+     <array name="SST">
-         but now share the same dimentions -->
+         <Byte/>
-    <Byte name="uwnd">
+         <map name="lon">
-         <!-- note that only the name is used, not the size -->
+         <map name="lat">
-         <map name="latitude">
+     </array>
-         <map name="longitude">
+</grid>
-     </Byte>
+</pre>
-    <Byte name="vwnd">
+Notes:
-        <map name="latitude">
+# The ''map'' object may have the same name as a ''dimension'' object.
-        <map name="longitude">
+# Map objects may have attributes, even though they are not shown in the example.
-    </Byte>
+# In an Grid's ''array'' object, ''<map...>'' elements are used to specify the array's dimensions; the word ''dimension'' is avoided to cut down on confusion.
-     ...
+=== A more complex Grid object ===
+<pre>
+<dataset>
+     <dimension name="pt" size="4096">
+    <grid>
+        <map name="longitude" dim="pt" type="Float32"/>
+        <map name="latitude" dim="pt" type="Float32"/>
+        <map name="altitude" dim="pt" type="Float32"/>
+        <map name="time" dim="pt" type="Float32">
+            << attributes >> <!-- The syntax for attributes is in flux -->
+        </map>
+        <array name="Radioactivity">
+            << attributes >> <!-- for example, scale_factor and add_offset -->
+            <Byte/>
+            <map name="longitude"/>
+            <map name="latitude"/>
+            <map name="altitude"/>
+            <map name="time"/>
+        </array>
-</Dataset>
+        <array name="surface_temp">
+             << attributes >>
+             <float64/>
+             <map name="longitude"/>
+             <map name="latitude"/>
+             <map name="time"/>
+        </array>
+    </grid>
+</dataset>
 </pre>
-Here's an example that shows how Groups might be used to deal with the case where there are two different dimensions called 'longitude' and 'latitude'. Note that you cannot use ''Structure'' to do this because a Structure cannot be used to introduce dimensions.
+=== An example Grid with Maps that are not vectors ===
 <pre>
-<Dataset ...>
+<dataset>
-     <Group name="combined">
+     <dimension name="x" size="4096">
-        <dimensions>
+    <dimension name="y" size="4096">
-            <Int32 name="latitude">
-                <dimension size="1024">
-            </Int32>
-            ...
-        </dimensions>
-         <Byte name="uwnd">
+    <grid name="SST_Swath">
-             <map name="latitude">
+         <!-- We could list multiple dims in a space-separated list
-             ...
+             but purists will gag. I'm experimenting with different
-        </Byte>
+             syntaxes -->
-    </Group>
+        <map name="longitude" type="Float32"/>
+            <dim name="x"/>
+             <dim name="y"/>
+        </map>
+        <map name="latitude" type="Float32"/>
+             <dim name="x"/>
+            <dim name="y"/>
+        </map>
-    <Group name="raw">
+        <!-- This grid has two maps, each of which are two-dimensional
-        <dimensions>
+             arrays. It can be used to store satellite 'swath' data. -->
-             <Int32 name="latitude">
+        <array name="SST">
-                <dimension size="2048">
+            << attributes >> <!-- for example, scale_factor and add_offset -->
-            </Int32>
+             <Byte/>
-            ...
+            <map name="longitude"/>
-        </dimensions>
+            <map name="latitude"/>
+        </array>
+    </grid>
+</dataset>
-        <Byte name="uwnd">
-            <map name="latitude">
-            ...
-        </Byte>
-    </Group>
-</Dataset>
 </pre>
-=== Using Grids ===
+Note:
+# The highest dimension of the Grid's Maps cannot exceed the dimensionality of the Grid's Array.
+# When using the ''[]'' operator on a Grid in a DAP Constraint expression, the arguments enclosed in the square brackets correspond to the ''dimensions'' declared in the Map and not the Maps themselves. Thus a CE like ''SST_Swath[10:20][40:50]'' means that the array ''SST_Swath.SST'' and the maps ''SST_Swath.longitude'' and ''SST_Swath.latitude'' will all be returned sub-sampled to elements 10 to 20 in their first dimension and 40 to 50 in their second. '''In a DAP2 grid where all of the maps are vectors, there is a one-to-one correspondence between the ''[]'' operators and Maps, but in a DAP4 Grid there is a one-to-one correspondence between the ''[]'' operators and ''dimensions''.'''
+== Types ==
+Add a section for type definitions (technically, these are ''type equivalents'' in the sense of C's ''typedef'') and allow those type equivalents to be used interchangeably within the existing DDX notation. This would provide a way for a dataset stored using HDF5 to be presented without loosing any type definitions it uses. At the same time this would allow the existing notation which does not require types be defined in cases where none were (e.g., a HDF 4 data file cannot contain type definitions).
-This is an alternative representation for dimensions using the Grid data type.
+The syntax for a type definition
-Dimensions can be introduced at the start of a Group (including the start of the implicit Group that exists at the Dataset level for a Dataset that does not declare any explicit Groups). This provides a way for dimensions to be shared between several different data arrays and these arrays are effectively a DAP 2 Grid variable. An alternative is to use the Grid type and equate its Maps with dimensions and allow for one or more arrays within the Grid. The change to Grid is to allow more than array in the ''Array:'' section. In addition, to capture the semantics of ''shared dimensions'', the arrays within a grid would be freed from the restriction that all declared Maps be used by each array and that each dimension of every array be a Map.
+<pre>
+<typedef name="point">
+    << attributes >>
+    <structure name="point">
+        <int32 name="x"/>
+        <int32 name="y"/>
+    </structure>
+</typedef>
+</pre>
-The advantage of this design is that it does not tempt server/handler writers to use Groups to build lexical scopes since that are some data files (HDF5 only for now, but soon NetCDF4) which use the Group. If we modify DAP to encode those HDF5/NetCDF4 Groups as DAP Groups and use Grid in this way, then on the client side there is a better chance that data sets will have better fidelity - we can expect to save the result of a request as a HDF5 or NetCDF4 file and get closer to the original data set if its storage format was either HDF5 of NetCDF4.
+Notes:
+# The contents of the ''<typedef>'' element are a DDX type
+# The enclosed type and the ''name'' become synonymous.
+# The enclosed type can include attributes.
+# The type definition itself can include (optional) attributes.
-A disadvantage is that the token ''Grid'' has one meaning in DAP2, ..., 3.2 and a different meaning in 3.3, ..., 4. This is likely not so great because the changes we are contemplating are fairly great so a parser and processing software are going to have to either recognize and work with both versions or reject the version they do not recognize.
+Use of a type definition, assuming the typedef above
-Examples:
 <pre>
-<Grid name="Data">
+<variable name="points" type="point">
-    <map name="longitude">
+    <dim size="100>
-        <dimension size="1024">
+</variable>
-    </map>
+</pre>
-    <map name="latitude">
-        <dimension size="1024">
-    </map>
-    <map name="height">
-        <dimension size="7">
-    <map>
-    <Byte name="AIRT">
+or
-        <map name="longitude">
-        <map name="latitude">
-        <map name="height">
-    </Byte>
-    <!-- This array doesn't use all the maps -->
-    <Byte name="SST">
-        <map name="longitude">
-        <map name="latitude">
-    </Byte>
-     ...
+<pre>
-</Grid>
+<variable name="points">
+    <type name="point"/>
+     <dim size="100"/>
+</variable>
 </pre>
-== Types ==
+Notes:
+# The XML elements ''<variable ...>'' and/or ''<type ..>'' are used because the typedef name can be any legal XML attribute value (a requirement because names in HDF5 are essentially not restricted to a particular character set) and thus cannot be an XML element name.
+# This example uses the new (proposed) syntax for array declarations.
+# Right now instead of a ''<variable ...>'' element in the DDX we have a collection of things like ''<Grid...>'', ''<Array...>'' but we cannot reliably introduce arbitrary element names because of the character restrictions XML places on them. So I'm suggesting that we change the DDX to use the ''<variable...>'' element everywhere. And adopt the notion that A ''variable'' is scalar or vector/array depending on whether the element includes one or more ''<dim ...>'' elements.