DAP4: DAP4 Grids Proposal

From OPeNDAP Documentation

Revision as of 01:02, 6 March 2012

Grids Delenda Est

(with apologies to Cato the Elder)

The grid construct as originally established in the DAP2 protocol has been a source of problems from its inception. The evolution of the notion of coordinate variables makes its use in its current form (or even closely similar forms) untenable.

1. Problem: Grid as scoping/lexical container

Because a Grid is a lexical scope that contains its coordinate variables, those variables cannot be properly shared between Grids without duplication, which is highly undesirable.

Consider the following situation.

    Arrays: D1(x,y), D2(y,z), D3(x,z).
    coord vars: x(x), y(y), z(z)

No grid, as currently defined, can represent this, because the three coordinate variables x(x), y(y), and z(z) cannot be properly distributed across the three needed grids without duplication. The only way this can work is if all the Arrays and all the coordinate variables reside in a single grid; not, I maintain, a useful solution. Further, the Grid must change if new arrays are defined that use any of the coordinate variables, D4(x,w), for example.
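To make the duplication concrete, here is a minimal sketch that counts how many copies of each coordinate variable exist if every array is wrapped in its own Grid and each Grid must contain its own maps. The dictionary layout and the function name `maps_stored_as_containers` are hypothetical, chosen only for illustration; nothing here is DAP2 API.

```python
# Hypothetical model: each array becomes a Grid that contains its own maps.
arrays = {
    "D1": ("x", "y"),
    "D2": ("y", "z"),
    "D3": ("x", "z"),
}

def maps_stored_as_containers(arrays):
    """Count coordinate-variable copies when every Grid holds its own maps."""
    copies = {}
    for dims in arrays.values():
        for dim in dims:
            copies[dim] = copies.get(dim, 0) + 1
    return copies

copies = maps_stored_as_containers(arrays)
print(copies)                # {'x': 2, 'y': 2, 'z': 2}
print(sum(copies.values()))  # 6 stored maps for only 3 distinct coordinate variables
```

Each of x, y, and z ends up stored twice, and adding D4(x,w) would add a third copy of x.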

2. Problem: Grid projections

When a projection is applied to a grid, the result cannot be a grid. This has been an ongoing source of problems in DAP2, where projecting the array component of a grid results in a structure. From the point of view of semantics, this is a really bad idea.

3. Problem: Multi-dimensional coordinate variables

When representing point data, it is desirable to have coordinate variables distinguished using more than a single dimension. Consider the following:

    array: temp(x,y,z)
    coordinate vars: lat(x,y,z), lon(x,y,z), and depth(x,y,z).

Here we are trying to represent point data where each point is defined by three dimensions: lat, lon, and depth. Grids are not capable of properly representing this case. I should note that neither, for example, is netcdf-3 or netcdf-4. CDM can do it, but only by encoding the proper relationships as attributes with complex internal structure.

4. Problem: Coordinate Variable Duplication

In examining a large number of DAP2 DDS's, I note that coordinate variables inside grids are almost always duplicated outside the grid. My hypothesis is that this is a result of problem (1) above. In any case, the proposal below would obviate the need for duplication.

Proposal: Grid as mapping

NB: The current Data Model page in the straw man design already does this. Unfortunately, I used XML for the examples (which muddies the idea of an abstract information model with one particular representation of that model) but I think what is presented there is the same as this proposal. Jimg

Rather than making grids be scope containers, grids need to be simple relationship instances between an array and its coordinate variables. This would be done by associating the coordinate variables with an array variable. For example, the first case above (D1,D2,D3) might be represented as:

<variable name="D1"...>
   <map coordinate="x"/>
   <map coordinate="y"/>
</variable>
...

The case of point data would be represented as follows:

<variable name="temp"...>
   <map coordinate="lat"/>
   <map coordinate="lon"/>
   <map coordinate="depth"/>
</variable>

Note that the dimensions can be inferred from the specified coordinate variables.
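One possible reading of this inference, sketched in Python: take the ordered union of the dimensions used by an array's maps. The proposal does not say how the inference would actually work, so the rule, the dictionary layout, and the name `infer_dimensions` are all assumptions made for illustration.

```python
# Hypothetical declarations: each coordinate variable lists its own dimensions.
coord_dims = {
    "x": ("x",), "y": ("y",), "z": ("z",),
    "lat": ("x", "y", "z"),
    "lon": ("x", "y", "z"),
    "depth": ("x", "y", "z"),
}

# Each array names only its maps, as in the <map .../> elements above.
array_maps = {
    "D1": ("x", "y"),
    "temp": ("lat", "lon", "depth"),
}

def infer_dimensions(maps, coord_dims):
    """Ordered union of the dimensions used by the given maps (an assumed rule)."""
    dims = []
    for map_name in maps:
        for dim in coord_dims[map_name]:
            if dim not in dims:
                dims.append(dim)
    return tuple(dims)

inferred = {name: infer_dimensions(maps, coord_dims)
            for name, maps in array_maps.items()}
print(inferred)  # {'D1': ('x', 'y'), 'temp': ('x', 'y', 'z')}
```

Note that this rule recovers the set of dimensions but not necessarily their order in the array, so an actual design would probably still need the array's dimension order stated explicitly.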


-Dennis Heimbigner

Discussion

1) The CDM uses this object model for coordinate systems:

 CDM CoordSys

When translating things like GRIB into CDM, we usually also add the CF attributes, which simplifies things since now the coordsys info is encoded at the data access layer. This is very simple, in CDL:

float Temp(z,y,x);
 :coordinates = "lat, lon, depth";

2) It appears that a Variable that contains map elements is a "grid", and that when you make a data request for a grid, you get back the corresponding values of the maps. Correct?

One problem with that is when you have 2D coordinates, as in:


float lon(y,x);
float lat(y,x);
float Temp(z,y,x);
 :coordinates = "lat, lon, depth";

then you get back 3X more data, which you may not want.

--JohnCaron 15:29, 2 March 2012 (PST)

Basic features of Grids in DAP4

We're still discussing just how constraining "grids" works. I think that the model we choose needs to support:

  • N-dimensional coordinate variables (aka maps)
  • Shared dimensions
  • Subsetting that returns a valid grid
  • Subsetting that returns parts that make up the grid

I think optimizing transfers should be secondary to proper semantics.

NB: This is already present in the DAP4: Data Model

Jimg 17:11, 2 March 2012 (PST); Updated: Jimg

Regarding Grids and Subsetting

I'm going to focus on the problems/issues associated with Grid subsetting. I think we need to focus on this because it's such an important feature of DAP - the ability to subset Grids.

Suppose we have two Grids G1 and G2 that share two maps M1 and M2. If we subsample G1 and G2 using two different intervals over M1 and M2, then we have a problem because we need one set of M1 and M2 for G1 and a different set of M1 and M2 for G2. This only gets more complex when shared dimensions (aka Dimensions from now on) are added into the mix; however, any solution that solves the problem for shared coordinate variables (aka Maps) does so for Dimensions if there is a one-to-one relationship between Maps and Dimensions for any set of Grids that are subset.
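The clash can be shown with a small sketch. The interval values and the function name `intervals_needed_per_map` are hypothetical, and intervals are written as simple (start, stop) index pairs purely for illustration.

```python
# Hypothetical request: G1 and G2 share M1 and M2 but ask for different intervals.
requests = {
    "G1": {"M1": (0, 5), "M2": (0, 10)},
    "G2": {"M1": (2, 8), "M2": (5, 15)},
}

def intervals_needed_per_map(requests):
    """Collect the distinct intervals each shared Map must appear with in one response."""
    needed = {}
    for intervals in requests.values():
        for map_name, interval in intervals.items():
            needed.setdefault(map_name, set()).add(interval)
    return needed

needed = intervals_needed_per_map(requests)
conflicts = {name: sorted(ivs) for name, ivs in needed.items() if len(ivs) > 1}
print(conflicts)  # {'M1': [(0, 5), (2, 8)], 'M2': [(0, 10), (5, 15)]}
```

Both M1 and M2 would have to appear twice in a single response under the same name, which is exactly the problem the candidate solutions below try to address.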

Potential solutions:

Add the enclosing Lexical Scope back into the Data Model

We can go back to the old data model where each Grid (G1 and G2, e.g.) is wrapped in its own scope. Problems with this solution:

  1. The scope is 'fictitious' in that while it is part of the DAP representation for the dataset, it is not present in the original;
  2. The extra scope has semantics associated with it that 'fail' when some subsetting operations in DAP2 are applied (e.g., when just the 'array part' was projected but not the 'maps'), so we'll have to address those issues; and
  3. A side effect of this data model is that the storage/transmission size increases. I don't feel optimizing for transmission size is as important as a flexible data model, but we should obviously not transmit excess information.

Change the way Grid subsetting works

We can change the way Grid subsetting works. Grids are defined as a kind of relational type, where the indices of the Maps and Arrays function like foreign keys, but that does not mean that we have to return the Maps with each Grid subset. When a client subsets the two grids (G1 and G2), it gets back only the Arrays that hold their data. This introduces some problems/issues:

  1. The 'Grid-ness' of G1 and G2 has been lost
  2. Clients must either make a coordinated request for the Maps M1 and M2, in which case we must solve the question of how to make two or more different requests for the same variable (cf. namespace issues), or...
  3. Clients must make two or more requests, one for each Grid, and each of those must still contain requests for several variables to be really useful, and...
  4. How does the client find out about the Dimensions?

Restrict the kinds of subsetting allowed

We can limit the ways in which a Grid can be subset. This might be what Dennis proposes in ... but I'm not sure. In this scheme, for any given request, a set of Grids like G1 and G2 that share maps could only be subset using one interval over M1 and M2. If a client attempts to use two or more intervals, it gets an error. This introduces some problems, too:

  1. If a client wants to subset over different intervals of the maps, it must make several requests
  2. Since clients get back 'consistent' responses, a client might, as John says, get back more data than it wants.

Slightly more formal statement of the third solution

I thought a more formal version of this might be useful in ferreting out any more problems besides the two already mentioned (which are repeated below).

Definitions:

Grid
A Grid is an Array variable that has, in addition to its dimensions, which define its rank and extent, a set of maps that provide coordinate space values for the data stored in the Array.
Map
A Map is an Array that uses one or more Dimensions to define a coordinate space and is used by one or more Grids. NB: A plain Array can use Dimensions, but it's not a Map if it is not used by a Grid as such.
Dimension
A Dimension is the binding of a name to a size and is used to define the rank and extent of an Array.

  • Grid subsetting can only take place using intervals defined over its maps.
  • When two or more Grids share maps, any given subsetting request can subset one or more of those Grids, but must use the same map interval for them all.
  • The result of a Grid subsetting operation includes the tuple of Dimension(s), Map(s) and Array(s) that make up the subset Grids.
  • It is possible to subset the 'Array part' of a Grid and get back just the Array.
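The second rule above (one interval per shared Map within a request) could be checked along these lines. This is only a sketch: the request layout, the exception `InconsistentIntervalError`, and the function `check_request` are hypothetical names, not part of any DAP implementation.

```python
class InconsistentIntervalError(ValueError):
    """Raised when Grids sharing a Map request different intervals over it."""

def check_request(request):
    """Validate one subsetting request.

    `request` maps each Grid name to {Map name: (start, stop)}.
    Returns the single interval used for each Map, or raises if two
    Grids in the same request disagree about a shared Map.
    """
    chosen = {}
    for grid, intervals in request.items():
        for map_name, interval in intervals.items():
            previous = chosen.setdefault(map_name, interval)
            if previous != interval:
                raise InconsistentIntervalError(
                    f"{grid} asks for {map_name}{interval}, "
                    f"but another Grid in this request uses {previous}"
                )
    return chosen

# A consistent request: G1 and G2 use the same interval over the shared Map M1.
ok = check_request({"G1": {"M1": (0, 5)}, "G2": {"M1": (0, 5), "M2": (1, 3)}})
print(ok)  # {'M1': (0, 5), 'M2': (1, 3)}
```

A request in which G1 and G2 use different intervals over M1 would raise the error, and the client would then have to split it into several requests.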

Problems/issues:

  1. If a client wants to subset over different intervals of the maps, it must make several requests
  2. Since clients get back 'consistent' responses, a client might, as John says, get back more data than it wants.