DAP4: DAP4 Grids Proposal: Difference between revisions

From OPeNDAP Documentation

Revision as of 23:34, 5 March 2012

Grids Delenda Est

(with apologies to Cato the Elder)

The grid construct as originally established in the DAP2 protocol has been a source of problems from its inception. The evolution of the notion of coordinate variables makes its use in its current form (or even closely similar forms) untenable.

1. Problem: Grid as scoping/lexical container

Because a grid is a scoping (lexical) container, coordinate variables cannot be properly shared between grids without duplication, which is highly undesirable.

Consider the following situation.

    Arrays: D1(x,y), D2(y,z), D3(x,z).
    coord vars: x(x), y(y), z(z)

No grid, as currently defined, can represent this, because the three coordinate variables x(x), y(y), and z(z) cannot be properly distributed across the three grids that would be needed without duplication. The only way this can work is if all the Arrays and all the coordinate variables reside in a single grid, which is not, I maintain, a useful solution. Further, the Grid must change whenever new arrays are defined that use any of the coordinate variables, D4(x,w), for example.
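The duplication can be made concrete with a small sketch. It assumes a container model in which every grid carries its own copy of each coordinate variable it uses; the dictionary layout and names below are hypothetical and are not part of DAP2 or DAP4.

```python
# Hypothetical illustration: each array and the coordinate variables it uses.
arrays = {
    "D1": ("x", "y"),
    "D2": ("y", "z"),
    "D3": ("x", "z"),
}

# Container model (assumed): one grid per array, each holding its own maps.
grids = {name: {"array": name, "maps": list(dims)} for name, dims in arrays.items()}

# Count how many grids hold a copy of each coordinate variable.
copies = {}
for grid in grids.values():
    for coord in grid["maps"]:
        copies[coord] = copies.get(coord, 0) + 1

print(copies)  # {'x': 2, 'y': 2, 'z': 2}
```

Every coordinate variable ends up stored twice, which is the duplication described above.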

2. Problem: Grid projections

When a projection is applied to a grid, the result cannot be a grid. This has been an ongoing source of problems in DAP2 where projecting the array component of a grid results in a structure. From the point of view of semantics, this is a really bad idea.

3. Problem: Multi-dimensional coordinate variables

When representing point data, it is desirable to have coordinate variables distinguished using more than a single dimension. Consider the following:

    array: temp(x,y,z)
    coordinate vars: lat(x,y,z), lon(x,y,z), and depth(x,y,z).

Here we are trying to represent point data where each point is defined by three dimensions: lat, lon, and depth. Grids are not capable of properly representing this case. I should note that neither are, for example, netcdf-3 or netcdf-4. CDM can do it, but only by encoding the proper relationships as attributes with complex internal structure.

4. Problem: Coordinate Variable Duplication

In examining a large number of DAP2 DDSs, I note that coordinate variables inside grids are almost always duplicated outside the grid. My hypothesis is that this is a result of problem (1) above. In any case, the proposal below would obviate the need for duplication.

Proposal: Grid as mapping

Rather than making grids be scope containers, grids need to be simple relationship instances between an array and its coordinate variables. This would be done by associating the coordinate variables with an array variable. For example, the first case above (D1,D2,D3) might be represented as:

<variable name="D1"...>
   <map coordinate="x"/>
   <map coordinate="y"/>
</variable>
...

The case of point data would be represented as follows:

<variable name="temp"...>
   <map coordinate="lat"/>
   <map coordinate="lon"/>
   <map coordinate="depth"/>
</variable>

Note that the dimensions can be inferred from the specified coordinate variables.


-Dennis Heimbigner
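The remark in the proposal above that dimensions can be inferred from the coordinate variables can be sketched as follows. This is only one possible reading: the `Variable` class, the `infer_dims` helper, and the rule used (the union of the maps' dimensions, in first-seen order) are assumptions made for illustration, not something the proposal specifies.

```python
from dataclasses import dataclass, field


@dataclass
class Variable:
    name: str
    dims: tuple = ()                          # declared dimensions (coordinate variables)
    maps: list = field(default_factory=list)  # names of associated coordinate variables


def infer_dims(var, table):
    """Assumed rule: union of the maps' dimensions, in first-seen order."""
    dims = []
    for map_name in var.maps:
        for dim in table[map_name].dims:
            if dim not in dims:
                dims.append(dim)
    return tuple(dims)


table = {v.name: v for v in [
    Variable("x", dims=("x",)),
    Variable("y", dims=("y",)),
    Variable("lat", dims=("x", "y", "z")),
    Variable("lon", dims=("x", "y", "z")),
    Variable("depth", dims=("x", "y", "z")),
    Variable("D1", maps=["x", "y"]),
    Variable("temp", maps=["lat", "lon", "depth"]),
]}

print(infer_dims(table["D1"], table))    # ('x', 'y')
print(infer_dims(table["temp"], table))  # ('x', 'y', 'z')
```

Note that coordinate variables are stored once in `table` and only referenced by name from the arrays, so no copy is needed when another array uses the same map.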

Discussion

1) The CDM uses this object model for coordinate systems:

 [Figure: CDM CoordSys]

When translating things like GRIB into CDM, we usually also add the CF attributes, which simplifies things since now the coordsys info is encoded at the data access layer. This is very simple, in CDL:

float Temp(z,y,x);
 :coordinates = "lat, lon, depth";

2) It appears that a Variable that contains map elements is a "grid", and that when you make a data request for a grid, you get back the corresponding values of the maps. Correct?

One problem with that is when you have 2D coordinates, as in:


float lon(y,x);
float lat(y,x);
float Temp(z,y,x);
 :coordinates = "lat, lon, depth";

then you get back 3X more data, which you may not want.

--JohnCaron 15:29, 2 March 2012 (PST)
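One way to read the "3X" figure in point 2 above is a request for a single vertical level of Temp, with the 2D lat and lon maps returned alongside it. The sketch below checks that arithmetic. The grid sizes are hypothetical, the single-level request is an assumption about what was meant, and any 1-D depth coordinate is ignored as negligible.

```python
def cells(*shape):
    """Number of values in an array with the given shape."""
    n = 1
    for size in shape:
        n *= size
    return n


ny, nx = 100, 200              # hypothetical horizontal grid size

temp_slice = cells(1, ny, nx)  # one z level of Temp(z,y,x)
lat = cells(ny, nx)            # lat(y,x), returned as a map
lon = cells(ny, nx)            # lon(y,x), returned as a map

ratio = (temp_slice + lat + lon) / temp_slice
print(ratio)  # 3.0
```

If many levels are requested at once, the relative cost of the two 2D maps shrinks, so the overhead depends strongly on the shape of the subset.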

Basic features of Grids in DAP4

We're still discussing just how constraining "grids" works. I think that the model we choose needs to support:

  • N-dimensional coordinate variables (aka maps)
  • Shared dimensions
  • subsetting that returns a valid grid
  • subsetting that returns parts that make up the grid

I think optimizing transfers should be secondary to proper semantics.

NB: This is already present in the DAP4: Data Model

Jimg 17:11, 2 March 2012 (PST); Updated: Jimg

Regarding Grids and Subsetting

I'm going to focus on the problems/issues associated with Grid subsetting. I think we need to focus on this because it's such an important feature of DAP - the ability to subset Grids.

Suppose we have two Grids G1 and G2 that share two maps M1 and M2. If we subsample G1 and G2 using two different ranges of M1 and M2, then we have a problem: we need one set of M1 and M2 for G1 and a different set of M1 and M2 for G2. Since we no longer wrap the Grids G1 and G2 in their own lexical scope, that's cumbersome at best; some kind of scope has to be introduced into the result that was not previously present.
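A minimal sketch of the collision, assuming the response is a single flat namespace keyed by variable name. The map sizes, the ranges, and the `response` dictionary are hypothetical and do not represent the actual DAP4 response format.

```python
# Shared maps (hypothetical values).
M1 = list(range(10))
M2 = list(range(20))

# Each grid is subset with its own ranges of M1 and M2.
requests = {
    "G1": (slice(0, 5), slice(0, 10)),
    "G2": (slice(5, 10), slice(10, 20)),
}

response = {}   # flat namespace: one entry per variable name
conflicts = []  # (grid, map) pairs that could not be stored

for grid, (r1, r2) in requests.items():
    for name, values, rng in (("M1", M1, r1), ("M2", M2, r2)):
        subset = values[rng]
        if name in response and response[name] != subset:
            conflicts.append((grid, name))  # same name, different values
        else:
            response[name] = subset

print(conflicts)  # [('G2', 'M1'), ('G2', 'M2')]
```

The subsets of M1 and M2 needed for G2 have nowhere to go unless the response introduces some extra scope or renaming.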

Potential solutions:

Add the enclosing Lexical Scope back into the Data Model

We can go back to the old data model where each Grid (G1 and G2, e.g.) is wrapped in its own scope. This introduced several problems of its own:

  1. The scope was 'fictitious' in that while it was part of the DAP representation for the dataset, it was not present in the original;
  2. The extra scope had semantics associated with it that 'broke' when some subsetting operations were applied (e.g., when just the 'array part' was projected but not the 'maps')
  3. A side effect of this data model was that the storage/transmission size increased. Personally, while this was (is?) the most common complaint, it is not really such a big deal. With a handful of datasets, this really matters, but for most data I regard it as an optimization.

Change the way Grid subsetting works

We can change the way Grid subsetting works. Grids define a kind of relational type (the indices of the Maps and Arrays function like foreign keys), but that does not mean that we have to return the Maps with each Grid subset. When a client subsets the two grids (G1 and G2), it gets back only the Arrays that hold their data. This introduces some problems/issues:

  1. The 'Grid-ness' of G1 and G2 has been lost
  2. Clients must either make a coordinated request for the Maps M1 and M2, in which case we must solve the question of how to make two or more different requests for the same variable (cf. namespace issues), or...
  3. Clients must make two or more requests, one for each Grid and each of those must still contain requests for several variables to be really useful
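The last option can be sketched as one request per Grid, each listing the array and its maps with that Grid's ranges. The `build_requests` helper and the tuple-based request representation are hypothetical; no actual constraint-expression syntax is implied.

```python
def build_requests(subsets):
    """Build one request per grid: the array plus each map, using that grid's ranges."""
    requests = []
    for grid, ranges in subsets.items():
        projection = [(grid, tuple(ranges.values()))]            # the array itself
        projection += [(m, (rng,)) for m, rng in ranges.items()]  # each of its maps
        requests.append(projection)
    return requests


# Hypothetical (start, stop) index ranges per map for each grid.
subsets = {
    "G1": {"M1": (0, 4), "M2": (0, 9)},
    "G2": {"M1": (5, 9), "M2": (10, 19)},
}

for request in build_requests(subsets):
    print(request)
```

Because each request has its own response, M1 and M2 never collide, but the client pays for two round trips and still has to name three variables in each request.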