DAP4: DAP4 Grids Proposal

Latest revision as of 19:54, 31 August 2012


Background

The grid construct as originally established in the DAP2 protocol has been a source of problems from its inception. The evolution of the notion of coordinate variables makes its use in its current form (or even closely similar forms) untenable.

Problems Addressed

Grid as scoping/lexical container

Because a DAP2 grid acts as a lexical scope that contains its own coordinate variables, properly sharing coordinate variables among grids is not possible without duplication, which is highly undesirable.

Consider the following situation.

    Arrays: D1(x,y), D2(y,z), D3(x,z).
    coord vars: x(x), y(y), z(z)

No grid, as currently defined, can represent this because the three coordinate variables x(x), y(y), and z(z) cannot be properly distributed across the three needed grids without duplication. The only way this can work is if all the Arrays and all the coordinate variables reside in a single grid, which is not, I maintain, a useful solution. Further, the Grid must change if new arrays are defined that use any of the coordinate variables, D4(x,w), for example.
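The duplication can be made concrete with a small sketch. It assumes a simplified model (not part of DAP2 itself) in which every scoped grid must carry its own copy of each map it uses; the function name `copies_needed` is hypothetical.

```python
# Arrays from the example above, keyed by name, with their dimension names.
arrays = {"D1": ("x", "y"), "D2": ("y", "z"), "D3": ("x", "z")}

def copies_needed(arrays):
    """Count the copies of each coordinate variable when every grid
    is a scope that must contain its own maps (simplified model)."""
    counts = {}
    for dims in arrays.values():
        for dim in dims:
            counts[dim] = counts.get(dim, 0) + 1
    return counts

# x is needed by D1 and D3, y by D1 and D2, z by D2 and D3.
print(copies_needed(arrays))
```

Under this model each of x, y, and z is stored twice, and adding D4(x,w) would require a third copy of x.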

Grid projections

When a projection is applied to a grid, the result cannot be a grid. This has been an ongoing source of problems in DAP2 where projecting the array component of a grid results in a structure. From the point of view of semantics, this is a really bad idea.

Multi-dimensional coordinate variables

When representing point data, it is desirable to have coordinate variables distinguished using more than a single dimension. Consider the following:

    array: temp(x,y,z)
    coordinate vars: lat(x,y,z), lon(x,y,z), and depth(x,y,z).

Here we are trying to represent point data where each point is defined by three coordinates: lat, lon, and depth. Grids are not capable of properly representing this case. I should note that neither netcdf-3 nor netcdf-4, for example, can represent it either. CDM can do it, but only by encoding the proper relationships as attributes with complex internal structure.

Coordinate Variable Duplication

In examining a large number of DAP2 DDS's, I note that coordinate variables inside grids are almost always duplicated outside the grid. My hypothesis has been that this is a result of the first problem above (grid as scoping/lexical container). In any case, the proposal below would obviate the need for duplication.

Proposal: Grid as mapping

Rather than making grids be scope containers, grids need to be simple relationship instances between an array and its coordinate variables. This would be done by associating the coordinate variables with an array variable.

Specifically:

  1. The Grid data type in DAP4 should shed the enclosing lexical scope
  2. Grid is a relation that binds one or more coordinate variables (aka maps) to one Array.
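As an illustration only, the two points above might be modeled as follows. The class names `Array` and `Grid` and their fields are assumptions made for this sketch, not a DAP4 API: every variable is declared once at the top level, and a grid merely references its array and maps by name.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Array:
    name: str
    dims: tuple  # ordered dimension names

@dataclass(frozen=True)
class Grid:
    array: str       # name of the array variable
    maps: frozenset  # names of the coordinate variables bound to it

# Each variable is declared exactly once; no enclosing grid scope exists.
variables = {
    v.name: v
    for v in (
        Array("x", ("x",)), Array("y", ("y",)), Array("z", ("z",)),
        Array("D1", ("x", "y")), Array("D2", ("y", "z")), Array("D3", ("x", "z")),
    )
}

# Grids are relations over those declarations.
grids = [
    Grid("D1", frozenset({"x", "y"})),
    Grid("D2", frozenset({"y", "z"})),
    Grid("D3", frozenset({"x", "z"})),
]

# Each map is referenced by two grids but stored only once.
references = {m: sum(m in g.maps for g in grids) for m in ("x", "y", "z")}
print(references)
```

Adding D4(x,w) in this model means declaring D4 and w and adding one more relation; no existing grid changes.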

Using OGC coverage terminology, we have this.

  1. The maps specify the Domain
  2. The array specifies the Range
  3. The Grid itself is a Coverage
  4. The Domain and Range are sampled functions

There are a number of constraints on the form of maps and their relationship to the Array.

Assume we have a (grid) array of the form

float32 A[d1,...,dn]

and associated maps

float32 M1[d1,d2];
float32 M2[d3,d4];

Assume the following definitions:

  • Let {A} be the set of dimensions of A, namely {d1,...,dn}.
  • Let {D} be the set of dimensions mentioned in any of the map variables, so in our case above {D} = {d1,d2} union {d3,d4}, the union of the dimension sets of M1 and M2.
  • Let |A| be the rank of A (n in this case), and likewise |Mi| the rank of map Mi.
  • Let |{...}| be the number of elements in a set.
  • Let {Mi} be the set of maps (= {M1,M2} in this case).

Using this notation, the constraints are as follows.

  1. |Mi| <= |A| : i.e. the rank of each map variable is no greater than that of the grid array.
  2. {Mi} has no fixed upper bound : i.e. there can be as many maps as desired.
  3. {D} is a subset of (or equal to) {A} : i.e. every named dimension mentioned in the map variables must appear in the set of dimensions of A.
  4. |A| = |{A}| : i.e. the dimensions of A may not contain duplicates, so A[x,x] is disallowed.
  5. {Mi} is in fact a set, which means that any duplicates are ignored and the order is irrelevant. So {Mi} = {m1,m1,m2} is the same as {m1,m2}, which is the same as {m2,m1}.
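A minimal sketch of how a server or client might check these constraints. The function name `check_grid` and its argument shapes are assumptions for illustration. Constraints 2 and 5 need no explicit check here: passing the maps as a dictionary keyed by name already makes them an unordered collection without duplicates and of any size.

```python
def check_grid(array_dims, maps):
    """Return a list of violated constraints (empty if the grid is valid).

    array_dims: ordered dimension names of the array A, e.g. ("d1", "d2")
    maps: dict mapping each map name to its ordered dimension names
    """
    errors = []
    a_set = set(array_dims)
    # Constraint 4: |A| = |{A}|, so A may not repeat a dimension.
    if len(array_dims) != len(a_set):
        errors.append("array has duplicate dimensions")
    for name, dims in maps.items():
        # Constraint 1: |Mi| <= |A|.
        if len(dims) > len(array_dims):
            errors.append(f"map {name} has rank greater than the array")
        # Constraint 3: every dimension of a map must be a dimension of A.
        missing = set(dims) - a_set
        if missing:
            errors.append(
                f"map {name} uses dimensions not in the array: {sorted(missing)}"
            )
    return errors

# The example above, taking n = 4 so that A has dimensions d1..d4.
print(check_grid(("d1", "d2", "d3", "d4"),
                 {"M1": ("d1", "d2"), "M2": ("d3", "d4")}))
```

Returning a list of messages rather than raising on the first failure is a design choice of this sketch, so that all violations can be reported at once.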

Discussion

The remaining issue seems to be how 'point data' are represented. There are two candidate representations for point data:

  • CDM/CF-1.6: a grid/coverage in which the array has one dimension and there are several maps, each also of one dimension (example: temperature data held in a one-dimensional array with two maps, one for lat and one for lon).
  • Using Sequence: The same data can be represented using a sequence with three columns (one for temperature and one each for lat and lon).
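The two representations carry the same information, as the sketch below suggests. The values are made up purely for illustration, and the helper names `arrays_to_sequence` and `sequence_to_arrays` are hypothetical.

```python
# CDM/CF-1.6 style: a 1-D array plus 1-D maps over the same dimension.
# All values below are invented for illustration.
temp = [12.1, 13.4, 11.8]
lat = [40.0, 41.5, 42.0]
lon = [-70.0, -71.2, -69.9]

def arrays_to_sequence(array, *maps):
    """Turn an array and its maps into sequence rows, one tuple per point."""
    return list(zip(array, *maps))

def sequence_to_arrays(rows):
    """Turn sequence rows back into one list per column."""
    return [list(column) for column in zip(*rows)]

rows = arrays_to_sequence(temp, lat, lon)
print(rows[0])                   # first point as a (temp, lat, lon) row
print(sequence_to_arrays(rows))  # the original three columns
```

The conversion is lossless in both directions for data of this shape, which is why the open question concerns transport and representation rather than suitability.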

There's no debate about the suitability of each of the above to represent 'point data'.

However, there is debate about how best to transport and/or represent this information. That is, given that many systems will store point data in a relational database while many others will adopt CF-1.6 and use arrays to store the data, does adopting this mean that servers (in the aggregate) will provide two different representations for the same kind of data? My (Jimg) prediction, based on the past, is that servers will have to be modified to provide the kind of responses different clients expect (rather than the case where clients are written to process each of the representations). This is a function of users/clients tending to cluster around different application areas. It may not matter for within-domain access, but it will hinder cross-domain access.

We have moved this discussion to the proposal about Sequences and VLens.

Dennis Heimbigner (5/17/2012): The above proposal allows for map variables that are fields of a dimensioned structure. This means that code handling map names has to be prepared to deal with names like this:

/g/S[0].f

I think this is undesirable and I propose that we do two things.

  1. Distinguish variables from fields; a variable is a top-level decl in a group, a field is a decl within a structure or sequence.
  2. Require that all map variables and all variables with maps be variables and not be allowed to be fields.
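A rough sketch of how the second rule could be enforced on a map name. It assumes, based only on the example name above, that "/" separates groups, "." introduces a field, and "[...]" is an index; it ignores any escaping rules a real name syntax might have. The function name `is_top_level_variable` is hypothetical.

```python
def is_top_level_variable(path):
    """True if path names a variable declared directly in a group,
    i.e. it contains no field access and no index (simplified check)."""
    return path.startswith("/") and "." not in path and "[" not in path

print(is_top_level_variable("/g/S[0].f"))  # a field of a dimensioned structure
print(is_top_level_variable("/g/lat"))     # a variable declared in group g
```

With such a check, a name like /g/S[0].f would be rejected as a map, while /g/lat would be accepted.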

Jimg 13:34, 17 May 2012 (PDT) I agree. Allowing Maps and 'Variables with Maps' to be fields is a needless mess. If there is some dataset that is really like that, the DAP layer, which is necessarily an abstraction, can hide those details. My preference in this case is for a bit more complexity on the server side to simplify the logic of clients.