DAP4: Constraint Expressions, v2: Difference between revisions

From OPeNDAP Documentation

Revision as of 23:49, 30 July 2013


Background

At the OPULS meeting in Boulder, CO on 1-3 Oct 2012 we discussed how the concepts of the DAP2 constraint expressions could be extended to DAP4. We decided that the existing capabilities available in DAP2 should be included, but with some changes due to the differences in DAP4's data model and to fix problems with the DAP2 syntax and semantics. DAP4 constraint expressions (CEs) will support projection in much the same way that DAP2's CEs did. Variables are chosen from a dataset by listing them in the CE in a comma-separated list. We decided that the language and syntax of the selection was flawed and will be replaced by filters that are applied to single clauses in the CE, not to the whole CE as was the case with DAP2. By dropping the database language of relations we are making clear that the CE semantics is not intended to support the relational calculus, but does support choosing specific values from different types of variables. Unlike DAP2, these filters can be applied to arrays as well as 'sequences' (which we have technically dropped from the data model in favor of structures with varying dimensions). Lastly, we talked at some length about the relative merits of incorporating a functional programming language into the CE syntax and decided to do so.

Problem Addressed

Remote data access depends on having a simple and powerful query interface. Users must be able to choose parts of a large and often very complex dataset. Unlike most 'Big Data,' the datasets that DAP (and, hence, OPULS) targets are highly structured, and the CE semantics must reflect this. In practice this means that we must be able to choose parts of a dataset by marking some or all of the variables it contains as the ones we would like to access. In addition, it is useful to be able to 'slice' array variables so that a subset, in index space, is returned. In addition, it is often necessary to subset variables by value, accessing all of the elements of an array or structure that are within a certain range.

Proposed Solution

The proposed solution is based on the existing DAP2 constraint expression syntax, with two main modifications that result from differences in the DAP4 data model as well as fixes for ambiguities in the DAP2 CE syntax and semantics. The ambiguities arose from applying a DAP2 selection to the entire CE, which didn't always make sense (so the user or evaluator had to make some arbitrary choices about what a particular selection subexpression meant), and from calling functions on the dataset or in the selection part of the CE. The latter was less of an ambiguity than an annoyance for users because important metadata about the potential return from a function was not available in the DAP2 CE. Lastly, some subsetting expressions would alter the type of the variable(s) they returned (e.g., Grids could be returned as Structures in some cases) and this behavior was either the cause of much complexity or not handled properly by clients.

Terminology used by this document:

selection expression
The entire expression passed to the server that is used to choose specific parts of a dataset.
subset
The act of choosing parts of a dataset based on the type of one or more of its variables. We define several types of subsetting operations as follows:
index subsetting
Choosing parts of an array based on the indexes of that array's dimensions. This operation always returns an array of the same rank as the original, although the size of the return array will (likely) be smaller. Index subsetting uses the bracket syntax described later.
field subsetting
Choosing specific variables or fields from the dataset. A dataset in DAP4 is made up of a number of variables and those may be Structures or Sequences that contain fields (and, in effect, the Dataset is itself a Structure and all of its variables are fields - the distinction is more convenience than formal). Field subsetting uses the brace syntax described later. One or more fields can be specified using a semicolon (;) as the separator.
filter
A filter is a predicate that can be used to choose data elements based on their values. The vertical bar (|) is used as a prefix operator for the filter predicate. Filters can be applied to elements of an Array or fields of a Sequence. A filter predicate consists of one or more filter subexpressions, separated by commas (,).
filter subexpression
A simple expression that consists of a single variable/field, an operator, and a value to compare against. The operator is a relational operator (=, !=, <, <=, >, >=) for numbers or a comparison operator (=, ~=) for a string (~= is a regex compare).
id
The name of a variable. These can be relative or absolute. Absolute names use the FQN proposal (See ).

Subsetting Constraints

The simplest constraint is the null string and it means 'return everything' from the dataset. Choosing variables in a dataset is referred to as the subset. To choose a subset of the variables in a dataset, enumerate them in a semicolon-separated list. To choose parts of a Structure, name those parts explicitly using the syntax structure name{field name}. Each DAP4 dataset contains one or more Groups; the top-level Group is always present and is named / (pronounced 'root'). If the root Group is the only Group in the dataset, it does not need to be named when listing variables in the CE. However, if there are other Groups in the dataset, each Group other than the root Group must be named. In any case, naming the root Group is optional.

Names are case sensitive.

Example: subsetting by variable or field

Note: The syntax used for the examples is (hopefully) easier to read than the DAP4 DMR, which uses XML; curly braces indicate hierarchy.

Dataset {
    Int32 u;
    Int32 v;
    Structure {
        Int32 x;
        Int32 y;
    } Point;
} projections;

Note: Variable names are case-sensitive.

Access just u
u
Access just u and v
u;v
Access just x within Point
Point{x}. This notation is based on the use of brackets in DAP4: Proposal for Structure Projection and DAP4: DAP4 Filter Constraints, with the exception that braces ({}) are likely easier to parse than brackets ([]). Arrays of both Structure and Sequence are possible, so if brackets were used for field selection as well as for array indexes, the grammar that defines the constraint expression syntax would become context sensitive.
Access u and v by explicitly naming their Group
/u;/v. Every dataset in DAP4 has a root Group, written /. When that is the only Group in a dataset, it is implicit in the CE, but you can still use its name explicitly.
Dataset {
    Int32 u;
    Int32 v;
    Group {
        Int32 u;
        Int32 v;
        Structure {
            Int32 x;
            Int32 y;
        } Point;
    } inst2;
} Explicit_Group;
Access 'top-level' u and v
/u;/v or u;v
Access 'top-level' u and v and inst2's u and v
/u;/v;/inst2/u;/inst2/v, or u;v;/inst2/u;/inst2/v
Access inst2's u and v
/inst2/u;/inst2/v
Access Point 's x, which is inside the inst2 Group
/inst2/Point{x}
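
The rules above for listing variables can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of the proposal: the function names split_clauses and absolute_name are hypothetical, clauses are split on semicolons that are not nested inside braces or brackets, and a relative name is assumed to resolve against the root Group.

```python
def split_clauses(ce):
    """Split a constraint expression on top-level semicolons.

    Semicolons inside braces or brackets belong to a nested field list,
    so they do not end the clause. An empty CE yields an empty list,
    which stands for 'return everything'.
    """
    clauses, depth, current = [], 0, []
    for ch in ce:
        if ch in "{[":
            depth += 1
        elif ch in "}]":
            depth -= 1
        if ch == ";" and depth == 0:
            clauses.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    tail = "".join(current).strip()
    if tail:
        clauses.append(tail)
    return clauses


def absolute_name(clause):
    """Assumption: a relative name is looked up in the root Group, '/'."""
    return clause if clause.startswith("/") else "/" + clause


clauses = split_clauses("u;v;/inst2/Point{x;y}")
print(clauses)                              # ['u', 'v', '/inst2/Point{x;y}']
print([absolute_name(c) for c in clauses])  # ['/u', '/v', '/inst2/Point{x;y}']
```

A real evaluator would also have to check each name against the dataset's DMR; that step is left out here.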

Array Subsetting in Index Space

Subsetting fixed-size arrays in their index space is accomplished using square brackets. For an array with N dimensions, N sets of brackets are used, even if the array is only subset on some of the dimensions. The names of array variables are fully qualified names (FQNs) so it's possible to name arrays in structures and/or Groups. Array index values are zero-based as with a number of programming languages such as C and Java. Within the square brackets, several subexpressions are allowed:

[ n ]
return only the value(s) at a single index, where 0 <= n < N for a dimension of size N.
[ ]
return all of the elements for a particular dimension.
[ start : step : stop ]
return every step-th value between start and stop. This is the complete version of the syntax.
[ start : stop ]
return the values between start and stop (start and stop define a closed interval).
[ : stop ]
return the values from the start of the dimension to stop.
[ start : ]
return the values from start to the end of the dimension.
[ : step : stop ]
return every step-th value from the start of the dimension to stop.
[ start : step : ]
return every step-th value from start to the end of the dimension.
[ : step : ]
return every step-th value for the dimension.

The subsetting operator can be applied to one or more arrays in the CE.
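
To make the bracket forms concrete, here is a small Python sketch that turns the text between one pair of brackets into the list of indexes it selects. The name parse_index and the error handling are assumptions made for illustration; the only rules taken from the text are that indexes are zero-based, that an omitted start, step or stop defaults to the beginning, 1, or the end of the dimension, and that stop is inclusive.

```python
def parse_index(text, size):
    """Return the indexes selected by one bracket subexpression.

    `text` is what appears between '[' and ']'; `size` is the size of
    the dimension the brackets apply to.
    """
    parts = [p.strip() for p in text.split(":")]
    last = size - 1
    if parts == [""]:                        # [ ]
        start, step, stop = 0, 1, last
    elif len(parts) == 1:                    # [ n ]
        start, step, stop = int(parts[0]), 1, int(parts[0])
    elif len(parts) == 2:                    # [ start : stop ]
        start = int(parts[0]) if parts[0] else 0
        step = 1
        stop = int(parts[1]) if parts[1] else last
    elif len(parts) == 3:                    # [ start : step : stop ]
        start = int(parts[0]) if parts[0] else 0
        step = int(parts[1]) if parts[1] else 1
        stop = int(parts[2]) if parts[2] else last
    else:
        raise ValueError("too many ':' in index expression")
    if not (0 <= start <= stop < size) or step < 1:
        raise ValueError("index expression out of range")
    return list(range(start, stop + 1, step))    # stop is inclusive


print(parse_index("9:19", 256))        # indexes 9 through 19, eleven values
print(len(parse_index(":4:", 256)))    # 64
print(parse_index(":4:", 256)[-1])     # 252
```

Note the difference from Python's own slices, where the stop value is exclusive.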

Example: Subsetting in Index Space

Dataset {
    Int32 u[256][256];
    Int32 v[256][256];
    Structure {
        Int32 x;
        Int32 y;
    } Point[256];
} arrays;
Access all of u
u
Access all of Point 's x field
Point{x}. This returns an array of Structures with a single (Int32) field, not an array of Int32. So would Point{x}[] and Point{x}[0:1:255].
Access elements 10 to 20 of array Point
Point[9:19]. DAP4, like DAP2, uses zero-based indexes. This CE will return the 10th to the 20th elements (Structures in this case) of the array
Access every 4th element in the Point array
Point[0:4:255], or Point[:4:]. This is a simple decimation operation; this CE would return 64 Structures corresponding to elements at indexes 0, 4, 8, ..., 252 of the array.
The index-space and field subsetting may be combined in the logical way
Point{x}[:4:] will return an array of structures (with 64 elements) named Point that contains a single Int32 field named x.
Access parts of u and v
u[4:2:9][];v[4:2:9][]

Other possible CEs:

u[:4:][:4:]
every fourth element in both dimensions; this would return 1/16th of the array's data.
u[][9:19]
elements corresponding to every row and columns 10 to 20
u[7][9:19]
elements corresponding to the 8th row and columns 10 to 20
u[9:19][9:19]
elements corresponding to rows 10 to 20 and columns 10 to 20.
u[:19][:19]
elements corresponding to rows 1 to 20 and columns 1 to 20 (indexes 0 to 19).
u[][]
identical to u.
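
The sizes quoted for these CEs can be checked with a short Python sketch. The array u is modeled as a list of lists, and the helper names closed_range and subset_2d are hypothetical; the sketch only assumes the closed-interval rule and the rule that index subsetting preserves rank.

```python
def closed_range(start, step, stop):
    """DAP4-style [start:step:stop] as a Python range (stop is inclusive)."""
    return range(start, stop + 1, step)


def subset_2d(array, rows, cols):
    """Pick the given row and column indexes from a list-of-lists array."""
    return [[array[r][c] for c in cols] for r in rows]


size = 256
u = [[r * size + c for c in range(size)] for r in range(size)]

# u[:4:][:4:]
every_fourth = subset_2d(u, closed_range(0, 4, 255), closed_range(0, 4, 255))
print(len(every_fourth), len(every_fourth[0]))   # 64 64, i.e. 1/16 of u

# u[7][9:19]
row_8 = subset_2d(u, closed_range(7, 1, 7), closed_range(9, 1, 19))
print(len(row_8), len(row_8[0]))                 # 1 11: still two dimensions
```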

More complex subsetting examples

The data model for DAP4 is very similar to that of a modern structured programming language where constructor types like Structure may contain any allowed type, including other Structures and arrays of Structures as well as being arrays themselves. The basic syntax outlined so far for Structures, the selection of fields within a Structure and array dimension subsetting by index can be applied to these 'recursive' types by following the rules laid out in the basic cases. Some examples follow:

Dataset {
    Int32 u[256][1024];
    Structure {
        Int32 x;
        Int32 y[1024];
    } Points[256];
} example;
Points{y[7:256]}
Get all of the elements of the Array of Structure Points and, for each of those elements, get the elements at indexes 7 to 256 of the array y. Do not return the field x.
Points{y[0:9]}[0:9]
Get the first ten elements of Points and, for each of those, only the first ten elements of the array y.
Points[0:9]
Get the first ten elements of Points (both fields are included)
Points
Get all of Points
Dataset {
    Int32 u[256][1024];
    Structure {
        Int32 x;
        Int32 y;
        Structure {
            Int32 height[1024];
            Int32 pressure[1024];
        } sounding;
    } Points[256];
} example;
Points{x;y;sounding{height[:8:]}}[0]
Get only the first element of Points; get the fields x, y and sounding, but for sounding get only every 8th element of the field height.
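
As a rough illustration of how index and field subsetting combine on an array of Structures, the following Python sketch models Points from the first dataset in this section as a list of dictionaries. The function project and its argument layout are hypothetical; the point is only that the element subset and the per-field subset are applied independently and that unselected fields are dropped.

```python
def project(struct_array, element_slice, fields):
    """Index-subset an array of Structures, then keep only selected fields.

    `element_slice` is an inclusive (start, stop) pair. `fields` maps a
    field name to None (keep the whole field) or to an inclusive
    (start, stop) pair applied to that field's array value.
    """
    start, stop = element_slice
    result = []
    for element in struct_array[start:stop + 1]:
        kept = {}
        for name, index_pair in fields.items():
            value = element[name]
            if index_pair is not None:
                lo, hi = index_pair
                value = value[lo:hi + 1]
            kept[name] = value
        result.append(kept)
    return result


# Stand-in for: Structure { Int32 x; Int32 y[1024]; } Points[256];
points = [{"x": i, "y": list(range(1024))} for i in range(256)]

# Points{y[0:9]}[0:9]
subset = project(points, (0, 9), {"y": (0, 9)})
print(len(subset), sorted(subset[0]), len(subset[0]["y"]))   # 10 ['y'] 10
```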

How Sequences fit into this syntax

The Sequence type has been added back into the DAP4 type system (aka data model) as a way to encode what the Common Data Model (CDM) encodes using varying dimensions for arrays (see DAP4: VLEN proposal). As such, Sequence will be a more general data type than in DAP2 where it was significantly limited. In DAP4 Arrays of Sequences will be allowed as will Sequence elements that are themselves Arrays. Thus, Sequences will be a completely general data type, able to hold instances of any type and able to be instances of any type - of course there are only two constructor types: Sequence and Structure. However, Sequences have not only the projection capabilities but also filtering by value.

The by value filter is indicated by the 'pipe' symbol (|). It provides one or more expressions, separated by commas. The filter is true when all of the subexpressions are true. Like the brace and bracket syntactical elements, symbols in the expressions that are part of the filter may be either FQNs or identifiers in the scope of the variable to which the expression is associated.

Dataset {
    Sequence {
        Int32 x;
        Int32 y;
    } s1;

    Sequence {
        Int32 x;
        Int32 y;
    } s2[100];

    Sequence {
        Int32 x[10];
        Int32 z;
    } s3;

    Sequence {
        Int32 x[1024];
    } s4[100];
} example;
s1
All of Sequence s1.
s1{x;y}
Also all of Sequence s1.
s1{x}
Every 'row' of Sequence s1, but just field x.
s1{x;y}|x<7,y<9
All of s1 where the fields x and y satisfy the given filter expression (filtering is covered in detail in a subsequent section of this document).
s2{x;y}
All one hundred elements of the Array s2. Same as s2 and s2{x;y}[0:99].
s2{x;y}[0:9]
The first ten elements of s2. That would be 10 Sequences and, for each, both the fields x and y.
s2{x;y}[0:9]|x<8
The first ten elements of s2, where only the 'rows' where x<8 are included.
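
The row-by-row behavior of a Sequence filter can be sketched in Python by modeling a Sequence as a list of rows. The names OPS and filter_sequence, and the representation of a subexpression as a (field, operator, value) tuple, are assumptions made for this illustration.

```python
import operator

OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt,
       ">=": operator.ge, "=": operator.eq, "!=": operator.ne}


def filter_sequence(rows, fields, predicates):
    """Keep rows for which every (field, op, value) predicate is true,
    then project each surviving row onto `fields`."""
    kept = []
    for row in rows:
        if all(OPS[op](row[name], value) for name, op, value in predicates):
            kept.append({name: row[name] for name in fields})
    return kept


s1 = [{"x": 1, "y": 2}, {"x": 9, "y": 3}, {"x": 6, "y": 12}, {"x": 5, "y": 8}]

# The s1 example with both fields selected and the filter x<7,y<9
print(filter_sequence(s1, ["x", "y"], [("x", "<", 7), ("y", "<", 9)]))
# [{'x': 1, 'y': 2}, {'x': 5, 'y': 8}]
```

Rows that fail the filter are simply left out, so the result can have fewer rows than the original Sequence.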

Subsetting and Shared Dimensions

Shared dimensions provide a way to indicate that two or more variables share equivalent extents. In DAP2, the type Grid was used for this, but its limitations were significant. In DAP4, maps (aka coordinate variables) can be shared and can be of any dimension (although there are limitations on how they can be used with other maps). For this section, I will extend the notation used so far:

  • The keyword Dimensions introduces a list of symbols and their sizes. (That is the definition of a dimension in DAP4, a size bound to an identifier.)
  • Arrays where every dimension uses a Dimension to supply its extent.
  • Arrays that use parentheses () in place of brackets to indicate the sizes of dimensions and which use the names of maps to do so for at least one dimension.

Example of this syntax

Dataset {
    Dimensions: nlat=10, nlon=5;  // the sizes of the dimensions
    Float32 lat[nlat];            // This is a (potential) map
    Float32 lon[nlon];            // As is this

    Float32 temp(lon)(lat);       // The maps lat and lon are used here and define a grid (aka coverage)
    Float32 sal(lon)(lat);
} shared_dimensions;
Coherent Array Subsetting

Coherent array subsetting provides a way to apply the same index subsetting to a set of arrays, as long as they are related in certain ways. To be used in this kind of subsetting, it must be possible to take one set of index subsetting constraints and use it with a set of arrays. If all of the arrays were of the same rank where all their dimensions were the same size, this would be easy. To be used in coherent array subsetting, the arrays must either be maps or grids. If they are grids, each must have exactly the same shape: the same rank, dimension size and order, and they must use the same maps for the same dimensions. For maps to be used in this subsetting operation, they must be maps used by the grids. There must be at least one grid; there may be zero or more maps. The return type will include relevant Dimension objects if there are maps in the set of arrays. The syntax for this borrows from the brace and bracket notations already introduced.

{temp,lat,lon}[0:9][10:19]
This will return temp[0:9][10:19], lat[0:9], lon[10:19] and Dimensions nlat=10, nlon=10.
{temp,lat,lon,sal}[0:9][10:19]
This will return temp[0:9][10:19], lat[0:9], lon[10:19], sal[0:9][10:19] and Dimensions nlat=10, nlon=10 (i.e., the maps and grids may be listed in any order).
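
A minimal Python sketch of the idea follows, assuming two-dimensional grids stored as lists of lists and one-dimensional maps. The function coherent_subset is hypothetical, and the dimension sizes used here are made up so that both index ranges are in bounds; they are not the sizes from the example dataset above.

```python
def coherent_subset(grids, maps, slices):
    """Apply one inclusive (start, stop) pair per dimension to every array.

    `grids` maps names to 2-D list-of-lists arrays that share a shape;
    `maps` maps names to (dimension position, 1-D list) pairs.
    """
    (r0, r1), (c0, c1) = slices
    result = {}
    for name, grid in grids.items():
        result[name] = [row[c0:c1 + 1] for row in grid[r0:r1 + 1]]
    for name, (position, values) in maps.items():
        start, stop = slices[position]
        result[name] = values[start:stop + 1]
    return result


# Hypothetical sizes chosen so the index ranges are in bounds.
nlat, nlon = 20, 30
lat = [float(i) for i in range(nlat)]
lon = [float(j) for j in range(nlon)]
temp = [[i * nlon + j for j in range(nlon)] for i in range(nlat)]

# {temp,lat,lon}[0:9][10:19]
out = coherent_subset({"temp": temp}, {"lat": (0, lat), "lon": (1, lon)},
                      [(0, 9), (10, 19)])
print(len(out["temp"]), len(out["temp"][0]), len(out["lat"]), len(out["lon"]))
# 10 10 10 10
```

Each map receives only the index range of the dimension it describes, which is why the returned maps stay aligned with the returned grid.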

Filters

While subsetting provides ways to choose data based on the structure of a dataset, filters provide a way to choose data based on their values. The values to be returned are denoted using simple predicates. When an array is filtered by value, the elements of the array that fail the filter predicate(s) will be replaced with a No Data (ND) value. The array may also have an index subset, in which case that will be applied before the filter predicates are evaluated for the remaining elements.

The general syntax for a filter expression is to follow a subset expression with a pipe (|) and one or more filter predicates. Multiple predicates are separated by commas and the value of the complete predicate is the logical AND of the comma-separated subexpressions.

Filter expressions can be applied to both Array and Sequence variables. In each case the result of the filter operation is a value in a variable of the same type. A filter applied to an Array returns an array with the same shape (the same rank and the same dimension sizes) but the elements that fail the predicate will be replaced with the variable's No Data value. For an Array, it is possible to supply a value to use for No Data if no such value is part of the dataset's metadata (or one that will be used in preference to any such value). A Sequence variable is essentially a table of values and thus can be thought of as containing a number of rows; the filter expression is applied to each row in the order those rows are provided to the expression evaluator. Every row that satisfies the predicate will be included in the value returned; those that don't will be elided from the result.
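
The text that follows the pipe could be taken apart along the following lines. This Python sketch is an assumption-laden illustration, not the proposed grammar: parse_filter is a hypothetical name, values are kept as uninterpreted strings, a subexpression named ND with the = operator is treated as the No Data setting, and chained comparisons and regex values that contain commas are deliberately not handled.

```python
import re

# Longest operators first so '<=' is not read as '<' followed by '='.
_SUBEXPR = re.compile(
    r"^\s*([/A-Za-z_][\w./]*)\s*(<=|>=|!=|~=|<|>|=)\s*(.+?)\s*$")


def parse_filter(text):
    """Split the text after '|' into subexpressions and an optional ND value."""
    predicates, no_data = [], None
    for part in text.split(","):
        match = _SUBEXPR.match(part)
        if match is None:
            raise ValueError("unsupported subexpression: " + part.strip())
        name, op, value = match.groups()
        if name == "ND" and op == "=":
            no_data = value
        else:
            predicates.append((name, op, value))
    return predicates, no_data


print(parse_filter("temp<7,ND=0"))   # ([('temp', '<', '7')], '0')
print(parse_filter("x<8, y>=3"))     # ([('x', '<', '8'), ('y', '>=', '3')], None)
```

The list of predicates is meant to be combined with a logical AND, as described above.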

Example: Filters

Dataset {
    Float64 temp[10];
    Float32 u[256][256];
    Float32 v[256][256];
    Sequence {
        Int32 x;
        Int32 y;
    } Points;
} arrays;
temp|temp<7
Return all ten elements of temp but replace those elements that do not satisfy the predicate temp < 7 with the variable's No Data value.
temp|temp<7,ND=0
Same as the above, but use zero as the value for No Data.
u|u>10,ND=-1e-37
Return u with any element that is not > 10 set to the No Data value of -1 x 10^-37.
u|u>10, ND=NaN
Return u with any element that is not > 10 set to the No Data value of NaN.
u|u>10
Assume that the variable u has no missing_value attribute; then this will return u with any element that is not > 10 set to the No Data value of NaN (NaN is the default No Data value for the Float32 and Float64 types).
u[0:2:99][0:2:99]|u>10
This shows how filtering can be combined with subsetting by index using this syntax. The index subsetting is performed first and the resulting 50 x 50 array is filtered. The return value is a 50 x 50 Float32 array with elements <= 10 replaced with NaN (or with the value of the variable's missing_value attribute).
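
The array case can be sketched in Python as follows. The function filter_array and the sample values are hypothetical, and the second CE in the comments is made up for this sketch rather than taken from the examples; the behavior shown is the one described above, where the result keeps its shape and failing elements become the No Data value, with NaN as the floating-point default.

```python
import math


def filter_array(values, keep, no_data=math.nan):
    """Return a same-length list with elements failing `keep` replaced."""
    return [v if keep(v) else no_data for v in values]


temp = [1.0, 9.5, 6.9, 7.0, 8.2, 12.0, 0.0, 7.1, 5.5, 8.8]

# temp|temp<7,ND=0
print(filter_array(temp, lambda v: v < 7, no_data=0))
# [1.0, 0, 6.9, 0, 0, 0, 0.0, 0, 5.5, 0]

# Index subset first, then filter: temp[0:2:9]|temp<7 with the default NaN
subset = temp[0:10:2]               # closed interval 0..9 with step 2
filtered = filter_array(subset, lambda v: v < 7)
print(len(filtered), [math.isnan(v) for v in filtered])
# 5 [False, False, True, False, False]
```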

Filters and more complex data types

The basic syntax for filters is a subsetting expression, a literal |, and then one or more filter predicates. This syntax can appear any place a selection expression can appear, so it can be used inside braces when an Array or Sequence is a field of a Structure or Sequence. Some examples follow.

Example: Filters on complex types

Dataset {
    Sequence {
        Int32 x[100];
        Int32 y;
    } Points1;

    Sequence {
        Int32 x;
        Int32 y;
        Sequence {
            Int32 depth;
            Int32 temp;
        } sounding;
    } Points2[20];

    Structure {
        Int32 x[;
        Int32 y;
        Sequence {
            Int32 depth;
            Int32 temps[4];
        } raw;
    } Points3[100];

} complex_types_example;
Points1{x[0:9]|x<10,ND=-127;y}|y<3
For the Sequence Points1, return the rows of data where y is less than 3. In those rows, subset x so that only the first ten elements are included and then filter that so that elements >= 10 are replaced by the No Data value of -127.
Points2[10:19] { x; y; sounding { | depth > 10 } } | 20 < x < 40, y <35
This selection expression first finds the index subset of Points2 (the closed interval [10:19] covers the last ten of its 20 elements) and arranges to return the fields x, y and sounding where x and y satisfy the filter predicate. For the field sounding, which is a Sequence itself, it will return both fields and all rows where depth is > 10. This example points out an important aspect of the syntax and of expression evaluation: the filter predicates are evaluated after the index and variable and/or field subsetting. The complete filter predicates can be evaluated in any order (i.e., the 20 < x < 40, y <35 and depth > 10 predicates can happen in any order). The order of evaluation of the filter predicate subexpressions (i.e., 20 < x < 40 and y <35) is also unspecified. Another way to write this selection expression is...
Points2[10:19] | 20 < x < 40, y <35, sounding.depth > 10
The same result as above...
Points3[3:2:8] {x; y; raw{temps[2]} | temps[2]>7,ND=-1}
In this expression the temps field of the Sequence raw is still an Array; it is just an Array with a single element, which illustrates that neither the subsetting nor the filtering operations alter the types of the variables.
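
One possible reading of the brace form of the Points2 example is sketched below in Python. The data, the function filter_points and the choice to keep an array element whose rows are all filtered out as an empty Sequence are assumptions of this sketch; the proposal itself does not spell out that last point.

```python
def filter_points(points_array, first, last):
    """Index-subset an array of Sequences, filter rows, then filter sounding."""
    result = []
    for sequence in points_array[first:last + 1]:        # index subset
        rows = []
        for row in sequence:
            if 20 < row["x"] < 40 and row["y"] < 35:      # outer filter
                sounding = [s for s in row["sounding"] if s["depth"] > 10]
                rows.append({"x": row["x"], "y": row["y"],
                             "sounding": sounding})
        result.append(rows)
    return result


# A made-up array of two Sequences standing in for Points2.
points2 = [
    [{"x": 25, "y": 30, "sounding": [{"depth": 5, "temp": 12},
                                     {"depth": 15, "temp": 9}]},
     {"x": 50, "y": 10, "sounding": [{"depth": 20, "temp": 8}]}],
    [{"x": 30, "y": 40, "sounding": []}],
]

out = filter_points(points2, 0, 1)
print(out[0])   # [{'x': 25, 'y': 30, 'sounding': [{'depth': 15, 'temp': 9}]}]
print(out[1])   # []
```

The inner filter only removes rows of sounding; it never removes a row of the enclosing Sequence in this reading.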

Filters and grid variables

Example: Filters and grids with vector maps

Example: Filters and grids with N-dimensional maps

Jimg 15:18, 30 July 2013 (PDT) More editing to be done below this point

Grammar


dims: lat=10, lon=5
float temp[lat][lon]
float lat[lat]
float lon[lon]

CE: i=[0..n],j=[0..m],

  {temp[i][lon],lat[i],lon[j]
   | lon[j]>90&lon[j]<128
    & lat[i]>40&lat[i]<60}

returns:

struct {

  int i
  int j
  float lat
  float lon
  temp

} <name>[*]

versus

struct {

  int i
  float lat

} lat[*]

struct {

  int j
  float lon

} lon[*]

struct {

  int i
  int j
  float temp

} temp[*]


Shared dim "grid" case:

dims: lat=10, lon=5
float temp[lat][lon] maps: lat, lon
float lat[lat]
float lon[lon]

CE: lat=[*],lon=[*],

  {temp[lat][lon],lat[lat],lon[lon]
   | lon[lon]>90&lon[lon]<128
    & lat[lat]>40&lat[lat]<60}

result:?
dim: lat=m lon=n
float temp[lat][lon] maps:lat,lon
float lat[lat]
float lon[lon]


Shared dim "swathe" case:

dims: x=10, y=5
float temp[x][y] maps: lat, lon
float lat[x][y]
float lon[x][y]

CE: x=[*],y=[*],

  {temp[x][y],lat[x][y],lon[x][y]
   | lon[x][y]>90&lon[x][y]<128
    & lat[x][y]>40&lat[x][y]<60}

returns:

Note that [x,y] is ragged, not rectangular.

struct {

  int x
  int y
  lat
  lon
  temp

} <name>[*]

versus

struct {int x; int y; float lat;} lat[*]
struct {int x; int y; float lon;} lon[*]
struct {int x; int y; float temp;} temp[*]


Using function:

f(temp,lat,lon,90,128,40,60)

return:

dims: lat=n, lon=m //m,n computed by looking at lat[] and lon[]
float temp[lat][lon]
float lat[lat]
float lon[lon]

or struct {

float lat;
float lon;
float temp;

} X[*];



Consider the following

i=[0..n],temp[i]|temp[i]==temp[i+1]


i=[0..n],j=[0..m],temp[i][j]|temp[i][j]==temp[j][i]

======================================

{i=[0:5],j=*,S1[i]|S2[j].depth=100.0}

Structure {

int i;
Structure {
  float lat;
  float lon;
  Structure {
    float depth;
    float temp;
  } S2[*];
 } S1;

} S1[*];



=>float temp[*]



float lat[lat]

lat=[0:99] ... lat|(lat[lat] > 7 & lat[lat] < 10)


i=[0:99],lat|(lat[i] > 7 & lat[i] < 10)

i[*1],lat[*2] *1=*2

structure {

 int i[*]
 float lat[*]

} lat;

structure {

 int i
 float lat

} lat[*];


x=10; y=20
float lat[x][y]

lat[i=0:9][j=1:3]|(lat[i][j] > 7 & lat[i][j] < 10)

structure {

 int i
 int j
 float lat

} lat[*];

float temp[10][20] float lat[19] float lon[18]

DIMS=i=[0:5],j=[1:2] CE=temp[i][j],lat[i],lon[j]

temp[0:5][2],lat[0:5],lon[2]

lat1=0..5 lon1=2 temp[lon1][lat1],lat[lat1],lon[lon1]

dim: lat1, lon1 temp[lat1][lon1] lat[lat1] lon[lon1]


iterator variable defined new shared dimensions => iterator

CE=<shared dim iterators>{<local iterator>expr>}{...}

similar to ferret, except they use let x=...


relational calculus: {seq1.c1,seq1.c2,seq2.c3,seq3.c4 where seq1.c1 = seq2.c3}

semi-join: {seq1.c1,seq1.c2, where seq1.c1 = seq2.c3}

seq1,seq2|seq1.key1=seq2.key2

Structure {

float lat;
float lon;
float depth;
float temp;

} S1[*];

relational selection:

CE: S1|S1.depth >100 returns: struct {lat;lon;depth;temp} S1[*]

relational projection:

CE: S1.(lat,depth) returns: Structure { lat; depth} S1[*]

S1.(lat) struct {lat} S1[*]

S1.lat lat[*]


Structure {

float lat;
float lon;
Structure {
  float depth;
  float temp;
} S2[*];

} S1[*];

If S1[*] CE: j=*,S1[j] | S1.lat>5 CE: S1[j=*] | S1.lat>5

if S1[5] CE: j=*,S1[j] | S1.lat>5

return: Note user is responsible for name conflict with the iterator name. Structure {

int j;
float lat;
float lon;
Structure {
  float depth;
  float temp;
} S2[*];

} S1[*];


Structure {

float lat;
float lon;
Structure {
  float depth;
  float temp;
} S2[*];

} S1[*];

CE: i=*,j=*,S1[i] | S1[i].S2[j].temp > 10 CE: S1[i=*] <where does j go?>| S1[i].S2[j].temp > 10

  should be: S1 | S1.S2.temp > 10

return: Structure {

int i
float lat;
float lon;
Structure {
  int j
  float depth;
  float temp;
} S2[*];

} S1[m];


Structure {

float lat;
float lon;
Structure {
  float depth;
  float temp;
} S2[5];

} S1[10];

CE: S1 | S1.S2.temp > 10

return: Structure {

float lat;
float lon;
Structure {
  float depth;
  float temp;
} S2[*];

} S1[m];


Structure {

int x;
Structure {
  int y;
} S2[*];
Structure {
  int z;
} S3[*]

} S1[m];

?what about: CE: S1 | S1.S2.y = S1.S3.z

-->

Discussion