DAP4: Constraint Expressions, v2: Difference between revisions

From OPeNDAP Documentation
⧼opendap2-jumptonavigation⧽
mNo edit summary
 
(161 intermediate revisions by 2 users not shown)
Line 3: Line 3:


= Background =
= Background =
At the OPULS meeting in Boulder CO on 1-3 Oct 2012 we discussed how the concepts of the DAP2 constraint expressions could be extended to DAP4. We decided that the existing capabilities available in DAP2 should be included, but with some changes due to the differences in DAP4's data model and to fix problems with the DAP2 syntax and semantics. DAP4 constraint expressions (CEs) will support ''projection'' in much the same way that DAP2's CEs did. Variables are chosen from a dataset by listing them in the CE in a comma separate list. We decided that the language and syntax of the ''selection'' was flawed and will be replaced by ''filters'' that are applied to single clauses in the CE, not to the whole CE as was the case with DAP2. By dropping the database language of ''relations'' we are making a distinction that the CE semantics does not intended to support the relational calculus, but does support choosing specific values from different types of variables. Unlike DAP2, these ''filters'' can be applied to arrays as well as 'sequences' (which we have technically dropped from the data model in favor of structures with varying dimensions). Lastly, we talked as some length about the relative merits of incorporating a functional programming language into the CE syntax and decided to do so.
At the OPULS meeting in Boulder CO on 1-3 Oct 2012 we discussed how the concepts of the DAP2 constraint expressions could be extended to DAP4. We decided that the existing capabilities available in DAP2 should be included, but with some changes due to the differences in DAP4's data model and to fix problems with the DAP2 syntax and semantics. DAP4 constraint expressions (CEs) will support ''projection'' in much the same way that DAP2's CEs did (but we're going to call it ''subsetting''). Variables are chosen from a dataset by listing them in the CE in a semicolon separated list. We decided that the language and syntax of the DAP2 ''selection'' was flawed and will be replaced by ''filters'' that are applied to single clauses in the CE, not to the whole CE as was the case with DAP2. By dropping the database language of ''relations'' we are making a distinction that the CE semantics does not intended to support the relational calculus, but does support choosing specific values from different types of variables. Unlike DAP2, these ''filters'' can be applied to arrays as well as Sequences. Lastly, we talked at some length about the relative merits of incorporating something akin to a functional programming language into the CE syntax and decided to do so.


= Problem Addressed =
= Problem Addressed =
Remote data access depends on having a simple and powerful query interface. Users must be able to choose parts of a large and often very complex dataset. Unlike most 'Big Data,' the datasets that DAP (and, hence, OPULS) targets are highly structured, and the CE semantics must reflect this. In practice this means that we must be able to choose parts of a dataset by marking some or all of the variables it contains as the ones we would like to access. In addition, it is useful to be able to 'slice' array variables so that a subset, in index space, is returned. In addition, it is often necessary to subset variables by value, accessing all of the elements of an array or structure that are within a certain range.
Remote data access depends on having a simple and powerful query interface. Users must be able to choose parts of a large and often very complex dataset. Unlike most 'Big Data,' the datasets that DAP (and, hence, OPULS) target are highly structured, and the CE semantics must reflect this. In practice this means that we must be able to choose parts of a dataset by marking some or all of the variables it contains as the ones we would like to access. In addition, it is useful to be able to 'slice' array variables so that a subset, in index space, is returned. In addition, it is often necessary to subset variables by value, finding all of the elements of an array or sequence that are within a certain range.


= Proposed Solution =
= Proposed Solution =
The proposed solution is based on the existing DAP2 constraint expression syntax, with two main modifications that result from differences in the DAP4 data model as well as fixes for ambiguities in the DAP2 CE syntax and semantics. The ambiguities arose from applying a DAP2 ''selection'' to the entire CE, which didn't always make sense (so the user or evaluator had to make some arbitrary choices about what a particular selection subexpression meant) and in calling functions on the dataset or in the selection-part of the CE. the latter was less of an ambiguity than an annoyance for users because important metadata about the potential return from a function was not available in the DAP2 CE. Lastly, some subsetting expressions would alter the type of the variable(s) they returned (i.e., Grids could be returned as Structures in some cases) and this behavior was either the cause of much complexity or not handled properly by clients.
The DAP4 Constraint Expression (CE) syntax is an extension of the syntax used by DAP2 that adds some important new features for Arrays and Coverages (Grids) as well as addressing some ambiguities in the DAP2 syntax. In this design we also introduce some new terminology to make the explanation of the CE syntax clearer. Additionally, we use a 'curly brace' notation for datasets to streamline the description of datasets because the XML documents that DAP4 servers produce is verbose and hard for humans to read.


Terminology used by this document:
A CE lists which variables and what parts of those variables a server should return to a client. When a client makes a request to a DAP4 server, it always sends a CE where an empty CE is interpreted to mean that the client wants the entire dataset sent. A CE is made up of a list of clauses, each of which names a variable in the dataset that the client would like the server to send to it. Each clause can further be broken down into two parts: The subset expression and the filter expression. There are limitations on the CE clauses depending on variable type. For scalar variables, getting the variable is the only option available, so the subset expression is limited and no filter expression is supported. Structure variables can be subset by field but do not support filter expressions. Sequences can be subset by field and do support filters. Arrays support index subsets and filters. Lastly, ''coverages'' support all of the things Array support but include some special filter operators that build subsets based on values.
 
New ideas/features added for DAP4 include:
* Using a grouping operator for Structures and Sequences
* Filters on Arrays
* Masking for arrays (setting elements in one array to No Data based on the elements set to No Data in another array)
* Computed subsetting for Coverages
* Sequence filtering expressions explicitly bound to a specific Sequence variable
 
DAP4 type system notes:
* Arrays always have fixed dimensions
* Structures and Sequences can contain any type.
* DAP2's Grid is gone; we use explicit Maps and Dimensions to provide a way to denote Coverages, which are more general than DAP2 Grids.
* We provide the data model equivalent of arrays with varying dimensions by combining instances of Arrays and Sequences. This results in new characteristics for DAP4:
** Sequences can contain Arrays
** DAP4 supports Arrays of Sequences
** DAP4 also supports Structures as fields of Sequences
 
== Terminology used by this document ==
;selection expression: The entire expression passed to the server that is used to choose specific parts of a dataset.
;selection expression: The entire expression passed to the server that is used to choose specific parts of a dataset.
;subset: The act of choosing parts of a dataset based on the ''type'' of one or more of its variables. We define several types of subsetting operations as follows:
;subset: The act of choosing parts of a dataset based on the ''type'' of one or more of its variables. We define several types of subsetting operations as follows:
Line 17: Line 35:
;field subsetting: Choosing specific variables or fields from the dataset. A dataset in DAP4 is made up of a number of variables and those may be Structures or Sequences that contain fields (and, in effect, the Dataset is itself a Structure and all of its variables are fields - the distinction is more convenience than formal). Field subsetting using the brace syntax described later. One or more fields can be specified using a semicolon (''';''') as the separator.
;field subsetting: Choosing specific variables or fields from the dataset. A dataset in DAP4 is made up of a number of variables and those may be Structures or Sequences that contain fields (and, in effect, the Dataset is itself a Structure and all of its variables are fields - the distinction is more convenience than formal). Field subsetting using the brace syntax described later. One or more fields can be specified using a semicolon (''';''') as the separator.
;filter: A filter is a predicate that can be used to choose data elements based on their values. the vertical bar ('''|''') is used as a prefix operator for the filter predicate. Filters can be applied to elements of an Array or fields of a Sequence. A filter predicate consists of one or more filter subexpressions. One or more subexpressions can be specified, using a comma (''',''') as the separator.
;filter: A filter is a predicate that can be used to choose data elements based on their values. the vertical bar ('''|''') is used as a prefix operator for the filter predicate. Filters can be applied to elements of an Array or fields of a Sequence. A filter predicate consists of one or more filter subexpressions. One or more subexpressions can be specified, using a comma (''',''') as the separator.
;filter subexpression: A simple expression that consists of a single variable/field, a relative operator (=, !=, <, <=, >, >=) for numbers or a comparison operator (=, ~=) for a string (''~='' is a regex compare).
;filter subexpression: A simple expression that consists of a single variable/field, a relative operator (=, !=, <, <=, >, >=) for numbers or a comparison operator (=, ~=) for a string (''~='' is a regular expression compare) and new operators for Arrays and Coverages (<<, >>, @=).
;id: The name of a variable. These can be relative or absolute. Absolute names use the FQN proposal (See ).
;id: The name of a variable. These can be relative or absolute. Absolute names use the FQN proposal (See ).
;domain, range: A function is a mapping from a  set of ''domain'' values to a set of ''range'' values; in a ''discrete function'', these sets are finite.
;discrete coverage: A ''discrete coverage'' is a (discrete) function where the indices of the arrays that hold the ''domain'' and ''range'' values have a one-to-one and onto mapping, with the important exception that in cases where the dimension of the arrays containing the domain values can be reduced without loosing information, that is done. This is purely an implementation optimization, but where applicable, it is nearly universally used. In DAP4 we call discrete coverages simply ''coverages'' or ''grids''.
;coverage: Synonymous with ''grid'' in this document.


== Subsetting Constraints ==
== Subsetting Constraints ==
The simplest constraint is the null string and it means 'return everything' from the dataset. Choosing variables in a dataset is referred to as the ''subset''. To choose a subset of the variables in a dataset, enumerate them in a semicolon-separated list. To choose parts of a Structure, name those parts explicitly using the syntax ''structure name{field name}''. Each DAP4 dataset contains one or more Groups; the top-level Group is always present and is named ''/'' (pronounced 'root'). If the root Group is the only Group in the dataset, it does not need to be named when listing variables in the CE. However, if there are other Groups in the dataset, each Group other than the root Group must be named. In any case, naming the root Group is optional.
The simplest constraint is the null string and it means 'return everything' from the dataset. Choosing variables in a dataset is referred to as the ''subset''. To choose a subset of the variables in a dataset, enumerate them in a semicolon-separated list. To choose parts of a Structure, name those parts explicitly using the syntax ''structure name{field name}'' or ''structure name.field name''. Each DAP4 dataset contains one or more Groups; the top-level Group is always present and is named ''/'' (pronounced 'root'). If the root Group is the only Group in the dataset, it does not need to be named when listing variables in the CE. However, if there are other Groups in the dataset, each Group other than the root Group must be named. In any case, naming the root Group is optional.


Names are case sensitive.
Names are case sensitive.


==== Example: subsetting by variable or field ====
==== Example: subsetting by variable or field ====
<source lang="xml">
<Dataset name="vol_1_ce_1"
  dapVersion="4.0"
  dmrVersion="1.0"
  xml:base="file:dap4/test_ce_1.xml"
  xmlns="http://xml.opendap.org/ns/DAP/4.0#"
  xmlns:dap="http://xml.opendap.org/ns/DAP/4.0#">
  <Int32 name="u"/>
  <Int32 name="v"/>
  <Structure name="Point">
    <Int32 name="x"/>
    <Int32 name="y"/>
  </Structure>
</Dataset>
</source>
'''Note''': The syntax used for the examples is (hopefully) easier to read than the DAP4 DMR which uses XML; Curly braces indicate hierarchy.
'''Note''': The syntax used for the examples is (hopefully) easier to read than the DAP4 DMR which uses XML; Curly braces indicate hierarchy.
<source lang="c">
<source lang="c">
Dataset {
Dataset {
Line 36: Line 73:
         Int32 y;
         Int32 y;
     } Point;
     } Point;
} projections;
} vol_1_ce_1;
</source>
</source>
'''Note''': Variable names are case-sensitive.


; Access just ''u'': ''u''
; Access just ''u'': ''u''
; Access just ''u'' and ''v'': ''u;v''
; Access just ''u'' and ''v'': ''u;v''
; Access just ''x'' within ''Point'': ''Point{x}'' This notation is based on the use of brackets in [[DAP4: Proposal for Structure Projection]] and [[DAP4: DAP4 Filter Constraints]] with the exception that braces ('''''{}''''') are likely easier to parse than brackets ('''''[]''''') given that arrays of both Structure and Sequence are possible and thus with arrays of these structures the grammar that defines the constraint expression syntax would become context sensitive.
; Access just ''x'' within ''Point'': ''Point{x}'' This notation is based on the use of brackets in [[DAP4: Proposal for Structure Projection]] and [[DAP4: DAP4 Filter Constraints]] with the exception that braces ('''''{}''''') are likely easier to parse than brackets ('''''[]''''') given that arrays of both Structure and Sequence are possible and thus with arrays of these structures the grammar that defines the constraint expression syntax would become context sensitive.
; Equivalent expression to access just ''x'' within ''Point'':  ''Point.x''
; Access ''u'' and ''v'' by explicitly naming their Group: ''/u;/v''. Every dataset in DAP4 has a root Group, written ''/''. When that is the only Group in a dataset, it is implicit in the CE, but you can still use its name explicitly.
; Access ''u'' and ''v'' by explicitly naming their Group: ''/u;/v''. Every dataset in DAP4 has a root Group, written ''/''. When that is the only Group in a dataset, it is implicit in the CE, but you can still use its name explicitly.


<source lang="xml">
<Dataset name="vol_1_ce_2">
  <Int32 name="u"/>
  <Int32 name="v"/>
  <Group name="inst2">
    <Int32 name="u"/>
    <Int32 name="v"/>
    <Structure name="Point">
      <Int32 name="x"/>
      <Int32 name="y"/>
    </Structure>
  </Group>
</Dataset>
</source>
<source lang="c">
<source lang="c">
Dataset {
Dataset {
Line 58: Line 108:
} Point;
} Point;
   } inst2;
   } inst2;
} Explicit_Group;
} vol_1_ce_2;
</source>
</source>


Line 64: Line 114:
; Access  'top-level' ''u'' and ''v'' and ''inst2'''s ''u'' and ''v'': ''/u;/v;/inst2/u;/inst2/v'', or ''u;v;/inst2/u;/inst2/v''
; Access  'top-level' ''u'' and ''v'' and ''inst2'''s ''u'' and ''v'': ''/u;/v;/inst2/u;/inst2/v'', or ''u;v;/inst2/u;/inst2/v''
; Access ''inst2'''s ''u'' and ''v'': ''/inst2/u;/inst2/v''
; Access ''inst2'''s ''u'' and ''v'': ''/inst2/u;/inst2/v''
; Access ''Point'' 's ''x'', which is inside the ''inst2'' Group: ''/inst2/Point{x}''
; Access ''Point'' 's ''x'', which is inside the ''inst2'' Group: ''/inst2/Point{x}'' or ''/int2/Point.x''.
 
'''Notes'''
* Variable names are case-sensitive.
* Using a semicolon is a change from DAP2 where clauses in the ''project part'' of the constraint were separated using a comma ('',''). We used semicolon because the comma is used elsewhere and using comma here made for a convoluted grammar. We wanted the grammar to be LALR(1) so that both table-driven and recursive-descent parsers would be easy to write.because it's easy to make both table and recursive descent parsers for these.
* Every name in a constraint should be a fully qualified name. As a notational simplification, we assume that non-qualified names are actually at the top dataset level (i.e., in the root group). For example, 'u' is really '/u'.


== Array Subsetting in Index Space ==
== Array Subsetting in Index Space ==
Subsetting fixed-size arrays in their ''index space'' is accomplished using square brackets. For an array with ''N'' dimensions, ''N'' sets of brackets are used, even if the array is only subset on some of the dimensions. The names of array variables are fully qualified names (FQNs) so it's possible to name arrays in structures and/or Groups. Array index values are ''zero-based'' as with a number of programming languages such as C and Java. Within the square brackets, several subexpressions are allowed:
Subsetting fixed-size arrays in their ''index space'' is accomplished using square brackets. For an array with ''N'' dimensions, ''N'' sets of brackets are used, even if the array is only subset on some of the dimensions. The names of array variables are fully qualified names (FQNs) so it's possible to name arrays in structures and/or Groups. Array index values are ''zero-based'' as with a number of programming languages such as C and Java. Every array has a known starting index value of zero. Within the square brackets, several subexpressions are allowed:
; [ ''n'' ] : return only the value(s) at a single index, where 0 <= n < N for a dimension of size ''N''.
; [ ''n'' ] : return only the value(s) at a single index, where 0 <= n < N for a dimension of size ''N''. This slicing operator does not reduce the dimensionality of an array, but does return a dimension size of one for the dimension to which this is applied.
; [ ] : return all of elements elements for a particular dimension
; [ ] : return all of elements elements for a particular dimension ''or'' apply a shared dimension slice (more on this later).
; [ ''start'' <nowiki>:</nowiki> ''step'' <nowiki>:</nowiki> ''stop'' ] : return every ''step'' value between ''start'' and ''stop''. This is the complete version of the syntax.
; [ ''start'' <nowiki>:</nowiki> ''step'' <nowiki>:</nowiki> ''stop'' ] : return every ''step'' value between ''start'' and ''stop''. This is the complete version of the syntax.
; [ ''start'' <nowiki>:</nowiki>  ''stop'' ] : return the values between''start'' and ''stop'' (''start'' and ''stop'' define a closed interval).
; [ ''start'' <nowiki>:</nowiki>  ''stop'' ] : return the values between''start'' and ''stop'' (''start'' and ''stop'' define a closed interval).
; [ <nowiki>:</nowiki>  ''stop'' ] : return the values from the start of the dimension to ''stop''.
; [ ''start'' <nowiki>:</nowiki> ] : return the values from ''start'' to the end of the dimension.
; [ ''start'' <nowiki>:</nowiki> ] : return the values from ''start'' to the end of the dimension.
; [  <nowiki>:</nowiki> ''step''<nowiki>:</nowiki>  ''stop'' ] : return every ''step''' value from the start of the dimension to ''stop''.
; [ ''start'' <nowiki>:</nowiki> ''step'' <nowiki>:</nowiki> ] : return every ''step'' value from ''start'' to the end of the dimension.
; [ ''start'' <nowiki>:</nowiki> ''step'' <nowiki>:</nowiki> ] : return every ''step''' value from ''start'' to the end of the dimension.
; [ <nowiki>:</nowiki> ''step'' <nowiki>:</nowiki> ] : return every ''step'' value for the dimension.


The subsetting operator can be applied to one or more arrays in the CE.
The subsetting operator can be applied to any array.


==== Example: Subsetting in Index Space ====
==== Example: Subsetting in Index Space ====
<source lang="xml">
<Dataset name="vol_1_ce_3">
 
  <Int32 name="u">
    <Dim size="256"/>
    <Dim size="256"/>
  </Int32>
  <Int32 name="v">
    <Dim size="256"/>
    <Dim size="256"/>
  </Int32>
  <Structure name="Point">
    <Int32 name="x"/>
    <Int32 name="y"/>
    <Dim size="256"/>
  </Structure>
</Dataset>
</source>
<source lang="c">
<source lang="c">
Dataset {
Dataset {
Line 89: Line 159:
         Int32 y;
         Int32 y;
     } Point[256];
     } Point[256];
} arrays;
} vol_1_ce_3;
</source>
</source>


; Access all of ''u'': ''u''
; Access all of ''u'': ''u''
; Access all of ''Point'' 's ''x'' field: ''Point{x}''. This returns an array of Structures with a single (Int32) element, not and array of Int32. Some would ''Point{x}[]'' and ''Point{x}[0:1:255]''
; Access all of ''Point'' 's ''x'' field: ''Point{x}'' or ''Point.x''. This returns an array of Structures with a single (Int32) element, not an array of Int32.
; Access elements 10 to 20 of array ''Point'': ''Point[9:19]''. DAP4, like DAP2, uses zero-based indexes. This CE will return the 10th to the 20th elements (Structures in this case) of the array
; Access elements 10 to 20 of array ''Point'': ''Point[9:19]''. DAP4, like DAP2, uses zero-based indexes. This CE will return the 10th to the 20th elements (Structures in this case) of the array
; Access every 4th element in the ''Point'' array: ''Point[0:4:255]'', or ''Point[:4:]''. This is a simple decimation operation; this CE would return 64 Structures corresponding to elements at indexes 0, 3, 7, ..., 255 of the array.
; Access every 4th element in the ''Point'' array: ''Point[0:4:255]'', or ''Point[0:4:]''. This is a simple decimation operation; this CE would return 64 Structures corresponding to elements at indexes 0, 3, 7, ..., 255 of the array.
; The index-space and field subsetting may be combined in the logical way: ''Point{x}[:4:]'' will return an array of structures (with 64 elements) named ''Point'' that contains a single ''Int32'' field named ''x''.
; The index-space and field subsetting may be combined in the logical way: ''Point[0:4:]{x}'' will return an array of structures (with 64 elements) named ''Point'' that contains a single ''Int32'' field named ''x''. Note that <del>''Point[0:4:].x''</del> is not accepted
; Access parts of ''u'' and ''v'': ''u[4:2:9];v[4:2:9]''
; Access parts of ''u'' and ''v'': ''u[4:2:9];v[4:2:9]''


Other possible CEs:  
Other possible CEs:  
; ''u[:4:][:4:]'': every fourth element in both dimentsions; this would return 1/16^th of the array's data.
; ''u[0:4:][0:4:]'': every fourth element in both dimensions; this would return 1/16<sup>th</sup> of the array's data.
; ''u[][9:19]'': elements corresponding to every row and columns 10 to 20
; ''u[][9:19]'': elements corresponding to every row and columns 10 to 20
; ''u[7][9:19]'': elements corresponding to the 8^th row and columns 10 to 20
; ''u[7][9:19]'': elements corresponding to the 8<sup>th</sup> row and columns 10 to 20
; ''u[9:19][9:19]'': elements corresponding to rows 10 to 20 and colums 10 to 20.
; ''u[9:19][9:19]'': elements corresponding to rows 10 to 20 and columns 10 to 20.
; ''u[:19][:19]'': elements corresponding to rows 0 to 20 and columns 0 to 20.
; ''u[0:19][0:19]'': elements corresponding to rows 0 to 20 and columns 0 to 20.
; ''u[][]'': identical to ''u''.
; ''u[][]'': identical to ''u'', as are ''u[0:][0:]'' and ''u[0:1:][0:1:]''.


==== More complex subsetting examples ====
==== More complex subsetting examples ====
The data model for DAP4 is very similar to that of a modern structured programming language where ''constructor types'' like ''Structure'' may contain any allowed type, including other Structures and arrays of Structures as well as being arrays themselves. The basic syntax outlined so far for Structures, the selection of fields within a Structure and array dimension subsetting by index can be applied to these 'recursive' types by following the rules laid out in the basic cases. Some examples follow:
The data model for DAP4 is very similar to that of a modern structured programming language where ''constructor types'' like ''Structure'' may contain any allowed type (including other Structures, etc.) as well as being arrays themselves. The basic syntax for subsetting outlined so far can be applied to the fields of a Structure using braces to enclose the subsetting expression that apply to the fields of the Structure. This can be applied recursively.
 
<source lang="xml">
<Dataset name="vol_1_ce_4">
  <Int32 name="u">
    <Dim size="256"/>
    <Dim size="1024"/>
  </Int32>
  <Structure name="Point">
    <Int32 name="x"/>
    <Int32 name="y">
      <Dim size="256"/>
    </Int32>
    <Int32 name="z">
      <Dim size="1024"/>
    </Int32>
    <Dim size="256"/>
  </Structure>
</Dataset>
</source>


<source lang="c">
<source lang="c">
Line 116: Line 205:
         Int32 x;
         Int32 x;
         Int32 y[1024];
         Int32 y[1024];
        Int32 z[256];
     } Points[256];
     } Points[256];
} example;
} vol_1_ce_4;
</source>
</source>


; ''Points{y[7:256]}'': Get all of the elements of the Array of Structure ''Points'' and for each of those elements get the elements 7 to 256 from the array ''y''. Do not return the field ''x''.  
; ''Points{y[7:256]}'' or ''Points.y[7:256]'': Get all of the elements of the Array of Structure ''Points'' and for each of those elements get the elements 7 to 256 from the array ''y''. Do not return the field ''x''.
; ''Points{y[0:9]}[0:9]'': Get the first ten elements of ''Points'' and, for each of those, only the  first ten elements of the array ''y''.
; ''Points[0:9]{y[0:9]}'' or ''Points[0:9].y[0:9]'': Get the first ten elements of ''Points'' and, for each of those, only the  first ten elements of the array ''y''.
; ''Points[0:9]{x;y[0:9]}'': Get the first ten elements of ''Points'' and, for each of those, only ''x' and the first ten elements of the array ''y''.
; ''Points[0:9]'': Get the first ten elements of ''Points'' (both fields are included)
; ''Points[0:9]'': Get the first ten elements of ''Points'' (both fields are included)
; ''Points'': Get all of ''Points''
; ''Points'' or ''Points[]'' or ''Points[0:]'': Get all of ''Points'' with the subtle difference that if ''Points'' uses a shared dimension, the last of the three CEs will replace that with an anonymous dimension (see the section on shared dimensions, below).
 
<source lang="xml">
<Dataset name="vol_1_ce_5">
  <Int32 name="u">
    <Dim size="256"/>
    <Dim size="1024"/>
  </Int32>
  <Structure name="Points">
    <Int32 name="x"/>
    <Int32 name="y"/>
    <Structure name="sounding">
      <Int32 name="height">
        <Dim size="1024"/>
      </Int32>
      <Int32 name="pressure">
        <Dim size="1024"/>
      </Int32>
    </Structure>
   
    <Dim size="256"/>
  </Structure>
</Dataset>
</source>


<source lang="c">
<source lang="c">
Line 136: Line 250:
         } sounding;
         } sounding;
     } Points[256];
     } Points[256];
} example;
} vol_1_ce_5;
</source>
</source>


; ''Points{x,y,sounding{height[:8:]}[0]'': Get only the first element of ''Points'', get the fields ''x'', ''y'' and ''sounding'' but for ''sounding'' get only every 8<sup>th</sup> element of the field ''height''
; ''Points[0]{x,y,sounding{height[0:8:]}}'': Get only the first element of ''Points'' and, for that, get the fields ''x'', ''y'' and ''sounding'' but for ''sounding'' get only every 8<sup>th</sup> element of the field ''height'' and elide the field ''pressure''. An equivalent way of writing this expression is ''Points[0]{x,y,sounding.height[0:8:]}''. The ''{}'' syntax provides an easy way to request ''x'', ''y'' and ''sounding.height[0:8:]'' without having to repeat ''Points[0]'' three times. A CE like ''Points[0].x;Points[0].y;Points[0].soundings.height[0:8:]'' is legal, but ''Points[0]'' will only appear once in the result and a CE where ''Points'' is sliced differently is not legal. That is, <del>''Points[0].x;Points[0:10].y;Points[15].soundings.height[0:8:]''</del> is not legal because ''Points'' can appear only once in the result but has been sliced three different ways in the CE. In any CE, each variable can be constrained only one way.


=== How Sequences fit into this syntax ===
=== How Sequences fit into this syntax ===
The ''Sequence'' type has been added back into the DAP4 type system (aka data model) as a way to encode what the Common Data Model (CDM) encodes using ''varying dimensions'' for arrays (see DAP4: VLEN proposal). As such, Sequence will be a more general data type than in DAP2 where it was significantly limited. In DAP4 Arrays of Sequences will be allowed as will Sequence elements that are themselves Arrays. Thus, Sequences will be a completely general data type, able to hold instances of any type and able to be instances of any type - of course there are only two constructor types: Sequence and Structure. However, Sequences have not only the ''projection'' capabilities but also filtering by value.
The ''Sequence'' type is more general data type in DAP4 than in DAP2 where it was significantly limited. In DAP4 Arrays of Sequences will be supported as will Sequence elements that are themselves Arrays. A Sequence variable is conceptually like a table where each field in the Sequence is a column in the table (or like an array of Structures, where the size of the single array dimension is a secret). Note that while there is a big difference between the value held by a Structure and a Sequence, each has the same subsetting syntax in the CE (although Sequences may have filters applied while Structures may not).


The ''by value'' filter is indicated by the 'pipe' symbol ('''''|'''''). It provides one or more expressions, separated by commas. The filter is ''true'' when all of the subexpressions are true. Like the brace and bracket syntactical elements, symbols in the expressions that are part of the filter may be either FQNs or identifiers in the scope of the variable to which the expression is associated.
<source lang="xml">
<Dataset name="vol_1_ce_6">
  <Sequence name="s1">
    <Int32 name="x"/>
    <Int32 name="y"/>
  </Sequence>
 
  <Sequence name="s2">
    <Int32 name="x"/>
    <Int32 name="y"/>
    <Dim size="100"/>
  </Sequence>
 
  <Sequence name="s3">
    <Int32 name="z"/>
    <Int32 name="x">
      <Dim size="10"/>
    </Int32>
  </Sequence>
 
  <Sequence name="s4">
    <Int32 name="z"/>
    <Int32 name="x">
      <Dim size="1024"/>
    </Int32>
    <Dim size="100"/>
  </Sequence>
 
</Dataset>
</source>


<source lang="c">
<source lang="c">
Line 159: Line 302:


     Sequence {
     Sequence {
        Int32 z;
         Int32 x[10];
         Int32 x[10];
        Int32 z;
     } s3;
     } s3;


     Sequence {
     Sequence {
        Int32 z;
         Int32 x[1024];
         Int32 x[1024];
     } s4[100];
     } s4[100];
Line 171: Line 315:
; ''s1'': All of Sequence ''s1''.
; ''s1'': All of Sequence ''s1''.
; ''s1{x;y}'': Also all of Sequence ''s1''.
; ''s1{x;y}'': Also all of Sequence ''s1''.
; ''s1{x}'': every 'row' of Sequence ''s1'', but just field ''x''.
; ''s1{x}'' or ''s1.x'': every 'row' of Sequence ''s1'', but just field ''x''.
; ''s1{x;y}|x<7,y<9'': All of ''s1'' where the fields ''x'' and ''y'' satisfy the given filter expression (filtering is covered in detail on a subsequent section of this document).
; ''s2{x;y}'': All one hundred elements of the Array ''s2''. Same as ''s2'' and ''s2[0:99]{x,y}'' or ''s2[]{x;y}''.
; ''s2{x;y}'': All one hundred elements of the Array ''s2''. Same as ''s2'' and ''s2{x,y}[0:99]''.
; ''s2[0:9]{x;y}'': The first ten elements of ''s2''. That would be 10 Sequences and for each, both the fields ''x'' and ''y''.
; ''s2{x;y}[0:9]'': The first ten elements of ''s2''. That would be 10 Sequences and for each, both the fields ''x'' and ''y''.
; ''s3{} | z < 10'': Every instance of the Sequence ''s3'' where z is less than 10. Note that this is the first example of a ''filter'', a topic that is discussed in much more detail later on.
; ''s2{x;y}[0:9]|x<8'': The first ten elements of ''s2'' where only the 'rows' where x<8 are included.
 
<!--
=== Anonymous Grouping and Array Subsetting ===
''Anonymous grouping'' provides a way to apply the same index subsetting to a set of arrays, as long as they are related in certain ways. To be used in this kind of subsetting, it must be possible to take one set of index subsetting constraints and use it with a set of arrays and for this to be true, one of two conditions must be met. If all of the arrays in an ''anonymous group'' are of the same ''shape'' (i.e., they have the same rank and each dimension has the same size), then it is trivial to apply one index subset expression to each array and return the result. If, alternatively, the collection of arrays in the anonymous group are related such that they form a ''grid'' then it is also possible to apply the index subset to the set of arrays in a logically consistent way. Note that the term ''anonymous group(ing)'' is riff on the idea that the notation already introduced for Structures and Sequence is a kind of grouping of the named Structure or Sequence's fields.
 
==== Example of anonymous grouping syntax ====
<source lang="c">
Dataset {
    Float32 lat[100][100];
    Float32 lon[100][100];
    Byte SST[100][100];
    Int64 WindSpeedU[100][100];
    Int64 WindSpeedV[100][100];
} example;
</source>
 
; ''{SST;lat;lon}[0:9][10:19]'': This will return ''SST'', ''lat'' and ''lon'' subset as instructed.
; ''{SST;lat;lon;WindSpeedU,WindSpeedV}[0:9][10:19]'': Similarly, this will return the five arrays all subset with the common index subset of ''[0:9][10:19]''.
-->


=== Subsetting and Shared Dimensions ===
=== Subsetting and Shared Dimensions ===
Shared dimensions provide a way to indicate that two or more variables are share equivalent extents. In DAP2, the type Grid was used for this, but its limitations were significant. In DAP4, ''maps'' (aka ''coordinate variables'') can be shared and can be of any dimension (although there are limitations on how they can be used with other maps). For this section, I will extend the notation used so far:
''Shared Dimensions'' provide additional information to indicate that a group of arrays share certain relationships; that specific groups of the arrays form ''coverages''  by indicating how dimensions of ''Maps'' and ''Arrays'' are linked. The DAP4 CE syntax provides a way to slice a Shared Dimension so that slice can be used by all of the ''Maps''/''Arrays'' that use it without repeating the slicing operation for each Map or Array. The syntax can be read 'Assign the shared dimension ''X'' this slice and looks like ''row=[10:19]''. The only difference syntactically between a shared dimension slice and the slice operator applied to a one  dimensional array is the assignment operator (''=''). All of the variations of the slice operator possible for an array are accepted for shared dimension slicing. In any CE, all of the shared dimension slicing clauses must precede the variable subsetting clauses.
* The keyword ''Dimensions'' introduces a list of symbols and their sizes. (That is the definition of a dimension in DAP4, a size bound to an identifier.)
 
* Arrays where every dimension uses a ''Dimension'' to supply its extent.
'''Note''' DAP4 uses XML for it's actual grammar, and because that's wordy this document includes a mock notation. I will extend that notation used so far so it includes concepts needed to mimic DAP4's notation for a coverage:
* Arrays that use parenthesis '''''()''''' in place of braces to indicate the sizes of dimensions and which ''use the names of maps'' to do so ''for at least one dimension''.
* The keyword ''Dimensions'' introduces a list of symbols and their sizes. (That is the definition of a Dimension in DAP4; a size bound to an identifier.)
* Arrays where every dimension uses a ''Dimension'' to supply its extent are ''Maps''. Maps are the arrays that hold the ''domain'' values for a ''coverage''.
* Arrays that use parenthesis '''''()''''' in place of brackets to indicate the sizes of dimensions and which ''use the names of maps'' to do so ''for at least one dimension'' hold the ''range'' values for the coverage. These are the coverage's ''data array'' (aka ''array'' as distinct from ''maps'').
* As with the previous examples, both the official and the mock syntax are shown and these datasets are available from the OPeNDAP test server.


==== Example of this syntax ====
==== Example of this syntax ====
<source lang="xml">
<Dataset name="vol_1_ce_7">
  <Dimension name="nlat" size="100"/>
  <Dimension name="nlon" size="50"/>
 
  <Float32 name="lat">
    <Dim name="nlat"/>
  </Float32>
  <Float32 name="lon">
    <Dim name="nlon"/>
  </Float32>
 
  <Float32 name="temp">
    <Dim name="nlon"/>
    <Dim name="nlat"/>
    <Map name="lat"/>
    <Map name="lon"/>
  </Float32>
  <Float32 name="sal">
    <Dim name="nlon"/>
    <Dim name="nlat"/>
    <Map name="lat"/>
    <Map name="lon"/>
  </Float32>
 
  <Float32 name="O2">
    <Dim name="nlat"/>
    <Dim name="nlon"/>
    <Map name="lon"/>
    <Map name="lat"/>
  </Float32>
 
  <Float32 name="CO2">
    <Dim name="nlon"/>
    <Dim name="nlat"/>
    <Dim size="10"/>
    <Map name="lat"/>
    <Map name="lon"/>
  </Float32>
 
</Dataset>
</source>
<source lang="c">
<source lang="c">
Dataset {
Dataset {
Dimensions: nlat=10, nlon=5; // the sizes of the dimensions
    Dimensions: nlat=100, nlon=50;  
Float32 lat[nlat];             // This is a (potential) map
    Float32 lat[nlat];
Float32 lon[nlon];           // As is this
    Float32 lon[nlon];


Float32 temp(lon)(lat);      // The maps ''lat'' and ''lon'' are used here and define a ''grid'' (aka ''coverage'')
    // The maps ''lat'' and ''lon'' are used here and define a coverage
        Float32 sal(lon)(lat)
    Float32 temp(lon)(lat);
    Float32 sal(lon)(lat);
    Float32 O2(lat)(lon);
    Float32 CO2(lon)(lat)[10];
} shared_dimensions;
} shared_dimensions;
</source>
</source>


===== Coherent Array Subsetting =====
==== Examples of the anonymous grouping syntax and coverage subsetting by index ====
Coherent array subsetting provides a way to apply the same index subsetting to a set of arrays, as long as they are related in certain ways. To be used in this kind of subsetting, it must be possible to take one set of index subsetting constraints and use it with a set of arrays. If all of the arrays were of the same rank where all their dimensions were the same size, this would be easy. To be used in coherent array subsetting, the arrays must either be maps or grids. If they are grids, each must have ''exactly'' the same shape; the same rank, dimension size, order and must use the same maps for the same dimensions. For maps to be used in this subsetting operation, they must be maps used by the girds. There must be at least one grid; there may be zero or more maps. The return type will include relevant Dimension objects if there are maps in the set of arrays. The syntax for this borrows from the brace and bracket notations already introduced.
 
; ''nlat=[0:9];nlon=[10:19];lat; lon; temp'': This will return Dimensions nlat=10, nlon = 10, ''lat'', ''lon'' and ''temp'' such that lat an lon are 10 element vectors and ''temp'' is a 10 x 10 array.
; ''nlat=[0:9];nlon=[10:19];lat; lon; temp; sal'': Same as above, but with both ''temp'' and ''sal'' included. This example shows how two or more arrays holding range variables can be accessed along with their Maps without sending multiple copies of the Maps. Similarly, ...
; ''nlat=[0:9];nlon=[10:19];lat; lon'': This CE requests just the arrays that hold the domain values, while ...
; ''nlat=[0:9];nlon=[10:19];temp; sal'': This CE requests just the arrays that hold the range values. Taken together, the two preceding examples support clients that read the domain values first and then display a map (for example) providing a way for someone to view the data's geographical extent before accessing the values them selves. Also note that there is no restriction that the same shared dimension slices must be used for both requests; like DAP2, each request in DAP4 is ''stateless''.
; ''nlat=[0:9];nlon=[10:19];temp[][]; sal[][]'': This CE requests exactly the same data as the previous one, but uses the ''[]'' notation to indicate that the shared dimensions should be used for the subset. An example below shows how this notation can be used to mix local and shared dimension slicing.
; ''nlat=[0:4:];nlon[0:4:];CO2'': This CE decimates ''CO2'' by returning every fourth value in the first two dimensions
; ''nlat=[0:4:];nlon[0:4:];CO2[][][0:4:]'': This CE introduces the second meaning for ''[]''. When the empty braces are used for a dimension that corresponds to a shared dimension, it means ''use the shared dimension slice''. This is useful because some arrays contain a mixture of shared and anonymous dimensions and it's desirable to slice both, using a shared dimension slice previously defined where applicable and an anonymous slice where that's needed. This expression will decimate ''CO2'' by four in each of its three dimensions.
; ''nlat=[0:4:];nlon[0:4:];CO2[][1][0:4:]'': To override the slicing provided by a shared dimension slice, simply replace the ''[]'' with a local dimension slice.
 
== Constrained DMR Objects ==
 
When a DAP4 server receives a request for a Data response, it must build and return a Data Response Document that contains a text/xml part containing a DMR, a separator and a binary part that contains the data values. The organization of the Data Response Document is described in detail elsewhere in this document. In this section the focus is on the DMR returned in the first part of the response and how it relates to the DMR for the original unconstrained dataset. We refer to the original dataset's DMR as the ''DMR'' and the DMR associated with the data response as the ''CDMR'' (short-hand for Constrained DMR), although a data response can be generated using a null CE, we consider that a constraint, too.
 
The DMR contains a number of declarations for the dataset: Enumerations, Dimensions, Attributes, Groups and Variables. Each DMR and CDMR must follow the rules for the DMR described in this specification and, because DAP4 is a stateless protocol, each response from a server must stand on its own. Since a Constraint Expression alters the data returned (limiting variables, changing the size of dimensions and so on), it stands to reason that the contents of the CDMR will vary for any given dataset based on the CE. Furthermore, a goal of DAP4 is to specify that the CDMR be 'minimal' containing no unused definitions.
 
Because filters alter the values of variables, but not whether a variable is returned, they have no affect on the CDMR. Only the subsetting operators will be discussed here.
 
=== Enumerations ===
An enumeration is included in the CDMR only if some variable in the CE references it. A null CE returns the entire dataset, so it effectively references every variable. FIXME Make sure that 'variable' and 'field' are defined correctly in the terms section. jhrg 12/31/13
 
=== Shared Dimensions ===
Shared Dimension declarations from the DMR are not included in the CDMR unless the Shared Dimension is used by a variable that has been projected and that variable does not override that shared dimension using a local slicing operation. FIXME Refer to the DMR from the previous section and show some examples. jhrg 12/31/13
 
=== Variables ===
Each clause in the constraint must specify a variable and that variable will be declared in the CDMR. The variable must be referenced by a FQN unless it is declared at the top level of the DMR and the short-hand notation is used where the leading slash (''/'') is elided.
 
==== Array Variables ====
Array variables follow all the rules for ''Variables'' with the additional conditions that their dimensions may appear altered depending on the CE. If the local slicing operations are used, then the sliced dimensions will have the size given be the slice operator, not the size as shown in the full dataset's DMR. If a shared dimension is sliced and the Array uses that slice, then its size will reflect that. Arrays may mix shared dimension slices and local slices and the result must be correctly reflected in the specific variable's declaration.
 
Note that slicing never affects the ''rank'' of an array.


; ''{temp;lat;lon}[0:9][10:19]'': This will return temp[0:9][10:19], lat[0:9], lon[10:19] and Dimensions nlat=10, nlon=10.
==== Structure Variables ====
; ''{temp;lat;lon;sal}[0:9][10:19]'': This will return temp[0:9][10:19], lat[0:9], lon[10:19], sal[0:9][10:19] and Dimensions nlat=10, nlon=10 (i.e., the maps and grids may be listed in any order).
If the variable is a Structure, then either the entire Structure is included or a subset of its fields will be included in the variable declaration where the fields are those specifically mentioned in a constraint projection. As with all other variables, each variable in the structure will have the same rank and type as the original declaration in the DMR.
 
==== Sequence Variables ====
If the variable is a Sequence, then for declaration
purposes, it is treated like a Structure (as above).
Note that applying a filter to a Sequence will not change
its declaration form because the number of records in the
sequence is not specified in the DMR. Note also that mentioning
a Sequence field in the filter does not necessarily mean it will
be included in the DMR. It will only be included if it is mentioned
in the projection part of the constraint.
 
=== Groups ===
Each declaration in the CDMR that corresponds to a declaration
in the DMR will cause its containing group (and that group's parents)
to be included in the CDMR. This ensures that the FQN for a declaration
in the CDMR is the same as in the DMR.
 
=== Attributes ===
Attributes are unaffected by the CE and are simply included in the CDMR, with the stipulation that attributes for variables that are not included in the CDMR won't be part of the CDMR. Essentially DAP4 views those attributes as part of the variables and explicitly excluding the variable from the CDMR (by providing a CE that does not include it) excludes its attributes too.
 
There is one situation that bears mention, however. Many datasets include attributes that describe domain-specific values for their values. For example, imagine a atmospheric profile that includes information about the minimum and maximum temperatures of that profile. If the values are stored in an array and the array is sliced so that only a subset of values are returned, the attributes will provide correct values for the original data ''but possibly not the data returned in the response'' because the slicing operation has removed some of the values of the array. Because DAP4 is a ''domain neutral'' protocol, it has no knowledge about how the values of a specific attribute relate to the values of the variable and cannot adjust the values of the attribute to match the CE.


==Filters ==
==Filters ==
While''subsetting'' provides ways to choose data based on the structure of a dataset, ''filters'' provide a way to choose data based on their values. The values to be returned are denoted using simple predicates. When an array is ''filtered by value'' using a filter, the elements of the array that fail the filter predicate(s) will be replaced with a ''NoData'' (''ND'') value. The array may also have am index subset, in which case that will be applied 'before'' the filter predicates are evaluated for the remaining elements.  
While ''subsetting'' provides ways to choose data based on the dataset structure and the types of the variables, ''filters'' provide a way to choose data based on their values. The values to be returned are denoted using one or more simple predicates. The general syntax for a filter expression is to follow a subset expression with a pipe ('''|''') and one or more filter predicates. Multiple predicates are separated by commas and the value of complete predicate is the logical AND of the comma-separated subexpressions.  


The general syntax for a filter expression is to follow a subset expression with a pipe ('''|''') and one or more filter predicates. Multiple predicates are separated by commas and the value of complete predicate is the logical AND of the comma-separated subexpressions.  
Filter expressions can be applied to Sequence, Array and Coverage variables. In each case the result of the filter operation returns ''the same type'' variable. A Sequence variable is essentially a table of values and thus can be thought of as containing a number of rows and the filter expression is applied to ''each row'' in the order those rows are provided to the expression evaluator. Every row that satisfies the predicate will be included in the value returned; those that don't will not be included in the result. In each case, the filter is applied to each value of the Array or Sequence and the result returned. Note that no new values are computed by these operations; no interpolations, means, etc., are performed. A filter applied to an Array returns an Array with the same shape (the same rank and the same dimension sizes) but the elements that fail the predicate will be replaced with the variable's ''No Data'' value. The predicate is applied to each element of the array iteratively. For an Array, the value to use for No Data must be provided using the special sub-clause ''ND = value''.
 
In the following, the behavior of filtering expressions on Sequences, Arrays and Coverages with be covered in three separate sections.
 
=== Filters on Sequences and Arrays ===


Filter expressions can be applied to both Array and Sequence variables. In each case the result of the filter operation is a value ''in the same type'' variable. A filter applied to an Array returns an array with the same shape (the same rank and the same dimension sizes) but the elements that fail the predicate will be replaces with the variable's No Data value. For an Array, it is possible to supply a value to use for No Data if no such value is part of the dataset's metadata (or that will be used in preference to any such value). A Sequence variable is essentially a table of values and thus can be thought of as containing a number of rows and the filter expression is applied to ''each row'' in the order those rows are provided to the expression evaluator. Every row that satisfies the predicate will be included in the value returned; those that don't will be elided from the result.


==== Example: Filters ====
==== Example: Filters ====
<source lang="xml">
<Dataset name="vol_1_ce_8">
 
  <Float32 name="temp">
    <Dim size="10"/>
  </Float32>
 
  <Float32 name="u">
    <Dim size="256"/>
    <Dim size="256"/>
  </Float32>
  <Float32 name="v">
    <Dim size="256"/>
    <Dim size="256"/>
  </Float32>
  <Sequence name="Points">
    <Int32 name="x"/>
    <Int32 name="y"/>
  </Sequence>
 
</Dataset>
</source>
<source lang="c">
<source lang="c">
Dataset {
Dataset {
Line 221: Line 511:
</source>
</source>


; ''temp|temp<7'': Return all ten elements of ''temp'' but replace those elements that do not satisfy the predicate ''temp < 7'' with the variable's No Data value.
; ''temp|temp<7,ND=NaN'': Return all ten elements of ''temp'' but replace those elements that do not satisfy the predicate ''temp < 7'' with the ''No Data'' ('''ND''') value.
; ''temp|temp<7,ND=0'': Same as the above, but use zero as the value for No Data.
; ''temp|temp<7,ND=0'': Same as the above, but use zero as the value for No Data.
; ''u|u>10,ND=-1e-37'': Return ''u'' with any element that is not ''> 10'' set to the No Data value of -1 * 10<sup>-37</sup>.
; ''u|u>10,ND=-1e-37'': Return ''u'' with any element that is not ''> 10'' set to the No Data value of -1 * 10<sup>-37</sup>.
; ''u|u>10, ND=NaN'': Return ''u'' with any element that is not ''> 10'' set to the No Data value of Nan.
; ''u|u>10,ND=NaN'': Return ''u'' with any element that is not ''> 10'' set to the No Data value of Nan.
; ''u|u>10'': Assume that the variable ''u'' has no ''missing_value'' attribute, then  this will return ''u'' with any element that is not ''> 10'' set to the No Data value of Nan (Nan is the default for No Data value for Float32 and Float64 types).
; ''u[0:2:99][0:2:99]|u>10,ND=NaN'': This shows how filtering can be combined with subsetting by index using this syntax. The index subsetting is performed first and the resulting 50 X 50 array is filtered. The return value is a 50 x 50 Float32 array with elements <= 10 replaced with NaN.
; ''u[0:2:99][0:2:99]|u>10'': This shows how filtering can be combined with subsetting by index using this syntax. The index subsetting is performed first and the resulting 50 X 50 array is filtered. The return value is a 50 x 50 Float32 array with elements <= 10 replaced with NaN (or the variable's ''missing_value'' attribute).


=== Filters and more complex data types ===
=== Filters and more complex data types ===
The basic syntax for filters is that there is a subsetting expression a literal '''|''' and then one or more filter predicates. This syntax can appear any place a ''selection expression'' can appear, so it can be used inside braces when an Array or Sequence is a field of a Structure or Sequence. Some examples follow.
The basic syntax for filters is that there is a subsetting expression, a pipe ('''|''') and then one or more filter predicates. This syntax can appear any place a ''selection expression'' can appear, so it can be used inside braces when an Array or Sequence is a field of a Structure or Sequence. Note that the filter expression prefix operator binds to the scope of the brace immediately to its left. Some examples follow.


==== Example: Filters on complex types ====
==== Example: Filters on complex types ====
<source lang="xml">
<Dataset name="vol_1_ce_9">
  <Sequence name="Points1">
    <Int32 name="x">
      <Dim size="100"/>
    </Int32>
    <Int32 name="y"/>
  </Sequence>
 
  <Sequence name="Points2">
    <Int32 name="x"/>
    <Int32 name="y"/>
    <Sequence name="sounding">
      <Int32 name="depth"/>
      <Int32 name="temp"/>
      <Dim size="20"/>
    </Sequence>
  </Sequence>
 
  <Structure name="Points3">
    <Int32 name="x"/>
    <Int32 name="y"/>
    <Sequence name="raw">
      <Int32 name="depth"/>
      <Int32 name="temps">
        <Dim size="4"/>
      </Int32>
      <Dim size="300"/>
    </Sequence>
  </Structure>
 
</Dataset>
</source>
<source lang="c">
<source lang="c">
Dataset {
Dataset {
Line 249: Line 571:


     Structure {
     Structure {
         Int32 x[;
         Int32 x;
         Int32 y;
         Int32 y;
         Sequence {
         Sequence {
Line 260: Line 582:
</source>
</source>


; ''Points1{x[0:9]|x<10,ND=-127;y}|y<3'': For the Sequence ''Points1'', return the rows of data where ''y'' is less than 3. In those rows, subset ''x'' so that only the first ten elements are included and then filter that so that elements >= 10 are replaced by the No Data value of -127.
; ''Points1{x[0:9]|x<10,ND=-127;y}|y<3'': For the Sequence ''Points1'', return the rows of data where ''y'' is less than 3. In those rows, subset ''x'' so that only the first ten elements are included and then filter that so that elements >= 10 are replaced by the No Data value of -127. Note that the filter sub-expression ''y<3'' needs no ''No Data'' ('''ND''') value because it is applied to the field of a Sequence and not the elements of a array.
; ''Points2[10:20] { x; y; sounding { } | depth > 10 } | 20 < x < 40, y <35'': This selection expression first finds the index subset of ''Points2'' and arranges to return the fields ''x'', ''y'' and ''sounding'' where ''x''and ''y'' satisfy the filter predicate. For the field ''sounding'', which is a Sequence itself, it will return both fields and all rows where ''depth'' is > 10. This example points out an important aspect to the syntax and to expression evaluation: the order of evaluation of the filter predicates happens after the index and  variable and/or field subsetting. The order of evaluation of the complete filter predicates can happen in any order (i.e., the ''20 < x < 40, y <35'' and ''depth > 10'' predicates can happen in any order. The order of evaluation of the filter predicate subexpressions (i.e., ''20 < x < 40'' and  ''y <35'') is also unspecified. Another way to write this selection expression is...
; ''Points2[10:19] { x; y; sounding | depth > 10 } | 20 < x < 40, y <35'': This selection expression first finds the index subset of ''Points2'' and arranges to return the fields ''x'', ''y'' and ''sounding'' where ''x''and ''y'' satisfy the predicates ''20 < x < 40'' and ''y <35'' and for the field ''sounding'', which is a Sequence itself, it will return both fields where ''depth > 10''. This example points out an important aspect to the syntax and to expression evaluation: the order of evaluation of the filter predicates happens after the index and  variable and/or field subsetting. The order of evaluation of the complete filter predicates can happen in any order (i.e., the ''20 < x < 40, y <35'' and ''depth > 10'' predicates can happen in any order. The order of evaluation of the filter predicate subexpressions (i.e., ''20 < x < 40'' and  ''y <35'') is also unspecified. Another way to write this selection expression is...
; ''Points2[10:20] | 20 < x < 40, y <35, sounding.depth > 10'': The same result as above...
; ''Points2[10:19] | 20 < x < 40, y <35, sounding.depth > 10'': The same result as above...
; ''Points3[3:2:8] {x; y; raw{temps[2]} | temps[2]>7,ND=-1'': In this expression the ''temps'' field of the Sequence ''raw'' is still an Array, it's just an Array with a single element, which illustrates that neither the subsetting nor filtering operations alter the types of the variables.
; ''Points3[3:2:8] {x; y; raw{temps[2] | temps > 7,ND=-1}}'': In this expression the ''temps'' field of the Sequence ''raw'' is still an Array, it's just an Array with a single element, which illustrates that neither the subsetting nor filtering operations alter the types of the variables.


=== Filters and grid variables ===
<!-- === Filters and predicates that apply only to Arrays or Coverages ===
The ''anonymous grouping'' notation used to apply a uniform set of constraints to ''arrays'' and ''maps'' can be extended to include filters. However, the combination of an array and a set of maps has two kinds of values that can be filtered. A ''discrete coverage'' is a (discrete) function where the arrays called ''maps'' in DAP4 hold the set of ''domain'' values and the array named the ''array'' in DAP4 holds the range values. In DAP4 we call discrete coverages ''grids''. The corresponding indexes of the array and maps express the relation between the domain and range values. Because a ''grid'' essentially has two sets of values, and because selection by value is so important for this data type, there are two kinds of filters that can be applied to a coverage. First, a filtering operation can be applied to the ''array'' part of the ''grid'' and this provides a way to subset a grid using the index subsetting operations and then filter the array to get a masked result. Similarly, a grid's ''maps'' may be filtered (subsequent to index subsetting) and the resulting masked map arrays returned. It is easy to see that these filtering operations fall in line with the ones on 'plain' arrays. Finally, an extra filtering operation is supported for grids where the values of the maps can be 'rubber-banded' using a filter and then the resulting information used to form an index subset for the grid (i.e., both the array and the maps). This operation acknowledges an operation that many users routinely perform. At the same time, the subsetting does not take into account domain-specific aspects of the data in the maps (e.g., taking into account the specific nature of latitude and longitude).
The Mask (''*='') and Rubber band (''<<'', ''>>'') predicates are used to filter and/or subset one array based on the values of another. In each case there are limitations on the two (or more) arrays that are described below. Both of these are intended to be used primarily with coverages where some of the arrays involved are Maps (Arrays that hold the coverage's Domain values) and others hold the coverage's Range values. However, the requirement is only strictly enforced for the Rubber band operators; the Mask operator can be used on two or more arrays if they are isomorphic. -->


==== Example: Simple filters on grids ====
=== The Mask operator may be applied only to an Array ===
For this predicate, the syntax is ''dest *= source'' where any element<sub>'''i,j'''</sub> in the ''source'' array with the value equal to the given ''No Data'' value will result in the corresponding element of the ''dest'' array being set to that array's ''No Data'' value. Other values in the ''dest'' array are not affected. The intent of this is that filtering performed on a map array can then be applied to a data array that uses the map (although there's no actual restriction that this be only used with coverages; any two isomorphic arrays in an anonymous group can be used). To use the Mask (''*='') predicate, the array on the Right Hand Side of the sub-clause must have been previously filtered (and so must have a given ND value).
 
<source lang="xml">
<Dataset name="vol_1_ce_10">
  <Float32 name="lat">
    <Dim size="100"/>
    <Dim size="50"/>
  </Float32>
 
  <Float32 name="lon">
    <Dim size="100"/>
    <Dim size="50"/>
  </Float32>
 
  <Float32 name="temp">
    <Dim size="100"/>
    <Dim size="50"/>
  </Float32>
</Dataset>
</source>
<source lang="c">
<source lang="c">
Dataset {
Dataset {
Dimensions: nlat=100, nlon=50; // the sizes of the dimensions
Float32 lat[100][50];
Float32 lat[nlat];             // This is a (potential) map
Float32 lon[100][50];
Float32 lon[nlon];           // As is this
Float32 temp[100][50];
} vol_1_ce_10;
</source>
 
; ''lat | lat < 20, ND=-255; temp | temp *= lat, ND=-255'': Filter ''lat'' so that all values  < 20 are replaced with the No Data value, then use that as a mask, making all of the corresponding elements of 'temp'' also be No Data. Note that this operator forces an order on filter evaluation.
; ''lat[0:20][0:20] | lat < 20, ND=-255; lon[0:20][0:20] | -100 < lon < -80, ND=-255; temp[0:20][0:20] | temp *= lon, temp *= lat, ND=-255'': This will result in ''temp'' effectively being masked by the logical AND of ''lat'' and ''lon''
 
==== Filters and Coverages ====
The same filtering operations that can be applied to simple arrays can be applied to a coverage by simply using its Maps in the Constraint expression (i.e., the subset and filter sub-expressions). The array filtering operation previously described can easily be applied to a coverage. Below we show two cases; where the maps are vectors and where they are two-dimensional arrays.


Float32 temp(lon)(lat);     // The maps ''lat'' and ''lon'' are used here and define a ''grid'' (aka ''coverage'')
<source lang="xml">
<Dataset name="vol_1_ce_11" dapVersion="4.0" dmrVersion="1.0" xml:base="file:dap4/vol_1_ce_11.xml"
  xmlns="http://xml.opendap.org/ns/DAP/4.0#" xmlns:dap="http://xml.opendap.org/ns/DAP/4.0#">
  <Dimension name="nlat" size="100"/>
  <Dimension name="nlon" size="50"/>
 
  <Float32 name="lat">
    <Dim name="nlat"/>
  </Float32>
  <Float32 name="lon">
    <Dim name="nlon"/>
  </Float32>
 
  <Float32 name="temp">
    <Dim name="nlon"/>
    <Dim name="nlat"/>
    <Map name="lon"/>
    <Map name="lat"/>
  </Float32>
 
</Dataset>
</source>
<source lang="c">
Dataset {
Dimensions: nlat=100, nlon=50;
Float32 lat[nlat];
Float32 lon[nlon];
 
Float32 temp(lon)(lat);
} vector_maps;
} vector_maps;
</source>
</source>


; ''{lat;lon;temp} | lat < 20, -100 < lon < -80, ND=-255'':  
; ''lat | lat < 20,ND=-255; lon | -100 < lon < -80, ND=-255'; temp'': Return all of ''lat'', ''lon'' and ''temp'' where ''lat'' and ''lon'' have ben filtered as per the predicates. The values of ''temp'' are not altered.
; ''{lat;lon;temp} [ 0 : 20][0 : 20] | lat < 20, -100 < lon < -80, ND=-255'':  
; ''nlat=[0:20]; nlon=[0:20]; lat | lat < 20, ND=-255; lon | -100 < lon < -80, ND=-255'; temp'': The same as above, but the maps ''lat'' and ''lon'' and the array ''temp'' are subset using the index subsetting expression and the resulting arrays are filtered. Note that because this is a coverage, the Maps' dimensions are tied to ''temp'''s dimensions using Shared Dimensions and we can use the Shared Dimension slicing to specify the slicing once and use it for all three of the Maps/Arrays. That is, ''lat'', ''lon'' and ''temp'' are each subset using ''[0:20][0:20]''. Contrast this with the corresponding example in the previous section where the slicing subset had to be explicitly specified for each Array.
; ''{lat;lon;temp} [ 0 : 20][0 : 20] | temp > 7, lat < 20, -100 < lon < -80, ND=-255'':  
; ''nlat=[0:20]; nlon=[0:20]; lat | lat < 20, ND=-255; lon | -100 < lon < -80, ND=-255'; temp > 7, ND=-255'': Same as above, but now ''temp'' is filtered too. Note that the filters applied to ''lat'' and ''lon'' have no affect on ''temp''.


<source lang="xml">
<Dataset name="vol_1_ce_12" dapVersion="4.0" dmrVersion="1.0" xml:base="file:dap4/vol_1_ce_12.xml"
  xmlns="http://xml.opendap.org/ns/DAP/4.0#" xmlns:dap="http://xml.opendap.org/ns/DAP/4.0#">
  <Dimension name="x" size="100"/>
  <Dimension name="y" size="50"/>
 
  <Float32 name="lat">
    <Dim name="x"/>
    <Dim name="y"/>
  </Float32>
  <Float32 name="lon">
    <Dim name="x"/>
    <Dim name="y"/>
  </Float32>
 
  <Float32 name="temp">
    <Dim name="x"/>
    <Dim name="y"/>
    <Map name="lon"/>
    <Map name="lat"/>
  </Float32>
 
</Dataset>
</source>
<source lang="c">
<source lang="c">
Dataset {
Dataset {
Dimensions: x=100, y=50; // the sizes of the dimensions
Dimensions: x=100, y=50;
Float32 lat[x][y];             // This is a (potential) map
Float32 lat[x][y];
Float32 lon[x][y];           // As is this
Float32 lon[x][y];


Float32 temp(lon)(lat);     // The maps ''lat'' and ''lon'' are used here and define a ''grid'' (aka ''coverage'')
Float32 temp(lon)(lat);
} two_dim_maps;
} two_dim_maps;
</source>
</source>


; ''{lat;lon;temp} | lat < 20, -100 < lon < -80, ND=-255'':  
; ''lat | lat < 20,ND=-255; lon | -100 < lon < -80, ND=-255'; temp'': These examples repeat the above three, but show that the same syntax applies to the case where the maps are N-dimensional (in this case N == 2).
; ''{lat;lon;temp} [ 0 : 20][0 : 20] | lat < 20, -100 < lon < -80, ND=-255'':  
; ''nlat=[0:20]; nlon=[0:20]; lat | lat < 20, ND=-255; lon | -100 < lon < -80, ND=-255'; temp'':  
; ''{lat;lon;temp} [ 0 : 20][0 : 20] | temp > 7, lat < 20, -100 < lon < -80, ND=-255'':  
; ''nlat=[0:20]; nlon=[0:20]; lat | lat < 20, ND=-255; lon | -100 < lon < -80, ND=-255'; temp > 7, ND=-255'':
 
<!--
 
I think this should be done using a function so that clients like NetCDF that want to hack around with/on an intermediate result will work as expected.
 
==== The Rubber band predicate for coverages ====
Because a ''coverage'' has two sets of values (domain and range), and because selection by value is so important for this data type, a special filtering operator is included in the grammar applicable only to coverages. The ''Rubber band'' predicate provides a way to filter the domain values of a coverage and use that result to subset the range values. At the same time, this subsetting does not take into account domain-specific aspects of the data in the range values (i.e. the Map). In this kind of filtering, rectangular bounding boxes are derived from the result of applying the filter predicates. For each map, the smallest bounding box that completely encloses all of the elements that satisfy the predicate is formed. Then, for the data array, the ''intersection'' of those bounding boxes is formed and the resulting index subset is applied to all of the variables in the ''anonymous grouping''.


==== Example: Filters that subset grids ====
==== Example: Filters that subset coverages ====
<source lang="c">
<source lang="c">
Dataset {
Dataset {
Dimensions: nlat=100, nlon=50; // the sizes of the dimensions
Dimensions: nlat=100, nlon=50;
Float32 lat[nlat];             // This is a (potential) map
Float32 lat[nlat];
Float32 lon[nlon];           // As is this
Float32 lon[nlon];


Float32 temp(lon)(lat);     // The maps ''lat'' and ''lon'' are used here and define a ''grid'' (aka ''coverage'')
Float32 temp(lon)(lat);
} vector_maps;
} vector_maps;
</source>
</source>
Line 310: Line 719:
<source lang="c">
<source lang="c">
Dataset {
Dataset {
Dimensions: x=100, y=50; // the sizes of the dimensions
Dimensions: x=100, y=50;
Float32 lat[x][y];             // This is a (potential) map
Float32 lat[x][y];
Float32 lon[x][y];           // As is this
Float32 lon[x][y];


Float32 temp(lon)(lat);     // The maps ''lat'' and ''lon'' are used here and define a ''grid'' (aka ''coverage'')
Float32 temp(lon)(lat);
} two_dim_maps;
} two_dim_maps;
</source>
</source>


; ''{lat;lon;temp} | lat << 20, -110 << lon << -80'':
; ''lat | lat << 20; lon | -110 << lon << -80; temp'': For the coverage consisting of ''lat'', ''lon'' and ''temp'', return the (index) subset of the three arrays by computing that index subset as follows: Find the smallest circumscribing bounding box for ''lat'' such that all values less than 20 are included. Perform the same operations on ''lon'', filtering as for the predicate ''-110 < lon < -80''. Then form the intersection of these bounding boxes and use the resulting bounding box as the index subsets for the shared dimensions. Note that the maps ''lat'' and ''lon'' may be returned with some values that are outside the range of values because the operation returns the circumscribing bounding box. Because these operators subset the shared dimensions, the value returned for ''temp'' will be subset.


==== Example of subsetting (with a filter) a 'projected' lat/lon array ====
'''Note:''' In practice the subsetting operators are evaluated when the CE is first parsed and information from that evaluation is recorded in the DMR. The resulting CDMR is then used to return the result where only the data values to be returned are read from the actual data store. In this case, however, the rubber band predicates are stored in the DMR and evaluated along with other filters at the time the data are sent. This in turn affects the order of evaluation of the CE's parts.
This is a common case: An array holds the range values of some sampled function and two one-dimensional arrays hold the latitude and longitude values. The data are (often) in some standard map projection like Mercator.


'''Note''':  
'''Note''': The '''<<''' and '''>>''' operators can only be used on ''maps''.
# In the notes, the name of the returned structure is usually the name of the first non-iterator variable in the projection/subsetting/filtering clause. What about allowing these 'made up' Structures to be anonymous?
-->
# In this first example, there's quite a bit of redundancy - suppose lat and lon are each 1024. In that case the total size of the arrays ''lat'', ''lon'', and ''temp'' is 1024<sup>2</sup>) + 2(1024) but the size of the return value with a filter is, worst case 3(1024<sup>2</sup>) - essentially it triples the response size in the worst case. A better response would be to return three structures. see below
== Grammar for Bison ==
Shown below is a LALR(1) grammar for the syntax described above.


<source lang="c">
Implementation note: A ''driver'' object is passed to the parser (and so is a ''scanner'' object) which facilitates not only error reporting but also the implementation of ''parser actions''. Given that a DDS/DDX/DMR object holds all of the variables for a given dataset, that can be referenced by the parser's driver object and methods on that driver object can be used to validate the semantics of the actions during the parse process. The same driver object can also be used to record information in that DDS/... object, or the variables it contains, for use later when building a response object/document (i.e., when serializing the variables that have been selected and (possibly) subset and filtered.
Dataset {
 
    Dimensions: lat = 10, lon = 5;
=== Production rules for ''bison'' ===
<source lang="bnf">
%start expression;
 
expression : clauses
| dimensions ";" clauses
;
 
dimensions : dimension
| dimensions ";" dimension
;
 
dimension : id "=" index
;
 
clauses : clause
| clauses ";" clause
;
                   
clause : subset
| subset "|" filter
;
 
// mark_variable returns a BaseType* or throws Error
subset : id
| id indexes
| id fields
| id indexes fields
;
 
indexes : index
| index indexes
;
 
index  : "[" "]"
| "[" WORD "]"
| "[" WORD ":" WORD "]"
| "[" WORD ":" WORD ":" WORD "]"
| "[" WORD ":" "]"
| "[" WORD ":" WORD ":" "]"
;
       
fields : "{" clauses "}"
;
 
filter : predicate
| filter "," predicate
;
 
// Here we use a grammar that is overly general: id op id is not really
// supported by the CE evaluator. However, id op constant, which captures
// the intent of the evaluator design introduces a number of reduce/reduce
// conflicts because any sensible definition of 'constant' will be the
// same as the definition of 'name'. This happens because we must make 'name'
// far more general than ideal (it must include tokens that start with digits
// odd characters that clash with the operators, et cetera). Note that the
// actions here must test for id == "ND" and op == "=", along with a host
// of other checks.
 
predicate : id op id
          | id op id op id
;
 
//          | "ND" "=" id
 
op : "<"
  | ">"
  | "<="
  | ">="
  | "=="
  | "!="
  | "~="
 
  | "<<"
  | ">>"
 
  | "@="
 
  | "="
;
 
id : path
| "/" path
| group "/" path
;
 
group : "/" name
| group "/" name
;


    Float64 lat[lat];
path : name
    Float64 lon[lon];
| path "." name
;


    Float64 temp[lat][lon];
// Because some formats/datasets allow 'any' name for a variable, it is possible
} temp_data;
// that a variable name will be a number, etc. The grammar also allows STRING
// to support "name"."name with spaces and dots (.)".x
name : WORD
| STRING
;
</source>
</source>


;<nowiki>{ i=[],j=[], temp[i][j], lat[i], lon[j| lon[j]>90 & lon[j]<128 & lat[i]>40 & lat[i]<60 }</nowiki>: Returns:
=== Scanner definitions and actions used by the production rules above ===
<source lang="c">
Here are the definitions and actions for a ''flex'' scanner that returns tokens for the above grammar. The quoted strings are very important because in the grammar, ''id''s can be either ''WORD''s or ''STRING''s (i.e., things wrapped in double quotes). This makes it easy to support variable names that include any character (including ones we attach special value to, like dot (.)) without ambiguity or needing to escape them. Only the backslash and double quote need to be escaped. Note that the ''WORD'' definition could be made much tighter, eliminating characters like percent (''%'') but then ''id''s that use those would have to be quoted. I decided to include all the characters in ''WORD'' I could without introducing ambiguity into the actions.
struct {
 
  int i;
(The syntax-dependent display is a bit hosed because of the actions for quoted strings.)
  int j;
<source lang="awk">
  Float64 lat;
/* This pattern ensures that a word does not start with '#' which
  Float64 lon;
    is the DAP2 comment character. */
  Float64 temp;
WORD    [-+a-zA-Z0-9_%*\\~@!][-+a-zA-Z0-9_%*\\#~@!]*
} [*];
 
%%
 
"["    return token::LBRACKET;
"]"    return token::RBRACKET;
":"    return token::COLON;
"," return token::COMMA;
";" return token::SEMICOLON;
"|"    return token::PIPE;
"{" return token::LBRACE;
"}" return token::RBRACE;
"/"    return token::GROUP_SEP;
"."    return token::PATH_SEP;
"="    return token::ASSIGN;
 
"=="    return token::EQUAL;
"!="    return token::NOT_EQUAL;
">" return token::GREATER;
">="    return token::GREATER_EQUAL;
"<"    return token::LESS;
"<="    return token::LESS_EQUAL;
"~="    return token::REGEX_MATCH;
"<<"    return token::LESS_BBOX;
">>"    return token::GREATER_BBOX;
"*="    return token::MASK;
 
[ \t]+  /* ignore these */
 
[\r\n]+ /* ignore these */
 
{WORD} { yylval->build<std::string>(yytext); return token::WORD; }
 
<INITIAL><<EOF>> return token::END;
 
["]   { BEGIN(quote); yymore(); }
 
<quote>[^"\\]*  yymore(); /* Anything that's not a double quote or a backslash */
 
<quote>[\\]["] yymore(); /* This matches the escaped double quote (\") */
 
<quote>[\\]{2}  yymore(); /* This matches an escaped escape (\\) */
 
<quote>[\\]{1}  {
                    BEGIN(INITIAL);
                    if (yytext) {
                        YY_FATAL_ERROR("Inside a string, backslash (\\) can escape a double quote or must itself be escaped (\\\\).");
                    }
                }
 
<quote>["]  {
                /* An unescaped double quote in the 'quote' state indicates the end of the string */
                BEGIN(INITIAL);
                yylval->build<std::string>(yytext);
                return token::STRING;
            }
 
<quote><<EOF>> {
                  BEGIN(INITIAL);   /* resetting the state is needed for reentrant parsers */
                  YY_FATAL_ERROR("Unterminated quote");
                }
 
.  {
        BEGIN(INITIAL);
        if (yytext) {
            YY_FATAL_ERROR("Characters found in the input were not recognized.");
        }
    }
 
</source>
</source>


As mentioned in the notes, this can increase the amount of data returned. A more efficient return form would be:
= Discussion =
<source lang="c">
 
struct {
[[User:dmh|Dennis]] : 9/1/13 '''Lexical Forms'''
  int i;
 
  Float64 lat;
I think that for specification purposes,
} lat[*];
we need to stick to the previously defined
</source>
lexical forms that are defined in section 11
<source lang="c">
of the specification. Introducing any alterations
struct {
in those lexical elements will just cause confusion.
  int j;
::[[User:Jimg|Jimg]] 09:34, 3 September 2013 (PDT) I agree and I'll look over that and try to unify the documents
    Float64 lon;
 
} lon[*];
[[User:dmh|Dennis]] 9/2/13: '''Curly Braces'''
</source>
 
<source lang="c">
The problem that leads to the use of curly
struct {
braces is this. What we are doing it specifying
  int i;
a series of paths from top-level variables
  int j;
of type Structure or Sequence through any nested
  Float64 temp;
Structures or Sequences to a final field
} temp[*];
''S1.S2.x'', ''S1.S2'' etc.
</source>
 
In some (limited?) cases, it would be desirable to
unify those paths so that common prefixes of the paths
need not be repeated. I say limited, because I have no
idea how common such repetions would occur in practice.
In any case, we can unify two paths ''S1.S2.x'' and ''S1.S3'',
say, as ''S1.{S2.x;S3}'' [Note that I am using ".{" instead
of just "{").
::[[User:Jimg|Jimg]] 09:51, 3 September 2013 (PDT) Hmmm. OK.
 
One issue that needs resolution is this.
Suppose, we want to unify S1.S2.x and S1.S2.
Writing ''S1.S2.{x}'' is clearly incorrect. We could write
''S1.{S2;S2.x}'', or perhaps ''S1.S2.{;x}'', or disallow it.
 
In any case, we need to specify some rules, such as,
do we require maximal prefixes or allow less than
maximal prefixes?
::[[User:Jimg|Jimg]] 09:57, 3 September 2013 (PDT) We agreed this would result in an error (s1.{s2,s2.x})
 
I must confess that using curly braces is, IMO,
less than optimal because I would prefer to reserve
it in case we need it for use in filter expressions.
The only alternative I can see is ''&lt;...&gt;'',
and that does not seem attractive.
 
[[User:dmh|Dennis]] 9/2/13: '''Anonymous Grouping'''
 
I am not enamored of this idea. Especially
when used with filters.  My interpretation
is that we are trying to again avoid repeating
information such as filters or dimension subsets.
 
I should note that Ferret addresses the same issue using
a form of ''Let x = ... in ...'' notation and I think
for our purposes some variant of this makes for a cleaner
solution.
 
In line with that I propose that we use a local variable
approach, where we can say things like ''D=0:4:12''
and F=|expr. Then these local variables can be used
later in the constraint.
 
::[[User:Jimg|Jimg]] 09:57, 3 September 2013 (PDT) I don't think we need to introduce local variables and we should avoid that if they are not absolutely needed.
 
[[User:dmh|Dennis]] 9/2/13: '''Variable Binding in Filters '''
 
Given a filter expression ''v|expr'',
the expression part of the filter will contain (at least?, only?)
one variable that must be bound. An issue to be solved is how to
bind all variables in a filter to something in the underlying
dataset, and what restrictions we put on that binding.
 
One example above:
:''Points2[10:20] | 20 < x < 40, y <35, sounding.depth > 10''
illustrates the binding problem. I would argue
this should be written:
:''Points2[10:20] | 20 < Points2.x < 40, Points2.y <35, Points2.sounding.depth > 10'
 
One side note: we should not allow any free variables that are
not bound to something on the immediate left side of the filter.
This means no references to data variables not on the left side
and no reference to outer variables when filters are nested.
::[[User:Jimg|Jimg]] 09:34, 3 September 2013 (PDT) I agree - no free variables and FQNs in the filter subexpressions.


=== Filters applied to 'Swath' data ===
[[User:dmh|Dennis]] 9/2/13: '''Sequence Filtering'''
Satellite Swath data illustrate an important special case for filters because it is a generalization of the syntax to a broader collection of sampled functions (what the OGC calls 'discrete coverages').


<source lang="c">
[Notation: S refers generally to a Structured object,
Dataset {
SQ to a Sequence, and A to an array of objects]
    Dimensions:
    x = 1024;
    y = 1024;
   
    Float64 lat[x][y];
    Float64 lon[x][y];


    Float32 u[x = lon][y = lat];
For a simple sequence, SQ, and given ''SQ|expression''
    Float32 v[x = lon[y = lat];
and assuming we somehow bind SQ to some variable x
} arrays;
in the expression, we need to decide what things inside
</source>
SQ (e.g. ''SQ.f1'') we are allowed to reference in the expression.
We had this discussion a long time ago, and I proposed
that we restrict it to only be able to reference scalar
atomic variables. So ''SQ.S[0].x'' would be disallowed.


In the above example, the syntax ''Float32 u[x = lon][y = lat];'' indicates that ''u'' is a two-dimensional variable in ''x'' and ''y'' but that it holds the range values for ''u'' for a function with the domain given by the maps ''lat'' and ''lon''. Values in ''lat'', ''lon'', and ''u'' are associated with each other by using the same values for the shared dimensions ''x'' and ''y''.
The binding issue is complicated if we want to return
some subportion of a filtered variable. So, we might
want to say ''SQ.{f1;f2}|SQ.f3>5''. This seems ok.


; <nowiki> { i = [], j = [], lat[i][j], lon[i][j], u[i][j], v[i][j] | lat[i][j] < 45 & lat[i][j] > 5 & lon[i][k] > -80 & lon[i][k] < -30 } </nowiki> :
Perhaps more interesting is this: ''S1[1:2].SQ.{f1;f2}| ?.x > 5.
<source lang="c">
Should this be written as ''S1[1:2].SQ.{f1;f2}| S1.SQ.x > 5.
struct {
  int i;
  int j;
  Float64 lat;
  Float64 lon;
  Float64 u;
  Float64 v;
} [*];
</source>
-->


== Grammar ==
[[User:dmh|Dennis]] 9/2/13: '''Array Filtering'''
expression :== selections


selections :== selection | selection ';' selections
When filtering arrays, many of the above issues
for sequences also occur.


selection :== subset | subset '|' filter
Let us assume an array, A and a filter expression
such as  ''A|expression''.


subset :== id | id indexes | id fields | id indexes fields | fields indexes
Suppose that A is of type Structure. Then presumably
we need to decide what fields inside
A (e.g. ''.f1'') we are allowed to reference in the expression.
Presumably, we can use the sequence proposal above
restrict it to only be able to reference scalar
atomic variables. So ''A.S[0].x'' would be disallowed.


indexes :== '[' ']' | [ INT ] | [ INT ':' INT ] | [ INT : INT : INT ] | [ INT : ] | [ INT : INT : ]
Question:  
'''Note''': The document originally included ''[ : INT ]'' and ''[ : INT : INT ]'' but, if our users cannot figure out that the first element of an index that starts at zero is zero ... Including a syntax for 'to the end' is useful since the same expression can be used when the array size changes. This might be useful in aggregations. For the INTs in this rule, no negative values are allowed.
Above, it was indicated that SQ[0:10] is legal, so
if A is Sequence, can we treat it like an array
and do Array filtering instead of tuple filtering?


fields :== '{' selections '}'
The issue of returning subfields of selected elements
of A also occurs. So, can we say
''A.{f1;f2}|A.f3>5''? I think this conflicts with the no-data
idea; perhaps, we need no-data for missing structure fields.


filter :== predicate | predicate ',' filter
[[User:dmh]Dennis]] 9/2/13: '''No Data'''


predicate :== id relop constant | constant relop id | constant relop id relop constant | 'ND' '=' constant
We agreed on the idea of no-data. The issue is specifying it.
'''Note''': The fourth alternative has some context sensitive rules that need to be tested for the parse to succeed: It can only occur for an array and the type of the value must match the type of the array (number is an int32, Float64, et c.; string to String or Url) and there can only be one instance per array.
The ''ND='' idea is good. However, the discussion above
seems to be implicitly assuming we are only return atomic variables.
What if what we are returning is a whole Structure possibly containing
nested structures?
::[[User:Jimg|Jimg]] 09:43, 3 September 2013 (PDT) Yes, at the point where I thought about filters on arrays of non-atomic types, I punted.  


constant :== INT | FLOAT | STRING
[[User:dmh|Dennis]] 9/2/13: '''@='''


relop :== '=' | '!=' | '<' | '<=' | '>' | '>=' | '~='
I need to think about this, but questions concern
'''Note''': The relop '''~=''' is a regex match and applies only to string types.
the situation when the map is more than 1-dimenensional.
I think this could better be handled using the local variable
proposal above.
::[[User:Jimg|Jimg]] 09:46, 3 September 2013 (PDT) I think there's an important point I didn't make in the part about this operator: I want the return type to be the same as the 'input type'. So if an Int32 a[1024][1024] is filtered using @=, then you get an Int32 [1024][1024] out. If the users wants a subset of the array to be filtered, they ask for that by subsetting the array and then filtering the result.


id :== groups vars | vars
[[User:dmh|Dennis]] 9/2/13: '''Restricting the use of filters'''


groups :== '/' | '/' name groups
After thinking about it, I believe we
should put serious limits (for now)
on where filters can be used.
Specifically, I propose the following.
# No nested filters - one filter is allowed for each clause (a clause is what is separated by semicolons)
# The top level var in the clause is the bound variable of the filter and its name is used as the variable inside the filter.
::[[User:Jimg|Jimg]] 11:10, 3 September 2013 (PDT) #1 I agree
::[[User:Jimg|Jimg]] 11:10, 3 September 2013 (PDT) #2 I agree; explanation: each clause has a top level name and that name is scope of the filter expression.


vars :== name | name '.' vars
IMO, Some consequences:
# No anonymous grouping
# scratch my named filters idea above (but keep the name range idea).b


name :== WORD | STRING
The reason I propose this
'''Note''': A STRING is text, including spaces, dots and slashes and pretty much anything else, wrapped in double quotes. Double quotes in the content of the STRING can be escaped by a backslash ('''\''') and, yes, the backslash can be escaped by itself. ...if we have to ;-)
is that our language is getting
too complicated and I want to see
how it implements before making it
more complex.


= Discussion =


<!-- 
<!-- 

Latest revision as of 20:13, 6 January 2014

<< Back to OPULS Development

Background

At the OPULS meeting in Boulder CO on 1-3 Oct 2012 we discussed how the concepts of the DAP2 constraint expressions could be extended to DAP4. We decided that the existing capabilities available in DAP2 should be included, but with some changes due to the differences in DAP4's data model and to fix problems with the DAP2 syntax and semantics. DAP4 constraint expressions (CEs) will support projection in much the same way that DAP2's CEs did (but we're going to call it subsetting). Variables are chosen from a dataset by listing them in the CE in a semicolon separated list. We decided that the language and syntax of the DAP2 selection was flawed and will be replaced by filters that are applied to single clauses in the CE, not to the whole CE as was the case with DAP2. By dropping the database language of relations we are making a distinction that the CE semantics does not intended to support the relational calculus, but does support choosing specific values from different types of variables. Unlike DAP2, these filters can be applied to arrays as well as Sequences. Lastly, we talked at some length about the relative merits of incorporating something akin to a functional programming language into the CE syntax and decided to do so.

Problem Addressed

Remote data access depends on having a simple and powerful query interface. Users must be able to choose parts of a large and often very complex dataset. Unlike most 'Big Data,' the datasets that DAP (and, hence, OPULS) target are highly structured, and the CE semantics must reflect this. In practice this means that we must be able to choose parts of a dataset by marking some or all of the variables it contains as the ones we would like to access. In addition, it is useful to be able to 'slice' array variables so that a subset, in index space, is returned. In addition, it is often necessary to subset variables by value, finding all of the elements of an array or sequence that are within a certain range.

Proposed Solution

The DAP4 Constraint Expression (CE) syntax is an extension of the syntax used by DAP2 that adds some important new features for Arrays and Coverages (Grids) as well as addressing some ambiguities in the DAP2 syntax. In this design we also introduce some new terminology to make the explanation of the CE syntax clearer. Additionally, we use a 'curly brace' notation for datasets to streamline the description of datasets because the XML documents that DAP4 servers produce is verbose and hard for humans to read.

A CE lists which variables and what parts of those variables a server should return to a client. When a client makes a request to a DAP4 server, it always sends a CE where an empty CE is interpreted to mean that the client wants the entire dataset sent. A CE is made up of a list of clauses, each of which names a variable in the dataset that the client would like the server to send to it. Each clause can further be broken down into two parts: The subset expression and the filter expression. There are limitations on the CE clauses depending on variable type. For scalar variables, getting the variable is the only option available, so the subset expression is limited and no filter expression is supported. Structure variables can be subset by field but do not support filter expressions. Sequences can be subset by field and do support filters. Arrays support index subsets and filters. Lastly, coverages support all of the things Array support but include some special filter operators that build subsets based on values.

New ideas/features added for DAP4 include:

  • Using a grouping operator for Structures and Sequences
  • Filters on Arrays
  • Masking for arrays (setting elements in one array to No Data based on the elements set to No Data in another array)
  • Computed subsetting for Coverages
  • Sequence filtering expressions explicitly bound to a specific Sequence variable

DAP4 type system notes:

  • Arrays always have fixed dimensions
  • Structures and Sequences can contain any type.
  • DAP2's Grid is gone; we use explicit Maps and Dimensions to provide a way to denote Coverages, which are more general than DAP2 Grids.
  • We provide the data model equivalent of arrays with varying dimensions by combining instances of Arrays and Sequences. This results in new characteristics for DAP4:
    • Sequences can contain Arrays
    • DAP4 supports Arrays of Sequences
    • DAP4 also supports Structures as fields of Sequences

Terminology used by this document

selection expression
The entire expression passed to the server that is used to choose specific parts of a dataset.
subset
The act of choosing parts of a dataset based on the type of one or more of its variables. We define several types of subsetting operations as follows:
index subsetting
Choosing parts of an array based on the indexes of that array's dimensions. This operation always returns an array of the same rank as the original, although the size of the return array will (likely) be smaller. Index subsetting uses the bracket syntax described later.
field subsetting
Choosing specific variables or fields from the dataset. A dataset in DAP4 is made up of a number of variables and those may be Structures or Sequences that contain fields (and, in effect, the Dataset is itself a Structure and all of its variables are fields - the distinction is more convenience than formal). Field subsetting using the brace syntax described later. One or more fields can be specified using a semicolon (;) as the separator.
filter
A filter is a predicate that can be used to choose data elements based on their values. the vertical bar (|) is used as a prefix operator for the filter predicate. Filters can be applied to elements of an Array or fields of a Sequence. A filter predicate consists of one or more filter subexpressions. One or more subexpressions can be specified, using a comma (,) as the separator.
filter subexpression
A simple expression that consists of a single variable/field, a relative operator (=, !=, <, <=, >, >=) for numbers or a comparison operator (=, ~=) for a string (~= is a regular expression compare) and new operators for Arrays and Coverages (<<, >>, @=).
id
The name of a variable. These can be relative or absolute. Absolute names use the FQN proposal (See ).
domain, range
A function is a mapping from a set of domain values to a set of range values; in a discrete function, these sets are finite.
discrete coverage
A discrete coverage is a (discrete) function where the indices of the arrays that hold the domain and range values have a one-to-one and onto mapping, with the important exception that in cases where the dimension of the arrays containing the domain values can be reduced without loosing information, that is done. This is purely an implementation optimization, but where applicable, it is nearly universally used. In DAP4 we call discrete coverages simply coverages or grids.
coverage
Synonymous with grid in this document.

Subsetting Constraints

The simplest constraint is the null string and it means 'return everything' from the dataset. Choosing variables in a dataset is referred to as the subset. To choose a subset of the variables in a dataset, enumerate them in a semicolon-separated list. To choose parts of a Structure, name those parts explicitly using the syntax structure name{field name} or structure name.field name. Each DAP4 dataset contains one or more Groups; the top-level Group is always present and is named / (pronounced 'root'). If the root Group is the only Group in the dataset, it does not need to be named when listing variables in the CE. However, if there are other Groups in the dataset, each Group other than the root Group must be named. In any case, naming the root Group is optional.

Names are case sensitive.

Example: subsetting by variable or field

<Dataset name="vol_1_ce_1" 
  dapVersion="4.0" 
  dmrVersion="1.0" 
  xml:base="file:dap4/test_ce_1.xml"
  xmlns="http://xml.opendap.org/ns/DAP/4.0#"
  xmlns:dap="http://xml.opendap.org/ns/DAP/4.0#">

  <Int32 name="u"/>
  <Int32 name="v"/>
  <Structure name="Point">
    <Int32 name="x"/>
    <Int32 name="y"/>
  </Structure>

</Dataset>

Note: The syntax used for the examples is (hopefully) easier to read than the DAP4 DMR which uses XML; Curly braces indicate hierarchy.

Dataset {
    Int32 u;
    Int32 v;
    Structure {
        Int32 x;
        Int32 y;
    } Point;
} vol_1_ce_1;
Access just u
u
Access just u and v
u;v
Access just x within Point
Point{x} This notation is based on the use of brackets in DAP4: Proposal for Structure Projection and DAP4: DAP4 Filter Constraints with the exception that braces ({}) are likely easier to parse than brackets ([]) given that arrays of both Structure and Sequence are possible and thus with arrays of these structures the grammar that defines the constraint expression syntax would become context sensitive.
Equivalent expression to access just x within Point
Point.x
Access u and v by explicitly naming their Group
/u;/v. Every dataset in DAP4 has a root Group, written /. When that is the only Group in a dataset, it is implicit in the CE, but you can still use its name explicitly.
<Dataset name="vol_1_ce_2">
  <Int32 name="u"/>
  <Int32 name="v"/>
  <Group name="inst2">
    <Int32 name="u"/>
    <Int32 name="v"/>
    <Structure name="Point">
      <Int32 name="x"/>
      <Int32 name="y"/>
    </Structure>
  </Group>
</Dataset>
Dataset {
    Int32 u;
    Int32 v;
    Group {
        Int32 u;
        Int32 v;
	Structure {
	    Int32 x;
	    Int32 y;
	} Point;
   } inst2;
} vol_1_ce_2;
Access 'top-level' u and v
/u;/v or u,v
Access 'top-level' u and v and inst2's u and v
/u;/v;/inst2/u;/inst2/v, or u;v;/inst2/u;/inst2/v
Access inst2's u and v
/inst2/u;/inst2/v
Access Point 's x, which is inside the inst2 Group
/inst2/Point{x} or /int2/Point.x.

Notes

  • Variable names are case-sensitive.
  • Using a semicolon is a change from DAP2 where clauses in the project part of the constraint were separated using a comma (,). We used semicolon because the comma is used elsewhere and using comma here made for a convoluted grammar. We wanted the grammar to be LALR(1) so that both table-driven and recursive-descent parsers would be easy to write.because it's easy to make both table and recursive descent parsers for these.
  • Every name in a constraint should be a fully qualified name. As a notational simplification, we assume that non-qualified names are actually at the top dataset level (i.e., in the root group). For example, 'u' is really '/u'.

Array Subsetting in Index Space

Subsetting fixed-size arrays in their index space is accomplished using square brackets. For an array with N dimensions, N sets of brackets are used, even if the array is only subset on some of the dimensions. The names of array variables are fully qualified names (FQNs) so it's possible to name arrays in structures and/or Groups. Array index values are zero-based as with a number of programming languages such as C and Java. Every array has a known starting index value of zero. Within the square brackets, several subexpressions are allowed:

[ n ]
return only the value(s) at a single index, where 0 <= n < N for a dimension of size N. This slicing operator does not reduce the dimensionality of an array, but does return a dimension size of one for the dimension to which this is applied.
[ ]
return all of elements elements for a particular dimension or apply a shared dimension slice (more on this later).
[ start : step : stop ]
return every step value between start and stop. This is the complete version of the syntax.
[ start : stop ]
return the values betweenstart and stop (start and stop define a closed interval).
[ start : ]
return the values from start to the end of the dimension.
[ start : step : ]
return every step value from start to the end of the dimension.

The subsetting operator can be applied to any array.

Example: Subsetting in Index Space

<Dataset name="vol_1_ce_3">
  
  <Int32 name="u">
    <Dim size="256"/>
    <Dim size="256"/>
  </Int32>
  <Int32 name="v">
    <Dim size="256"/>
    <Dim size="256"/>
  </Int32>
  <Structure name="Point">
    <Int32 name="x"/>
    <Int32 name="y"/>
    <Dim size="256"/>
  </Structure>
</Dataset>
Dataset {
    Int32 u[256][256];
    Int32 v[256][256];
    Structure {
        Int32 x;
        Int32 y;
    } Point[256];
} vol_1_ce_3;
Access all of u
u
Access all of Point 's x field
Point{x} or Point.x. This returns an array of Structures with a single (Int32) element, not an array of Int32.
Access elements 10 to 20 of array Point
Point[9:19]. DAP4, like DAP2, uses zero-based indexes. This CE will return the 10th to the 20th elements (Structures in this case) of the array
Access every 4th element in the Point array
Point[0:4:255], or Point[0:4:]. This is a simple decimation operation; this CE would return 64 Structures corresponding to elements at indexes 0, 3, 7, ..., 255 of the array.
The index-space and field subsetting may be combined in the logical way
Point[0:4:]{x} will return an array of structures (with 64 elements) named Point that contains a single Int32 field named x. Note that Point[0:4:].x is not accepted
Access parts of u and v
u[4:2:9];v[4:2:9]

Other possible CEs:

u[0:4:][0:4:]
every fourth element in both dimensions; this would return 1/16th of the array's data.
u[][9:19]
elements corresponding to every row and columns 10 to 20
u[7][9:19]
elements corresponding to the 8th row and columns 10 to 20
u[9:19][9:19]
elements corresponding to rows 10 to 20 and columns 10 to 20.
u[0:19][0:19]
elements corresponding to rows 0 to 20 and columns 0 to 20.
u[][]
identical to u, as are u[0:][0:] and u[0:1:][0:1:].

More complex subsetting examples

The data model for DAP4 is very similar to that of a modern structured programming language where constructor types like Structure may contain any allowed type (including other Structures, etc.) as well as being arrays themselves. The basic syntax for subsetting outlined so far can be applied to the fields of a Structure using braces to enclose the subsetting expression that apply to the fields of the Structure. This can be applied recursively.

<Dataset name="vol_1_ce_4">
  <Int32 name="u">
    <Dim size="256"/>
    <Dim size="1024"/>
  </Int32>
  <Structure name="Point">
    <Int32 name="x"/>
    <Int32 name="y">
      <Dim size="256"/>
    </Int32>
    <Int32 name="z">
      <Dim size="1024"/>
    </Int32>
    <Dim size="256"/>
  </Structure>
</Dataset>
Dataset {
    Int32 u[256][1024];
    Structure {
        Int32 x;
        Int32 y[1024];
        Int32 z[256];
    } Points[256];
} vol_1_ce_4;
Points{y[7:256]} or Points.y[7:256]
Get all of the elements of the Array of Structure Points and for each of those elements get the elements 7 to 256 from the array y. Do not return the field x.
Points[0:9]{y[0:9]} or Points[0:9].y[0:9]
Get the first ten elements of Points and, for each of those, only the first ten elements of the array y.
Points[0:9]{x;y[0:9]}
Get the first ten elements of Points and, for each of those, only x' and the first ten elements of the array y.
Points[0:9]
Get the first ten elements of Points (both fields are included)
Points or Points[] or Points[0:]
Get all of Points with the subtle difference that if Points uses a shared dimension, the last of the three CEs will replace that with an anonymous dimension (see the section on shared dimensions, below).
<Dataset name="vol_1_ce_5">
  <Int32 name="u">
    <Dim size="256"/>
    <Dim size="1024"/>
  </Int32>
  <Structure name="Points">
    <Int32 name="x"/>
    <Int32 name="y"/>
    <Structure name="sounding">
      <Int32 name="height">
        <Dim size="1024"/>
      </Int32>
      <Int32 name="pressure">
        <Dim size="1024"/>
      </Int32>
    </Structure>
    
    <Dim size="256"/>
  </Structure>
</Dataset>
Dataset {
    Int32 u[256][1024];
    Structure {
        Int32 x;
        Int32 y;
        Structure {
            Int32 height[1024];
            Int32 pressure[1024];
        } sounding;
    } Points[256];
} vol_1_ce_5;
Points[0]{x,y,sounding{height[0:8:]}}
Get only the first element of Points and, for that, get the fields x, y and sounding but for sounding get only every 8th element of the field height and elide the field pressure. An equivalent way of writing this expression is Points[0]{x,y,sounding.height[0:8:]}. The {} syntax provides an easy way to request x, y and sounding.height[0:8:] without having to repeat Points[0] three times. A CE like Points[0].x;Points[0].y;Points[0].soundings.height[0:8:] is legal, but Points[0] will only appear once in the result and a CE where Points is sliced differently is not legal. That is, Points[0].x;Points[0:10].y;Points[15].soundings.height[0:8:] is not legal because Points can appear only once in the result but has been sliced three different ways in the CE. In any CE, each variable can be constrained only one way.

How Sequences fit into this syntax

The Sequence type is more general data type in DAP4 than in DAP2 where it was significantly limited. In DAP4 Arrays of Sequences will be supported as will Sequence elements that are themselves Arrays. A Sequence variable is conceptually like a table where each field in the Sequence is a column in the table (or like an array of Structures, where the size of the single array dimension is a secret). Note that while there is a big difference between the value held by a Structure and a Sequence, each has the same subsetting syntax in the CE (although Sequences may have filters applied while Structures may not).

<Dataset name="vol_1_ce_6">
  <Sequence name="s1">
    <Int32 name="x"/>
    <Int32 name="y"/>
  </Sequence>
  
  <Sequence name="s2">
    <Int32 name="x"/>
    <Int32 name="y"/>
    <Dim size="100"/>
  </Sequence>
  
  <Sequence name="s3">
    <Int32 name="z"/>
    <Int32 name="x">
      <Dim size="10"/>
    </Int32>
  </Sequence>
  
  <Sequence name="s4">
    <Int32 name="z"/>
    <Int32 name="x">
      <Dim size="1024"/>
    </Int32>
    <Dim size="100"/>
  </Sequence>
  
</Dataset>
Dataset {
    Sequence {
        Int32 x;
        Int32 y;
    } s1;

   Sequence {
        Int32 x;
        Int32 y;
    } s2[100];

    Sequence {
        Int32 z;
        Int32 x[10];
    } s3;

     Sequence {
        Int32 z;
        Int32 x[1024];
    } s4[100];
} example;
s1
All of Sequence s1.
s1{x;y}
Also all of Sequence s1.
s1{x} or s1.x
every 'row' of Sequence s1, but just field x.
s2{x;y}
All one hundred elements of the Array s2. Same as s2 and s2[0:99]{x,y} or s2[]{x;y}.
s2[0:9]{x;y}
The first ten elements of s2. That would be 10 Sequences and for each, both the fields x and y.
s3{} | z < 10
Every instance of the Sequence s3 where z is less than 10. Note that this is the first example of a filter, a topic that is discussed in much more detail later on.


Subsetting and Shared Dimensions

Shared Dimensions provide additional information to indicate that a group of arrays share certain relationships; that specific groups of the arrays form coverages by indicating how dimensions of Maps and Arrays are linked. The DAP4 CE syntax provides a way to slice a Shared Dimension so that slice can be used by all of the Maps/Arrays that use it without repeating the slicing operation for each Map or Array. The syntax can be read 'Assign the shared dimension X this slice and looks like row=[10:19]. The only difference syntactically between a shared dimension slice and the slice operator applied to a one dimensional array is the assignment operator (=). All of the variations of the slice operator possible for an array are accepted for shared dimension slicing. In any CE, all of the shared dimension slicing clauses must precede the variable subsetting clauses.

Note DAP4 uses XML for it's actual grammar, and because that's wordy this document includes a mock notation. I will extend that notation used so far so it includes concepts needed to mimic DAP4's notation for a coverage:

  • The keyword Dimensions introduces a list of symbols and their sizes. (That is the definition of a Dimension in DAP4; a size bound to an identifier.)
  • Arrays where every dimension uses a Dimension to supply its extent are Maps. Maps are the arrays that hold the domain values for a coverage.
  • Arrays that use parenthesis () in place of brackets to indicate the sizes of dimensions and which use the names of maps to do so for at least one dimension hold the range values for the coverage. These are the coverage's data array (aka array as distinct from maps).
  • As with the previous examples, both the official and the mock syntax are shown and these datasets are available from the OPeNDAP test server.

Example of this syntax

<Dataset name="vol_1_ce_7">
  <Dimension name="nlat" size="100"/>
  <Dimension name="nlon" size="50"/>
  
  <Float32 name="lat">
    <Dim name="nlat"/>
  </Float32>
  <Float32 name="lon">
    <Dim name="nlon"/>
  </Float32>
  
  <Float32 name="temp">
    <Dim name="nlon"/>
    <Dim name="nlat"/>
    <Map name="lat"/>
    <Map name="lon"/>
  </Float32>

  <Float32 name="sal">
    <Dim name="nlon"/>
    <Dim name="nlat"/>
    <Map name="lat"/>
    <Map name="lon"/>
  </Float32>
  
  <Float32 name="O2">
    <Dim name="nlat"/>
    <Dim name="nlon"/>
    <Map name="lon"/>
    <Map name="lat"/>
  </Float32>
  
  <Float32 name="CO2">
    <Dim name="nlon"/>
    <Dim name="nlat"/>
    <Dim size="10"/>
    <Map name="lat"/>
    <Map name="lon"/>
  </Float32>
  
</Dataset>
Dataset {
    Dimensions: nlat=100, nlon=50; 
    Float32 lat[nlat];
    Float32 lon[nlon];

    // The maps ''lat'' and ''lon'' are used here and define a coverage
    Float32 temp(lon)(lat);
    Float32 sal(lon)(lat);
    Float32 O2(lat)(lon);
    Float32 CO2(lon)(lat)[10];
} shared_dimensions;

Examples of the anonymous grouping syntax and coverage subsetting by index

nlat=[0:9];nlon=[10:19];lat; lon; temp
This will return Dimensions nlat=10, nlon = 10, lat, lon and temp such that lat an lon are 10 element vectors and temp is a 10 x 10 array.
nlat=[0:9];nlon=[10:19];lat; lon; temp; sal
Same as above, but with both temp and sal included. This example shows how two or more arrays holding range variables can be accessed along with their Maps without sending multiple copies of the Maps. Similarly, ...
nlat=[0:9];nlon=[10:19];lat; lon
This CE requests just the arrays that hold the domain values, while ...
nlat=[0:9];nlon=[10:19];temp; sal
This CE requests just the arrays that hold the range values. Taken together, the two preceding examples support clients that read the domain values first and then display a map (for example) providing a way for someone to view the data's geographical extent before accessing the values them selves. Also note that there is no restriction that the same shared dimension slices must be used for both requests; like DAP2, each request in DAP4 is stateless.
nlat=[0:9];nlon=[10:19];temp[][]; sal[][]
This CE requests exactly the same data as the previous one, but uses the [] notation to indicate that the shared dimensions should be used for the subset. An example below shows how this notation can be used to mix local and shared dimension slicing.
nlat=[0:4:];nlon[0:4:];CO2
This CE decimates CO2 by returning every fourth value in the first two dimensions
nlat=[0:4:];nlon[0:4:];CO2[][][0:4:]
This CE introduces the second meaning for []. When the empty braces are used for a dimension that corresponds to a shared dimension, it means use the shared dimension slice. This is useful because some arrays contain a mixture of shared and anonymous dimensions and it's desirable to slice both, using a shared dimension slice previously defined where applicable and an anonymous slice where that's needed. This expression will decimate CO2 by four in each of its three dimensions.
nlat=[0:4:];nlon[0:4:];CO2[][1][0:4:]
To override the slicing provided by a shared dimension slice, simply replace the [] with a local dimension slice.

Constrained DMR Objects

When a DAP4 server receives a request for a Data response, it must build and return a Data Response Document that contains a text/xml part containing a DMR, a separator and a binary part that contains the data values. The organization of the Data Response Document is described in detail elsewhere in this document. In this section the focus is on the DMR returned in the first part of the response and how it relates to the DMR for the original unconstrained dataset. We refer to the original dataset's DMR as the DMR and the DMR associated with the data response as the CDMR (short-hand for Constrained DMR), although a data response can be generated using a null CE, we consider that a constraint, too.

The DMR contains a number of declarations for the dataset: Enumerations, Dimensions, Attributes, Groups and Variables. Each DMR and CDMR must follow the rules for the DMR described in this specification and, because DAP4 is a stateless protocol, each response from a server must stand on its own. Since a Constraint Expression alters the data returned (limiting variables, changing the size of dimensions and so on), it stands to reason that the contents of the CDMR will vary for any given dataset based on the CE. Furthermore, a goal of DAP4 is to specify that the CDMR be 'minimal' containing no unused definitions.

Because filters alter the values of variables, but not whether a variable is returned, they have no affect on the CDMR. Only the subsetting operators will be discussed here.

Enumerations

An enumeration is included in the CDMR only if some variable in the CE references it. A null CE returns the entire dataset, so it effectively references every variable. FIXME Make sure that 'variable' and 'field' are defined correctly in the terms section. jhrg 12/31/13

Shared Dimensions

Shared Dimension declarations from the DMR are not included in the CDMR unless the Shared Dimension is used by a variable that has been projected and that variable does not override that shared dimension using a local slicing operation. FIXME Refer to the DMR from the previous section and show some examples. jhrg 12/31/13

Variables

Each clause in the constraint must specify a variable and that variable will be declared in the CDMR. The variable must be referenced by a FQN unless it is declared at the top level of the DMR and the short-hand notation is used where the leading slash (/) is elided.

Array Variables

Array variables follow all the rules for Variables with the additional conditions that their dimensions may appear altered depending on the CE. If the local slicing operations are used, then the sliced dimensions will have the size given be the slice operator, not the size as shown in the full dataset's DMR. If a shared dimension is sliced and the Array uses that slice, then its size will reflect that. Arrays may mix shared dimension slices and local slices and the result must be correctly reflected in the specific variable's declaration.

Note that slicing never affects the rank of an array.

Structure Variables

If the variable is a Structure, then either the entire Structure is included or a subset of its fields will be included in the variable declaration where the fields are those specifically mentioned in a constraint projection. As with all other variables, each variable in the structure will have the same rank and type as the original declaration in the DMR.

Sequence Variables

If the variable is a Sequence, then for declaration purposes, it is treated like a Structure (as above). Note that applying a filter to a Sequence will not change its declaration form because the number of records in the sequence is not specified in the DMR. Note also that mentioning a Sequence field in the filter does not necessarily mean it will be included in the DMR. It will only be included if it is mentioned in the projection part of the constraint.

Groups

Each declaration in the CDMR that corresponds to a declaration in the DMR will cause its containing group (and that group's parents) to be included in the CDMR. This ensures that the FQN for a declaration in the CDMR is the same as in the DMR.

Attributes

Attributes are unaffected by the CE and are simply included in the CDMR, with the stipulation that attributes for variables that are not included in the CDMR won't be part of the CDMR. Essentially DAP4 views those attributes as part of the variables and explicitly excluding the variable from the CDMR (by providing a CE that does not include it) excludes its attributes too.

There is one situation that bears mention, however. Many datasets include attributes that describe domain-specific values for their values. For example, imagine a atmospheric profile that includes information about the minimum and maximum temperatures of that profile. If the values are stored in an array and the array is sliced so that only a subset of values are returned, the attributes will provide correct values for the original data but possibly not the data returned in the response because the slicing operation has removed some of the values of the array. Because DAP4 is a domain neutral protocol, it has no knowledge about how the values of a specific attribute relate to the values of the variable and cannot adjust the values of the attribute to match the CE.

Filters

While subsetting provides ways to choose data based on the dataset structure and the types of the variables, filters provide a way to choose data based on their values. The values to be returned are denoted using one or more simple predicates. The general syntax for a filter expression is to follow a subset expression with a pipe (|) and one or more filter predicates. Multiple predicates are separated by commas and the value of complete predicate is the logical AND of the comma-separated subexpressions.

Filter expressions can be applied to Sequence, Array and Coverage variables. In each case the result of the filter operation returns the same type variable. A Sequence variable is essentially a table of values and thus can be thought of as containing a number of rows and the filter expression is applied to each row in the order those rows are provided to the expression evaluator. Every row that satisfies the predicate will be included in the value returned; those that don't will not be included in the result. In each case, the filter is applied to each value of the Array or Sequence and the result returned. Note that no new values are computed by these operations; no interpolations, means, etc., are performed. A filter applied to an Array returns an Array with the same shape (the same rank and the same dimension sizes) but the elements that fail the predicate will be replaced with the variable's No Data value. The predicate is applied to each element of the array iteratively. For an Array, the value to use for No Data must be provided using the special sub-clause ND = value.

In the following, the behavior of filtering expressions on Sequences, Arrays and Coverages with be covered in three separate sections.

Filters on Sequences and Arrays

Example: Filters

<Dataset name="vol_1_ce_8">
  
  <Float32 name="temp">
    <Dim size="10"/>
  </Float32>
  
  <Float32 name="u">
    <Dim size="256"/>
    <Dim size="256"/>
  </Float32>

  <Float32 name="v">
    <Dim size="256"/>
    <Dim size="256"/>
  </Float32>

  <Sequence name="Points">
    <Int32 name="x"/>
    <Int32 name="y"/>
  </Sequence>
  
</Dataset>
Dataset {
    Float64 temp[10];
    Float32 u[256][256];
    Float32 v[256][256];
    Sequence {
        Int32 x;
        Int32 y;
    } Points;
} arrays;
temp|temp<7,ND=NaN
Return all ten elements of temp but replace those elements that do not satisfy the predicate temp < 7 with the No Data (ND) value.
temp|temp<7,ND=0
Same as the above, but use zero as the value for No Data.
u|u>10,ND=-1e-37
Return u with any element that is not > 10 set to the No Data value of -1 * 10-37.
u|u>10,ND=NaN
Return u with any element that is not > 10 set to the No Data value of Nan.
u[0:2:99][0:2:99]|u>10,ND=NaN
This shows how filtering can be combined with subsetting by index using this syntax. The index subsetting is performed first and the resulting 50 X 50 array is filtered. The return value is a 50 x 50 Float32 array with elements <= 10 replaced with NaN.

Filters and more complex data types

The basic syntax for filters is that there is a subsetting expression, a pipe (|) and then one or more filter predicates. This syntax can appear any place a selection expression can appear, so it can be used inside braces when an Array or Sequence is a field of a Structure or Sequence. Note that the filter expression prefix operator binds to the scope of the brace immediately to its left. Some examples follow.

Example: Filters on complex types

<Dataset name="vol_1_ce_9">
  <Sequence name="Points1">
    <Int32 name="x">
      <Dim size="100"/>
    </Int32>
    <Int32 name="y"/>
  </Sequence>
  
  <Sequence name="Points2">
    <Int32 name="x"/>
    <Int32 name="y"/>
    <Sequence name="sounding">
      <Int32 name="depth"/>
      <Int32 name="temp"/>
      <Dim size="20"/>
    </Sequence>
  </Sequence>
  
  <Structure name="Points3">
    <Int32 name="x"/>
    <Int32 name="y"/>
    <Sequence name="raw">
      <Int32 name="depth"/>
      <Int32 name="temps">
        <Dim size="4"/>
      </Int32>
      <Dim size="300"/>
    </Sequence>
  </Structure>
  
</Dataset>
Dataset {
    Sequence {
        Int32 x[100];
        Int32 y;
    } Points1;

    Sequence {
        Int32 x;
        Int32 y;
        Sequence {
            Int32 depth;
            Int32 temp;
        } sounding;
    } Points2[20];

    Structure {
        Int32 x;
        Int32 y;
        Sequence {
            Int32 depth;
            Int32 temps[4];
        } raw;
    } Points3[100]

} complex_types_example;
Points1{x[0:9]|x<10,ND=-127;y}|y<3
For the Sequence Points1, return the rows of data where y is less than 3. In those rows, subset x so that only the first ten elements are included and then filter that so that elements >= 10 are replaced by the No Data value of -127. Note that the filter sub-expression y<3 needs no No Data (ND) value because it is applied to the field of a Sequence and not the elements of a array.
Points2[10:19] { x; y; sounding | depth > 10 } | 20 < x < 40, y <35
This selection expression first finds the index subset of Points2 and arranges to return the fields x, y and sounding where xand y satisfy the predicates 20 < x < 40 and y <35 and for the field sounding, which is a Sequence itself, it will return both fields where depth > 10. This example points out an important aspect to the syntax and to expression evaluation: the order of evaluation of the filter predicates happens after the index and variable and/or field subsetting. The order of evaluation of the complete filter predicates can happen in any order (i.e., the 20 < x < 40, y <35 and depth > 10 predicates can happen in any order. The order of evaluation of the filter predicate subexpressions (i.e., 20 < x < 40 and y <35) is also unspecified. Another way to write this selection expression is...
Points2[10:19] | 20 < x < 40, y <35, sounding.depth > 10
The same result as above...
Points3[3:2:8] {x; y; raw{temps[2] | temps > 7,ND=-1}}
In this expression the temps field of the Sequence raw is still an Array, it's just an Array with a single element, which illustrates that neither the subsetting nor filtering operations alter the types of the variables.


The Mask operator may be applied only to an Array

For this predicate, the syntax is dest *= source where any elementi,j in the source array with the value equal to the given No Data value will result in the corresponding element of the dest array being set to that array's No Data value. Other values in the dest array are not affected. The intent of this is that filtering performed on a map array can then be applied to a data array that uses the map (although there's no actual restriction that this be only used with coverages; any two isomorphic arrays in an anonymous group can be used). To use the Mask (*=) predicate, the array on the Right Hand Side of the sub-clause must have been previously filtered (and so must have a given ND value).

<Dataset name="vol_1_ce_10">
  <Float32 name="lat">
    <Dim size="100"/>
    <Dim size="50"/>
  </Float32>
  
  <Float32 name="lon">
    <Dim size="100"/>
    <Dim size="50"/>
  </Float32>
  
  <Float32 name="temp">
    <Dim size="100"/>
    <Dim size="50"/>
  </Float32>
</Dataset>
Dataset {
	Float32 lat[100][50];
	Float32 lon[100][50];
	Float32 temp[100][50];
} vol_1_ce_10;
lat | lat < 20, ND=-255; temp | temp *= lat, ND=-255
Filter lat so that all values < 20 are replaced with the No Data value, then use that as a mask, making all of the corresponding elements of 'temp also be No Data. Note that this operator forces an order on filter evaluation.
lat[0:20][0:20] | lat < 20, ND=-255; lon[0:20][0:20] | -100 < lon < -80, ND=-255; temp[0:20][0:20] | temp *= lon, temp *= lat, ND=-255
This will result in temp effectively being masked by the logical AND of lat and lon

Filters and Coverages

The same filtering operations that can be applied to simple arrays can be applied to a coverage by simply using its Maps in the Constraint expression (i.e., the subset and filter sub-expressions). The array filtering operation previously described can easily be applied to a coverage. Below we show two cases; where the maps are vectors and where they are two-dimensional arrays.

<Dataset name="vol_1_ce_11" dapVersion="4.0" dmrVersion="1.0" xml:base="file:dap4/vol_1_ce_11.xml"
  xmlns="http://xml.opendap.org/ns/DAP/4.0#" xmlns:dap="http://xml.opendap.org/ns/DAP/4.0#">
  <Dimension name="nlat" size="100"/>
  <Dimension name="nlon" size="50"/>
  
  <Float32 name="lat">
    <Dim name="nlat"/>
  </Float32>
  <Float32 name="lon">
    <Dim name="nlon"/>
  </Float32>
  
  <Float32 name="temp">
    <Dim name="nlon"/>
    <Dim name="nlat"/>
    <Map name="lon"/>
    <Map name="lat"/>
  </Float32>
  
</Dataset>
Dataset {
	Dimensions: nlat=100, nlon=50; 
	Float32 lat[nlat];
	Float32 lon[nlon];

	Float32 temp(lon)(lat);
} vector_maps;
lat | lat < 20,ND=-255; lon | -100 < lon < -80, ND=-255'; temp
Return all of lat, lon and temp where lat and lon have ben filtered as per the predicates. The values of temp are not altered.
nlat=[0:20]; nlon=[0:20]; lat | lat < 20, ND=-255; lon | -100 < lon < -80, ND=-255'; temp
The same as above, but the maps lat and lon and the array temp are subset using the index subsetting expression and the resulting arrays are filtered. Note that because this is a coverage, the Maps' dimensions are tied to temp's dimensions using Shared Dimensions and we can use the Shared Dimension slicing to specify the slicing once and use it for all three of the Maps/Arrays. That is, lat, lon and temp are each subset using [0:20][0:20]. Contrast this with the corresponding example in the previous section where the slicing subset had to be explicitly specified for each Array.
nlat=[0:20]; nlon=[0:20]; lat | lat < 20, ND=-255; lon | -100 < lon < -80, ND=-255'; temp > 7, ND=-255
Same as above, but now temp is filtered too. Note that the filters applied to lat and lon have no affect on temp.
<Dataset name="vol_1_ce_12" dapVersion="4.0" dmrVersion="1.0" xml:base="file:dap4/vol_1_ce_12.xml"
  xmlns="http://xml.opendap.org/ns/DAP/4.0#" xmlns:dap="http://xml.opendap.org/ns/DAP/4.0#">
  <Dimension name="x" size="100"/>
  <Dimension name="y" size="50"/>
  
  <Float32 name="lat">
    <Dim name="x"/>
    <Dim name="y"/>
  </Float32>
  <Float32 name="lon">
    <Dim name="x"/>
    <Dim name="y"/>
  </Float32>
  
  <Float32 name="temp">
    <Dim name="x"/>
    <Dim name="y"/>
    <Map name="lon"/>
    <Map name="lat"/>
  </Float32>
  
</Dataset>
Dataset {
	Dimensions: x=100, y=50;
	Float32 lat[x][y];
	Float32 lon[x][y];

	Float32 temp(lon)(lat);
} two_dim_maps;
lat | lat < 20,ND=-255; lon | -100 < lon < -80, ND=-255'; temp
These examples repeat the above three, but show that the same syntax applies to the case where the maps are N-dimensional (in this case N == 2).
nlat=[0:20]; nlon=[0:20]; lat | lat < 20, ND=-255; lon | -100 < lon < -80, ND=-255'; temp
nlat=[0:20]; nlon=[0:20]; lat | lat < 20, ND=-255; lon | -100 < lon < -80, ND=-255'; temp > 7, ND=-255

Grammar for Bison

Shown below is a LALR(1) grammar for the syntax described above.

Implementation note: A driver object is passed to the parser (and so is a scanner object) which facilitates not only error reporting but also the implementation of parser actions. Given that a DDS/DDX/DMR object holds all of the variables for a given dataset, that can be referenced by the parser's driver object and methods on that driver object can be used to validate the semantics of the actions during the parse process. The same driver object can also be used to record information in that DDS/... object, or the variables it contains, for use later when building a response object/document (i.e., when serializing the variables that have been selected and (possibly) subset and filtered.

Production rules for bison

%start expression;

expression : clauses
| dimensions ";" clauses 
;

dimensions : dimension
| dimensions ";" dimension
;

dimension : id "=" index
;

clauses : clause
| clauses ";" clause
;
                    
clause : subset
| subset "|" filter
;

// mark_variable returns a BaseType* or throws Error
subset : id 
| id indexes 
| id fields 
| id indexes fields 
;

indexes : index 
| index indexes
;
   
index   : "[" "]"
| "[" WORD "]"
| "[" WORD ":" WORD "]"
| "[" WORD ":" WORD ":" WORD "]"
| "[" WORD ":" "]"
| "[" WORD ":" WORD ":" "]"
;
        
fields : "{" clauses "}"
;

filter : predicate 
| filter "," predicate
;

// Here we use a grammar that is overly general: id op id is not really
// supported by the CE evaluator. However, id op constant, which captures
// the intent of the evaluator design introduces a number of reduce/reduce
// conflicts because any sensible definition of 'constant' will be the
// same as the definition of 'name'. This happens because we must make 'name'
// far more general than ideal (it must include tokens that start with digits
// odd characters that clash with the operators, et cetera). Note that the
// actions here must test for id == "ND" and op == "=", along with a host
// of other checks.

predicate : id op id
          | id op id op id
;

//           | "ND" "=" id

op : "<"
   | ">"
   | "<="
   | ">="
   | "=="
   | "!="
   | "~="

   | "<<"
   | ">>"

   | "@="
   
   | "="
;

id : path
| "/" path
| group "/" path
;

group : "/" name
| group "/" name
;

path : name 
| path "." name
;

// Because some formats/datasets allow 'any' name for a variable, it is possible
// that a variable name will be a number, etc. The grammar also allows STRING
// to support "name"."name with spaces and dots (.)".x
name : WORD 
| STRING 
;

Scanner definitions and actions used by the production rules above

Here are the definitions and actions for a flex scanner that returns tokens for the above grammar. The quoted strings are very important because in the grammar, ids can be either WORDs or STRINGs (i.e., things wrapped in double quotes). This makes it easy to support variable names that include any character (including ones we attach special value to, like dot (.)) without ambiguity or needing to escape them. Only the backslash and double quote need to be escaped. Note that the WORD definition could be made much tighter, eliminating characters like percent (%) but then ids that use those would have to be quoted. I decided to include all the characters in WORD I could without introducing ambiguity into the actions.

(The syntax-dependent display is a bit hosed because of the actions for quoted strings.)

/* This pattern ensures that a word does not start with '#' which
    is the DAP2 comment character. */
WORD    [-+a-zA-Z0-9_%*\\~@!][-+a-zA-Z0-9_%*\\#~@!]* 

%%

"["     return token::LBRACKET;
"]"     return token::RBRACKET;
":"     return token::COLON;
","	return token::COMMA;
";"	return token::SEMICOLON;
"|"     return token::PIPE;
"{"	return token::LBRACE;
"}"	return token::RBRACE;
"/"     return token::GROUP_SEP;
"."     return token::PATH_SEP;
"="     return token::ASSIGN;

"=="    return token::EQUAL;
"!="    return token::NOT_EQUAL;
">"	return token::GREATER;
">="    return token::GREATER_EQUAL;
"<"     return token::LESS;
"<="    return token::LESS_EQUAL;
"~="    return token::REGEX_MATCH;
"<<"    return token::LESS_BBOX;
">>"    return token::GREATER_BBOX;
"*="    return token::MASK;

[ \t]+  /* ignore these */

[\r\n]+ /* ignore these */

{WORD}  { yylval->build<std::string>(yytext); return token::WORD; }

<INITIAL><<EOF>> return token::END;

["]    { BEGIN(quote); yymore(); }

<quote>[^"\\]*  yymore(); /* Anything that's not a double quote or a backslash */

<quote>[\\]["]	yymore(); /* This matches the escaped double quote (\") */

<quote>[\\]{2}  yymore(); /* This matches an escaped escape (\\) */

<quote>[\\]{1}  {
                    BEGIN(INITIAL);
                    if (yytext) {
                        YY_FATAL_ERROR("Inside a string, backslash (\\) can escape a double quote or must itself be escaped (\\\\).");
                    }
                }

<quote>["]  { 
                /* An unescaped double quote in the 'quote' state indicates the end of the string */
                BEGIN(INITIAL); 
                yylval->build<std::string>(yytext); 
                return token::STRING;
            }

<quote><<EOF>>	{
                  BEGIN(INITIAL);   /* resetting the state is needed for reentrant parsers */
                  YY_FATAL_ERROR("Unterminated quote");
                }

.   {
        BEGIN(INITIAL);
        if (yytext) {
            YY_FATAL_ERROR("Characters found in the input were not recognized.");
        }
    }

Discussion

Dennis : 9/1/13 Lexical Forms

I think that for specification purposes, we need to stick to the previously defined lexical forms that are defined in section 11 of the specification. Introducing any alterations in those lexical elements will just cause confusion.

Jimg 09:34, 3 September 2013 (PDT) I agree and I'll look over that and try to unify the documents

Dennis 9/2/13: Curly Braces

The problem that leads to the use of curly braces is this. What we are doing it specifying a series of paths from top-level variables of type Structure or Sequence through any nested Structures or Sequences to a final field S1.S2.x, S1.S2 etc.

In some (limited?) cases, it would be desirable to unify those paths so that common prefixes of the paths need not be repeated. I say limited, because I have no idea how common such repetions would occur in practice. In any case, we can unify two paths S1.S2.x and S1.S3, say, as S1.{S2.x;S3} [Note that I am using ".{" instead of just "{").

Jimg 09:51, 3 September 2013 (PDT) Hmmm. OK.

One issue that needs resolution is this. Suppose, we want to unify S1.S2.x and S1.S2. Writing S1.S2.{x} is clearly incorrect. We could write S1.{S2;S2.x}, or perhaps S1.S2.{;x}, or disallow it.

In any case, we need to specify some rules, such as, do we require maximal prefixes or allow less than maximal prefixes?

Jimg 09:57, 3 September 2013 (PDT) We agreed this would result in an error (s1.{s2,s2.x})

I must confess that using curly braces is, IMO, less than optimal because I would prefer to reserve it in case we need it for use in filter expressions. The only alternative I can see is <...>, and that does not seem attractive.

Dennis 9/2/13: Anonymous Grouping

I am not enamored of this idea. Especially when used with filters. My interpretation is that we are trying to again avoid repeating information such as filters or dimension subsets.

I should note that Ferret addresses the same issue using a form of Let x = ... in ... notation and I think for our purposes some variant of this makes for a cleaner solution.

In line with that I propose that we use a local variable approach, where we can say things like D=0:4:12 and F=|expr. Then these local variables can be used later in the constraint.

Jimg 09:57, 3 September 2013 (PDT) I don't think we need to introduce local variables and we should avoid that if they are not absolutely needed.

Dennis 9/2/13: Variable Binding in Filters

Given a filter expression v|expr, the expression part of the filter will contain (at least?, only?) one variable that must be bound. An issue to be solved is how to bind all variables in a filter to something in the underlying dataset, and what restrictions we put on that binding.

One example above:

Points2[10:20] | 20 < x < 40, y <35, sounding.depth > 10

illustrates the binding problem. I would argue this should be written:

Points2[10:20] | 20 < Points2.x < 40, Points2.y <35, Points2.sounding.depth > 10'

One side note: we should not allow any free variables that are not bound to something on the immediate left side of the filter. This means no references to data variables not on the left side and no reference to outer variables when filters are nested.

Jimg 09:34, 3 September 2013 (PDT) I agree - no free variables and FQNs in the filter subexpressions.

Dennis 9/2/13: Sequence Filtering

[Notation: S refers generally to a Structured object, SQ to a Sequence, and A to an array of objects]

For a simple sequence, SQ, and given SQ|expression and assuming we somehow bind SQ to some variable x in the expression, we need to decide what things inside SQ (e.g. SQ.f1) we are allowed to reference in the expression. We had this discussion a long time ago, and I proposed that we restrict it to only be able to reference scalar atomic variables. So SQ.S[0].x would be disallowed.

The binding issue is complicated if we want to return some subportion of a filtered variable. So, we might want to say SQ.{f1;f2}|SQ.f3>5. This seems ok.

Perhaps more interesting is this: S1[1:2].SQ.{f1;f2}| ?.x > 5. Should this be written as S1[1:2].SQ.{f1;f2}| S1.SQ.x > 5.

Dennis 9/2/13: Array Filtering

When filtering arrays, many of the above issues for sequences also occur.

Let us assume an array, A and a filter expression such as A|expression.

Suppose that A is of type Structure. Then presumably we need to decide what fields inside A (e.g. .f1) we are allowed to reference in the expression. Presumably, we can use the sequence proposal above restrict it to only be able to reference scalar atomic variables. So A.S[0].x would be disallowed.

Question: Above, it was indicated that SQ[0:10] is legal, so if A is Sequence, can we treat it like an array and do Array filtering instead of tuple filtering?

The issue of returning subfields of selected elements of A also occurs. So, can we say A.{f1;f2}|A.f3>5? I think this conflicts with the no-data idea; perhaps, we need no-data for missing structure fields.

[[User:dmh]Dennis]] 9/2/13: No Data

We agreed on the idea of no-data. The issue is specifying it. The ND= idea is good. However, the discussion above seems to be implicitly assuming we are only return atomic variables. What if what we are returning is a whole Structure possibly containing nested structures?

Jimg 09:43, 3 September 2013 (PDT) Yes, at the point where I thought about filters on arrays of non-atomic types, I punted.

Dennis 9/2/13: @=

I need to think about this, but questions concern the situation when the map is more than 1-dimenensional. I think this could better be handled using the local variable proposal above.

Jimg 09:46, 3 September 2013 (PDT) I think there's an important point I didn't make in the part about this operator: I want the return type to be the same as the 'input type'. So if an Int32 a[1024][1024] is filtered using @=, then you get an Int32 [1024][1024] out. If the users wants a subset of the array to be filtered, they ask for that by subsetting the array and then filtering the result.

Dennis 9/2/13: Restricting the use of filters

After thinking about it, I believe we should put serious limits (for now) on where filters can be used. Specifically, I propose the following.

  1. No nested filters - one filter is allowed for each clause (a clause is what is separated by semicolons)
  2. The top level var in the clause is the bound variable of the filter and its name is used as the variable inside the filter.
Jimg 11:10, 3 September 2013 (PDT) #1 I agree
Jimg 11:10, 3 September 2013 (PDT) #2 I agree; explanation: each clause has a top level name and that name is scope of the filter expression.

IMO, Some consequences:

  1. No anonymous grouping
  2. scratch my named filters idea above (but keep the name range idea).b

The reason I propose this is that our language is getting too complicated and I want to see how it implements before making it more complex.