DAP4: Constraint Expressions, v2: Difference between revisions

From OPeNDAP Documentation
Revision as of 20:58, 1 August 2013


Background

At the OPULS meeting in Boulder CO on 1-3 Oct 2012 we discussed how the concepts of the DAP2 constraint expressions could be extended to DAP4. We decided that the existing capabilities available in DAP2 should be included, but with some changes due to the differences in DAP4's data model and to fix problems with the DAP2 syntax and semantics. DAP4 constraint expressions (CEs) will support projection in much the same way that DAP2's CEs did. Variables are chosen from a dataset by listing them in the CE in a comma-separated list. We decided that the language and syntax of the selection was flawed and will be replaced by filters that are applied to single clauses in the CE, not to the whole CE as was the case with DAP2. By dropping the database language of relations we are making clear that the CE semantics are not intended to support the relational calculus, but do support choosing specific values from different types of variables. Unlike DAP2, these filters can be applied to arrays as well as 'sequences' (which we have technically dropped from the data model in favor of structures with varying dimensions). Lastly, we talked at some length about the relative merits of incorporating a functional programming language into the CE syntax and decided to do so.

Problem Addressed

Remote data access depends on having a simple and powerful query interface. Users must be able to choose parts of a large and often very complex dataset. Unlike most 'Big Data,' the datasets that DAP (and, hence, OPULS) targets are highly structured, and the CE semantics must reflect this. In practice this means that we must be able to choose parts of a dataset by marking some or all of the variables it contains as the ones we would like to access. In addition, it is useful to be able to 'slice' array variables so that a subset, in index space, is returned. Finally, it is often necessary to subset variables by value, accessing all of the elements of an array or structure that are within a certain range.

Proposed Solution

The proposed solution is based on the existing DAP2 constraint expression syntax, with two main modifications that result from differences in the DAP4 data model as well as fixes for ambiguities in the DAP2 CE syntax and semantics. The ambiguities arose from applying a DAP2 selection to the entire CE, which didn't always make sense (so the user or evaluator had to make some arbitrary choices about what a particular selection subexpression meant) and from calling functions on the dataset or in the selection part of the CE. The latter was less of an ambiguity than an annoyance for users because important metadata about the potential return from a function was not available in the DAP2 CE. Lastly, some subsetting expressions would alter the type of the variable(s) they returned (i.e., Grids could be returned as Structures in some cases) and this behavior was either the cause of much complexity or not handled properly by clients.

Terminology used by this document:

selection expression
The entire expression passed to the server that is used to choose specific parts of a dataset.
subset
The act of choosing parts of a dataset based on the type of one or more of its variables. We define several types of subsetting operations as follows:
index subsetting
Choosing parts of an array based on the indexes of that array's dimensions. This operation always returns an array of the same rank as the original, although the size of the return array will (likely) be smaller. Index subsetting uses the bracket syntax described later.
field subsetting
Choosing specific variables or fields from the dataset. A dataset in DAP4 is made up of a number of variables and those may be Structures or Sequences that contain fields (and, in effect, the Dataset is itself a Structure and all of its variables are fields - the distinction is more convenience than formal). Field subsetting uses the brace syntax described later. One or more fields can be specified using a semicolon (;) as the separator.
filter
A filter is a predicate that can be used to choose data elements based on their values. The vertical bar (|) is used as a prefix operator for the filter predicate. Filters can be applied to elements of an Array or fields of a Sequence. A filter predicate consists of one or more filter subexpressions, using a comma (,) as the separator.
filter subexpression
A simple expression that consists of a single variable/field, an operator and a constant. The operator is a relational operator (=, !=, <, <=, >, >=) for numbers or a comparison operator (=, ~=) for a string (~= is a regex compare).
id
The name of a variable. These can be relative or absolute. Absolute names use the FQN proposal (See ).

Subsetting Constraints

The simplest constraint is the null string and it means 'return everything' from the dataset. Choosing variables in a dataset is referred to as the subset. To choose a subset of the variables in a dataset, enumerate them in a semicolon-separated list. To choose parts of a Structure, name those parts explicitly using the syntax structure name{field name}. Each DAP4 dataset contains one or more Groups; the top-level Group is always present and is named / (pronounced 'root'). If the root Group is the only Group in the dataset, it does not need to be named when listing variables in the CE. However, if there are other Groups in the dataset, each Group other than the root Group must be named. In any case, naming the root Group is optional.

Names are case sensitive.

Example: subsetting by variable or field

Note: The syntax used for the examples is (hopefully) easier to read than the DAP4 DMR which uses XML; Curly braces indicate hierarchy.

Dataset {
    Int32 u;
    Int32 v;
    Structure {
        Int32 x;
        Int32 y;
    } Point;
} projections;

Note: Variable names are case-sensitive.

Access just u
u
Access just u and v
u;v
Access just x within Point
Point{x}. This notation is based on the use of brackets in DAP4: Proposal for Structure Projection and DAP4: DAP4 Filter Constraints, with the exception that braces ({}) are likely easier to parse than brackets ([]): arrays of both Structure and Sequence are possible, so with brackets the grammar that defines the constraint expression syntax would become context sensitive.
Access u and v by explicitly naming their Group
/u;/v. Every dataset in DAP4 has a root Group, written /. When that is the only Group in a dataset, it is implicit in the CE, but you can still use its name explicitly.
Dataset {
    Int32 u;
    Int32 v;
    Group {
        Int32 u;
        Int32 v;
	Structure {
	    Int32 x;
	    Int32 y;
	} Point;
   } inst2;
} Explicit_Group;
Access 'top-level' u and v
/u;/v or u;v
Access 'top-level' u and v and inst2's u and v
/u;/v;/inst2/u;/inst2/v, or u;v;/inst2/u;/inst2/v
Access inst2's u and v
/inst2/u;/inst2/v
Access Point 's x, which is inside the inst2 Group
/inst2/Point{x}
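
The naming rules above can be sketched in Python. This is only an illustration: the dataset is modeled as nested dictionaries, and both that layout and the helper name resolve are assumptions made for the example, not part of DAP4.

```python
# Hypothetical model of the Explicit_Group dataset: Groups and Structures are
# dicts, and leaf variables are represented by their type names.
dataset = {
    "u": "Int32",
    "v": "Int32",
    "inst2": {
        "u": "Int32",
        "v": "Int32",
        "Point": {"x": "Int32", "y": "Int32"},
    },
}


def resolve(root, name):
    """Look up a variable by a name such as 'u', '/u' or '/inst2/u'.

    A leading '/' names the root Group explicitly; it is optional, so
    'u' and '/u' resolve to the same variable. Names are case sensitive.
    """
    node = root
    for part in name.lstrip("/").split("/"):
        node = node[part]  # raises KeyError for an unknown name
    return node
```

With this model, resolve(dataset, "/u") and resolve(dataset, "u") find the top-level u, while resolve(dataset, "/inst2/u") finds the one inside the inst2 Group.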

Array Subsetting in Index Space

Subsetting fixed-size arrays in their index space is accomplished using square brackets. For an array with N dimensions, N sets of brackets are used, even if the array is only subset on some of the dimensions. The names of array variables are fully qualified names (FQNs) so it's possible to name arrays in structures and/or Groups. Array index values are zero-based, as in a number of programming languages such as C and Java, so every array has a known starting index value of zero. Within the square brackets, several subexpressions are allowed:

[ n ]
return only the value(s) at a single index, where 0 <= n < N for a dimension of size N. This still returns an array, just with a dimension size of one. This is done to preserve the type of the variable.
[ ]
return all of the elements for a particular dimension
[ start : step : stop ]
return every step-th value between start and stop. This is the complete version of the syntax.
[ start : stop ]
return the values between start and stop (start and stop define a closed interval).
[ start : ]
return the values from start to the end of the dimension.
[ start : step : ]
return every step-th value from start to the end of the dimension.

The subsetting operator can be applied to any array.
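
As a rough illustration of these bracket forms, the sketch below converts the text inside one pair of brackets into a Python slice. The function name to_slice is hypothetical, and the code assumes well-formed, non-negative input; it does no validation.

```python
def to_slice(expr, size):
    """Convert the text inside one pair of brackets to a Python slice.

    DAP4 intervals are closed (stop is included) and the step, when
    present, is the middle field: start:step:stop.
    """
    parts = [p.strip() for p in expr.split(":")]
    if parts == [""]:                  # [ ] selects every element
        return slice(0, size, 1)
    if len(parts) == 1:                # [ n ] is still an array, of size one
        n = int(parts[0])
        return slice(n, n + 1, 1)
    start = int(parts[0])
    if len(parts) == 2:                # [ start : stop ] or [ start : ]
        step, stop = 1, parts[1]
    else:                              # [ start : step : stop ] or [ start : step : ]
        step, stop = int(parts[1]), parts[2]
    # A closed DAP4 interval becomes a half-open Python one.
    end = size if stop == "" else int(stop) + 1
    return slice(start, end, step)
```

For a dimension of size 256, to_slice("9:19", 256) selects indexes 9 through 19, and to_slice("0:4:", 256) selects every fourth index starting at zero.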

Example: Subsetting in Index Space

Dataset {
    Int32 u[256][256];
    Int32 v[256][256];
    Structure {
        Int32 x;
        Int32 y;
    } Point[256];
} arrays;
Access all of u
u
Access all of Point 's x field
Point{x}. This returns an array of Structures with a single (Int32) element, not an array of Int32. The same would be returned by Point{x}[] and Point{x}[0:1:255].
Access elements 10 to 20 of array Point
Point[9:19]. DAP4, like DAP2, uses zero-based indexes. This CE will return the 10th to the 20th elements (Structures in this case) of the array
Access every 4th element in the Point array
Point[0:4:255], or Point[0:4:]. This is a simple decimation operation; this CE would return 64 Structures corresponding to elements at indexes 0, 4, 8, ..., 252 of the array.
The index-space and field subsetting may be combined in the logical way
Point{x}[0:4:] will return an array of structures (with 64 elements) named Point that contains a single Int32 field named x.
Access parts of u and v
u[4:2:9];v[4:2:9]

Other possible CEs:

u[0:4:][0:4:]
every fourth element in both dimensions; this would return 1/16th of the array's data.
u[][9:19]
elements corresponding to every row and columns 10 to 20
u[7][9:19]
elements corresponding to the 8th row and columns 10 to 20
u[9:19][9:19]
elements corresponding to rows 10 to 20 and columns 10 to 20.
u[0:19][0:19]
elements corresponding to rows 1 to 20 and columns 1 to 20 (indexes 0 to 19).
u[][]
identical to u, as are u[0:][0:] and u[0:1:][0:1:].

More complex subsetting examples

The data model for DAP4 is very similar to that of a modern structured programming language where constructor types like Structure may contain any allowed type, including other Structures and arrays of Structures as well as being arrays themselves. The basic syntax outlined so far for Structures, the selection of fields within a Structure and array dimension subsetting by index can be applied to these 'recursive' types by following the rules laid out in the basic cases. Some examples follow:

Dataset {
    Int32 u[256][1024];
    Structure {
        Int32 x;
        Int32 y[1024];
    } Points[256];
} example;
Points{y[7:256]}
Get all of the elements of the Array of Structure Points and for each of those elements get the elements at indexes 7 to 256 from the array y. Do not return the field x.
Points{y[0:9]}[0:9]
Get the first ten elements of Points and, for each of those, only the first ten elements of the array y.
Points[0:9]
Get the first ten elements of Points (both fields are included)
Points
Get all of Points
Dataset {
    Int32 u[256][1024];
    Structure {
        Int32 x;
        Int32 y;
        Structure {
            Int32 height[1024];
            Int32 pressure[1024];
        } sounding;
    } Points[256];
} example;
Points{x;y;sounding{height[0:8:]}}[0]
Get only the first element of Points and, for that, get the fields x, y and sounding but for sounding get only every 8th element of the field height and elide the field pressure.

How Sequences fit into this syntax

The Sequence type has been added back into the DAP4 type system (aka data model) as a way to encode what the Common Data Model (CDM) encodes using varying dimensions for arrays (see DAP4: VLEN proposal). As such, Sequence will be a more general data type than in DAP2 where it was significantly limited. In DAP4 Arrays of Sequences will be allowed as will Sequence elements that are themselves Arrays. Thus, Sequences will be a completely general data type, able to hold instances of any type and able to be instances of any type - of course there are only two constructor types: Sequence and Structure. However, Sequences have not only the projection capabilities but also filtering by value.

The by-value filter is indicated by the 'pipe' symbol (|). It is followed by one or more expressions, separated by commas. The filter is true when all of the subexpressions are true. As with the brace and bracket syntactical elements, symbols in the expressions that are part of the filter may be either FQNs or identifiers in the scope of the variable with which the expression is associated.

Dataset {
    Sequence {
        Int32 x;
        Int32 y;
    } s1;

   Sequence {
        Int32 x;
        Int32 y;
    } s2[100];

    Sequence {
        Int32 x[10];
        Int32 z;
    } s3;

     Sequence {
        Int32 x[1024];
    } s4[100];
} example;
s1
All of Sequence s1.
s1{x;y}
Also all of Sequence s1.
s1{x}
every 'row' of Sequence s1, but just field x.
s1{x;y}|x<7,y<9
All of s1 where the fields x and y satisfy the given filter expression (filtering is covered in detail in a subsequent section of this document).
s2{x;y}
All one hundred elements of the Array s2. Same as s2 and s2{x;y}[0:99].
s2{x;y}[0:9]
The first ten elements of s2. That would be 10 Sequences and for each, both the fields x and y.
s2{x;y}[0:9]|x<8
The first ten elements of s2 where only the 'rows' where x<8 are included.
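
The effect of field selection and a by-value filter on a Sequence can be sketched as follows. The rows, the helper name select and the use of Python callables for predicates are all assumptions made for the illustration; a real server would evaluate the parsed CE instead.

```python
# Hypothetical rows for a Sequence with Int32 fields x and y.
s1 = [
    {"x": 3, "y": 5},
    {"x": 9, "y": 2},
    {"x": 6, "y": 12},
    {"x": 1, "y": 8},
]


def select(rows, fields, predicates=()):
    """Keep the rows for which every predicate is true, then project fields.

    Rows that fail any predicate are dropped from the result entirely.
    """
    kept = [row for row in rows if all(p(row) for p in predicates)]
    return [{name: row[name] for name in fields} for row in kept]


# s1{x;y}|x<7,y<9
result = select(s1, ["x", "y"], [lambda r: r["x"] < 7, lambda r: r["y"] < 9])
```

Here result holds only the two rows where both x < 7 and y < 9 are true; calling select(s1, ["x"]) with no predicates corresponds to s1{x}.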

Anonymous Grouping and Array Subsetting

Anonymous grouping provides a way to apply the same index subsetting to a set of arrays, as long as they are related in certain ways. For this kind of subsetting, it must be possible to take one set of index subsetting constraints and use it with every array in the set, and for this to be true, one of two conditions must be met. If all of the arrays in an anonymous group are of the same shape (i.e., they have the same rank and each dimension has the same size), then it is trivial to apply one index subset expression to each array and return the result. If, alternatively, the arrays in the anonymous group are related such that they form a grid, then it is also possible to apply the index subset to the set of arrays in a logically consistent way. Note that the term anonymous group(ing) is a riff on the idea that the notation already introduced for Structures and Sequences is a kind of grouping of the named Structure or Sequence's fields.

Example of anonymous grouping syntax

Dataset {
    Float32 lat[100][100];
    Float32 lon[100][100];
    Byte SST[100][100];
    Int64 WindSpeedU[100][100];
    Int64 WindSpeedV[100][100];
} example;
{SST;lat;lon}[0:9][10:19]
This will return SST, lat and lon subset as instructed.
{SST;lat;lon;WindSpeedU;WindSpeedV}[0:9][10:19]
Similarly, this will return the five arrays all subset with the common index subset of [0:9][10:19].
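
A minimal sketch of the same-shape case, assuming arrays are held as Python lists of lists. The helper names shape and subset_group and the sample data are hypothetical.

```python
def shape(array):
    """Return (rows, columns) of a rectangular list of lists."""
    return (len(array), len(array[0]))


def subset_group(arrays, rows, cols):
    """Apply one closed-interval index subset to every array in the group.

    arrays maps names to 2-D lists; rows and cols are (start, stop) pairs.
    """
    shapes = {shape(a) for a in arrays.values()}
    if len(shapes) != 1:
        raise ValueError("arrays in an anonymous group must share one shape")
    (r0, r1), (c0, c1) = rows, cols
    return {
        name: [row[c0:c1 + 1] for row in a[r0:r1 + 1]]
        for name, a in arrays.items()
    }


# Hypothetical 100 x 100 arrays standing in for SST, lat and lon.
grid = [[r * 100 + c for c in range(100)] for r in range(100)]
group = {"SST": grid, "lat": grid, "lon": grid}

# {SST;lat;lon}[0:9][10:19]
result = subset_group(group, (0, 9), (10, 19))
```

Each array in result is 10 x 10, because the single index subset is applied to every member of the group.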

Subsetting and Shared Dimensions

Shared dimensions provide the additional information to indicate that a group of arrays share certain relationships; that specific groups of the arrays form coverages (aka grids). If one or more coverages are given using the anonymous group notation, then a single index subsetting operation can be applied to all of the component arrays and the result returned.

Because DAP4 uses XML for its actual grammar, and because that's wordy and this document uses a mock notation, I will extend the notation used so far so it can 'define' a coverage:

  • The keyword Dimensions introduces a list of symbols and their sizes. (That is the definition of a Dimension in DAP4; a size bound to an identifier.)
  • Arrays where every dimension uses a Dimension to supply its extent are maps. Maps are the arrays that hold the domain values for a coverage.
  • Arrays that use parentheses () in place of brackets to indicate the sizes of dimensions and which use the names of maps to do so for at least one dimension hold the range values for the coverage. These are the coverage's data arrays (aka arrays, as distinct from maps).

An anonymous group contains a coverage if all of the arrays it contains are either coverage data arrays or the maps for the included data arrays. It's possible to include two data arrays, but they have to be the same shape and use the same maps.

Example of this syntax

Dataset {
    Dimensions: nlat=100, nlon=50; 
    Float32 lat[nlat];
    Float32 lon[nlon];

    // The maps ''lat'' and ''lon'' are used here and define a coverage
    Float32 temp(lon)(lat);
    Float32 sal(lon)(lat);
    Float32 O2(lat)(lon);
    Float32 CO2(lon)(lat)[10];
} shared_dimensions;

Examples of the anonymous grouping syntax and coverage subsetting by index

{lat; lon; temp} [0:9][10:19]
This will return Dimensions nlat=10, nlon=10, lat, lon and temp such that lat and lon are 10 element vectors and temp is a 10 x 10 array.
{lat; lon; temp; sal} [0:9][10:19]
Same as above, but with both temp and sal included.
{lat; lon; temp; O2} [0:9][10:19]
Error: O2 is not the same shape as temp
{lat; lon; temp; CO2} [0:9][10:19]
Error: CO2 is not the same shape as temp
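
The shape check behind the two error cases can be sketched as below. Describing each data array by a tuple of its dimensions (a map name for "(map)", an integer for a plain "[n]" dimension) is an encoding invented for this example, as is the function name check_coverage. The sketch assumes the group names at least one data array.

```python
# Dimension sizes and data arrays from the shared_dimensions dataset.
maps = {"lat": 100, "lon": 50}
data_arrays = {
    "temp": ("lon", "lat"),
    "sal": ("lon", "lat"),
    "O2": ("lat", "lon"),
    "CO2": ("lon", "lat", 10),
}


def check_coverage(names):
    """Return the shared dimensions of the data arrays named in the group.

    Raises ValueError when two data arrays do not use the same dimensions
    in the same order. Names that are maps are skipped.
    """
    selected = [n for n in names if n in data_arrays]
    first = selected[0]
    for name in selected[1:]:
        if data_arrays[name] != data_arrays[first]:
            raise ValueError(f"{name} is not the same shape as {first}")
    return data_arrays[first]
```

check_coverage(["lat", "lon", "temp", "sal"]) succeeds, while adding O2 or CO2 to the group raises an error like the ones listed above.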

Filters

While subsetting provides ways to choose data based on the structure of a dataset, filters provide a way to choose data based on their values. The values to be returned are denoted using simple predicates. When an array is filtered by value, the elements of the array that fail the filter predicate(s) will be replaced with a No Data (ND) value. The array may also have an index subset, in which case that will be applied before the filter predicates are evaluated for the remaining elements.

The general syntax for a filter expression is to follow a subset expression with a pipe (|) and one or more filter predicates. Multiple predicates are separated by commas and the value of the complete predicate is the logical AND of the comma-separated subexpressions.

Filter expressions can be applied to both Array and Sequence variables. In each case the result of the filter operation is a value of the same type as the variable. A filter applied to an Array returns an array with the same shape (the same rank and the same dimension sizes) but the elements that fail the predicate will be replaced with the variable's No Data value. For an Array, it is possible to supply a value to use for No Data if no such value is part of the dataset's metadata (or that will be used in preference to any such value). A Sequence variable is essentially a table of values and thus can be thought of as containing a number of rows, and the filter expression is applied to each row in the order those rows are provided to the expression evaluator. Every row that satisfies the predicate will be included in the value returned; those that don't will be elided from the result.

Example: Filters

Dataset {
    Float64 temp[10];
    Float32 u[256][256];
    Float32 v[256][256];
    Sequence {
        Int32 x;
        Int32 y;
    } Points;
} arrays;
temp|temp<7
Return all ten elements of temp but replace those elements that do not satisfy the predicate temp < 7 with the variable's No Data value.
temp|temp<7,ND=0
Same as the above, but use zero as the value for No Data.
u|u>10,ND=-1e-37
Return u with any element that is not > 10 set to the No Data value of -1 x 10^-37.
u|u>10, ND=NaN
Return u with any element that is not > 10 set to the No Data value of NaN.
u|u>10
Assume that the variable u has no missing_value attribute, then this will return u with any element that is not > 10 set to the No Data value of NaN (NaN is the default No Data value for Float32 and Float64 types).
u[0:2:99][0:2:99]|u>10
This shows how filtering can be combined with subsetting by index using this syntax. The index subsetting is performed first and the resulting 50 x 50 array is filtered. The return value is a 50 x 50 Float32 array with elements <= 10 replaced with NaN (or the variable's missing_value attribute).
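
The Array case can be sketched in a few lines of Python. The sample values and the helper name filter_array are made up for the illustration; the default of NaN follows the Float32/Float64 default described above.

```python
import math


def filter_array(values, predicate, nd=math.nan):
    """Return a list of the same length with failing elements set to nd."""
    return [v if predicate(v) else nd for v in values]


# Hypothetical contents for the ten-element Float64 array temp.
temp = [2.0, 9.5, 6.9, 7.0, 3.1, 12.0, 0.5, 8.2, 6.0, 7.5]

# temp|temp<7,ND=0
result = filter_array(temp, lambda v: v < 7, nd=0)
```

The result still has ten elements; only the values that fail temp < 7 have been replaced, which is what distinguishes filtering an Array from filtering a Sequence.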

Filters and more complex data types

The basic syntax for filters is a subsetting expression, a literal | and then one or more filter predicates. This syntax can appear any place a selection expression can appear, so it can be used inside braces when an Array or Sequence is a field of a Structure or Sequence. Some examples follow.

Example: Filters on complex types

Dataset {
    Sequence {
        Int32 x[100];
        Int32 y;
    } Points1;

    Sequence {
        Int32 x;
        Int32 y;
        Sequence {
            Int32 depth;
            Int32 temp;
        } sounding;
    } Points2[20];

    Structure {
        Int32 x;
        Int32 y;
        Sequence {
            Int32 depth;
            Int32 temps[4];
        } raw;
    } Points3[100];

} complex_types_example;
Points1{x[0:9]|x<10,ND=-127;y}|y<3
For the Sequence Points1, return the rows of data where y is less than 3. In those rows, subset x so that only the first ten elements are included and then filter that so that elements >= 10 are replaced by the No Data value of -127.
Points2[10:19] { x; y; sounding { } | depth > 10 } | 20 < x < 40, y <35
This selection expression first finds the index subset of Points2 and arranges to return the fields x, y and sounding where x and y satisfy the filter predicate. For the field sounding, which is a Sequence itself, it will return both fields and all rows where depth is > 10. This example points out an important aspect of the syntax and of expression evaluation: the filter predicates are evaluated after the index and variable and/or field subsetting. The complete filter predicates can be evaluated in any order (i.e., the 20 < x < 40, y <35 and depth > 10 predicates can happen in any order). The order of evaluation of the filter predicate subexpressions (i.e., 20 < x < 40 and y <35) is also unspecified. Another way to write this selection expression is...
Points2[10:19] | 20 < x < 40, y <35, sounding.depth > 10
The same result as above...
Points3[3:2:8] {x; y; raw{temps[2] | temps[2]>7,ND=-1}}
In this expression the temps field of the Sequence raw is still an Array, it's just an Array with a single element, which illustrates that neither the subsetting nor filtering operations alter the types of the variables.

Filters and grid variables

A function is a mapping from a set of domain values to a set of range values; in a discrete function, these sets are finite. A discrete coverage is a (discrete) function where the indices of the arrays that hold the domain and range values have a one-to-one and onto mapping, with the important exception that in cases where the dimension of the arrays containing the domain values can be reduced without losing information, that is done. This is purely an implementation optimization, but where applicable, it is nearly universally used. In DAP4 we call discrete coverages grids.

The anonymous grouping notation used to uniformly apply a set of constraints to grids can be extended to include filters. Because a grid essentially has two sets of values, and because selection by value is so important for this data type, there are two kinds of filters that can be applied to a coverage: value-based filtering and rubber-band filtering.

The same filtering operations that can be applied to simple arrays can be applied to a grid by simply following the braces that name the grid's components with the filtering operations. Only the named arrays will be subject to filtering, and like the case for simple arrays, elements that fail the predicates will be replaced with the No Data value. However, a second filtering operation is supported for grids where the values of the maps can be 'rubber-banded' using a filter and then the resulting information used to form an index subset for the grid (i.e., both the array and the maps). This operation acknowledges an operation that many users routinely perform. At the same time, this subsetting does not take into account domain-specific aspects of the data in the maps (e.g., taking into account the specific nature of latitude and longitude). In this kind of filtering, rectangular bounding boxes are derived from the result of applying the filter predicates. For each map, the bounding box that completely subsumes all of the elements that satisfy the predicate is formed. Then, for the array, either the union or the intersection of those bounding boxes is formed and the resulting index subset is applied to all of the arrays in the anonymous grouping.

Example: Simple filters on grids

The array filtering operation previously described can easily be applied to the components of a grid. Below we show two cases: one where the maps are vectors and one where they are two-dimensional arrays.

Dataset {
	Dimensions: nlat=100, nlon=50; // the sizes of the dimensions
	Float32 lat[nlat];             // This is a (potential) map
	Float32 lon[nlon];           // As is this 

	Float32 temp(lon)(lat);     // The maps ''lat'' and ''lon'' are used here and define a ''grid'' (aka ''coverage'')
} vector_maps;
{lat;lon;temp} | lat < 20, -100 < lon < -80, ND=-255
Return all of lat, lon and temp where lat and lon have been filtered as per the predicates. The values of temp are not altered.
{lat;lon;temp} [ 0 : 20][0 : 20] | lat < 20, -100 < lon < -80, ND=-255
The same as above, but the maps lat and lon and the array temp are subset using the index subsetting expression and the resulting arrays are filtered.
{lat;lon;temp} [ 0 : 20][0 : 20] | temp > 7, lat < 20, -100 < lon < -80, ND=-255
Same as above, but now temp is filtered too. This illustrates that even though there is a kind of binding between the maps and the array in a grid, it does not extend to this kind of filtering operation.
Dataset {
	Dimensions: x=100, y=50; // the sizes of the dimensions
	Float32 lat[x][y];             // This is a (potential) map
	Float32 lon[x][y];           // As is this 

	Float32 temp(lon)(lat);     // The maps ''lat'' and ''lon'' are used here and define a ''grid'' (aka ''coverage'')
} two_dim_maps;
{lat;lon;temp} | lat < 20, -100 < lon < -80, ND=-255
{lat;lon;temp} [ 0 : 20][0 : 20] | lat < 20, -100 < lon < -80, ND=-255
{lat;lon;temp} [ 0 : 20][0 : 20] | temp > 7, lat < 20, -100 < lon < -80, ND=-255
These examples repeat the above three, but show that the same syntax applies to the case where the maps are N-dimensional (in this case N == 2).

Example: Filters that subset grids

Dataset {
	Dimensions: nlat=100, nlon=50; // the sizes of the dimensions
	Float32 lat[nlat];             // This is a (potential) map
	Float32 lon[nlon];           // As is this 

	Float32 temp(lon)(lat);     // The maps ''lat'' and ''lon'' are used here and define a ''grid'' (aka ''coverage'')
} vector_maps;
Dataset {
	Dimensions: x=100, y=50; // the sizes of the dimensions
	Float32 lat[x][y];             // This is a (potential) map
	Float32 lon[x][y];            // As is this 

	Float32 temp(lon)(lat);     // The maps ''lat'' and ''lon'' are used here and define a ''grid'' (aka ''coverage'')
} two_dim_maps;
{lat;lon;temp} | lat << 20, -110 << lon << -80
For the grid consisting of lat, lon and temp, return the (index) subset of the three arrays by computing that index subset as follows: Find the smallest circumscribing bounding box for lat such that all values less than 20 are included. As with simple filtering, the values that are not less than 20 are replaced with the No Data value. Perform the same operations on lon, filtering as for the predicate -110 < lon < -80. Then form the intersection of these bounding boxes and use the resulting bounding box as the index subset for the grid. Note that the maps lat and lon may be returned with No Data values in some of the elements.
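
For the vector-map case, the rubber-band operation on a single map can be sketched as follows. The function names, the sample latitudes and the No Data value of -255 are assumptions for the example (the CE above does not name a No Data value), and the two-dimensional-map case is not covered.

```python
def bounding_range(values, predicate):
    """Smallest closed index range (first, last) covering every element that
    satisfies the predicate, or None when no element does."""
    hits = [i for i, v in enumerate(values) if predicate(v)]
    return (hits[0], hits[-1]) if hits else None


def rubber_band(values, predicate, nd):
    """Index-subset a vector map to its bounding range, then replace the
    elements inside that range that fail the predicate with nd."""
    box = bounding_range(values, predicate)
    if box is None:
        return None, []
    first, last = box
    return box, [v if predicate(v) else nd for v in values[first:last + 1]]


# Hypothetical, non-monotonic latitude values.
lat = [30.0, 18.0, 25.0, 12.0, 5.0, 40.0]

# lat << 20
box, subset = rubber_band(lat, lambda v: v < 20, nd=-255)
```

Because the values are not monotonic, the bounding range (indexes 1 to 4) includes one element that fails the predicate, and that element is returned as the No Data value. The range found for each map would then be used as the index subset for the matching dimension of temp.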

Grammar

expression :== selections

selections :== selection | selection ';' selections

selection :== subset | subset '|' filter

subset :== id | id indexes | id fields | id indexes fields | fields indexes

indexes :== '[' ']' | '[' INT ']' | '[' INT ':' INT ']' | '[' INT ':' INT ':' INT ']' | '[' INT ':' ']' | '[' INT ':' INT ':' ']' (see note 1)

fields :== '{' selections '}'

filter :== predicate | predicate ',' filter

predicate :== id relop constant | constant relop id | constant relop id relop constant | 'ND' '=' constant (see note 2)

constant :== INT | FLOAT | STRING

relop :== '=' | '!=' | '<' | '<=' | '>' | '>=' | '~=' | '<<' | '>>' (see note 3)

id :== groups vars | vars

groups :== '/' | '/' name groups

vars :== name | name '.' vars

name :== WORD | STRING (see note 4)

Notes:

  1. The document originally included [ : INT ] and [ : INT : INT ] but, if our users cannot figure out that the first element of an index that starts at zero is zero ... Including a syntax for 'to the end' is useful since the same expression can be used when the array size changes. This might be useful in aggregations. For the INTs in this rule, no negative values are allowed.
  2. The fourth alternative has some context sensitive rules that need to be tested for the parse to succeed: It can only occur for an array and the type of the value must match the type of the array (number to Int32, Float64, etc.; string to String or Url) and there can only be one instance per array.
  3. The relop ~= is a regex match and applies only to string types. The << and >> operators perform filtering on less than and greater than as well as finding the smallest circumscribing bounding box.
  4. A STRING is text, including spaces, dots and slashes and pretty much anything else, wrapped in double quotes. Double quotes in the content of the STRING can be escaped by a backslash (\) and, yes, the backslash can be escaped by itself.
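
As an informal companion to the filter and predicate rules, the sketch below tokenizes a filter and groups the tokens into predicates. It is an assumption-laden illustration, not a parser for the full grammar: the names tokenize and parse_filter are hypothetical, WORD is taken to be any run of characters that are not whitespace, operators, commas or double quotes, and the shape of each predicate is not validated.

```python
import re

# Alternatives in order: a double-quoted STRING with backslash escapes,
# two-character operators, single-character operators and the comma,
# and finally a run of other characters (a WORD, INT or FLOAT).
TOKEN = re.compile(
    r'\s*("(?:\\.|[^"\\])*"|<<|>>|<=|>=|!=|~=|[<>=,]|[^\s<>=!~,"]+)'
)


def tokenize(text):
    """Split a filter into ids, constants, operators and commas."""
    tokens, pos = [], 0
    text = text.strip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"cannot tokenize {text[pos:]!r}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse_filter(text):
    """Return the filter as a list of predicates, each a list of tokens."""
    predicates, current = [], []
    for token in tokenize(text):
        if token == ",":
            predicates.append(current)
            current = []
        else:
            current.append(token)
    predicates.append(current)
    return predicates
```

For example, parse_filter("20 < x < 40, y <35, ND=-1") yields three predicates, and a comma inside a quoted STRING does not split a predicate because the whole STRING is matched as one token.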

Discussion