Aggregation enhancements

From OPeNDAP Documentation

Revision as of 17:13, 9 April 2015

In response to requests from NASA, and with their support, we have added two new kinds of aggregation to Hyrax. Both of these aggregation operations provide a way for client software to specify the granules that will be used to build the aggregate result. While our existing aggregation interface, based on NcML, works well for NASA's level 3 data products, it is all but useless for level 2 swath data. These aggregation functions are specifically designed to work with satellite swath data without being limited to just swath data and are explicitly intended for use with search interfaces that have knowledge of the individual files that make up typical satellite data sets (often called a dataset inventory).

Overview of the new capability

Summary: This service provides value-based subsetting for satellite swath data. It's applicable to lots of other kinds of data, but works best with data that meet certain requirements.

Providing search results that include explicit references to hundreds or thousands of discrete files has been the only option for many search interfaces up to this point. This is especially true when the datasets hold satellite swath data, because swath data are not easily aggregated. For this interface to Hyrax's aggregation software, we provide two kinds of responses: data in multiple files that are bundled together using a zip archive, and data in tabular form. For clients that request the aggregate result in a zip file, given a request for values from N files, there will be N entries in the resulting zip archive. Some of these entries may simply indicate that no data matching the spatial or other constraints were found. While the source data files can be in any format that the Hyrax server can read, the response will be either netCDF3, netCDF4 or ASCII. The netCDF3/4 files returned will conform to CF 1.6 to the extent possible (the underlying data files may lack information CF 1.6 requires). For clients that request data in tabular form, the data from N files will be returned in one ASCII CSV response. These values can be easily assimilated by database systems, Excel and other tools.

Intended audience

This service was originally intended for software developers working on data search tools who need to be able to return results that encompass hundreds or thousands of granules. It works best from a programmatic interface, but it's certainly open to end users; see the examples using curl for one way to access the service.

Accessing the Aggregation Services

This 'service' is accessed using HTTP's GET or POST methods. In this documentation I will describe how to use POST to send information, but the same key-value parameters can be sent using the GET method, albeit within the character limits of a URL (which vary depending on implementation).

The service is accessed using the following set of key-value parameters:

operation
Use operation to select from various kinds of responses. The form of the response also determines how the aggregation is built. The current values for this parameter are: version, which returns information about the service's version; file, which returns a collection of files; netcdf3, netcdf4 and ascii, which all translate the underlying granule format to netcdf3, etc., and return that collection of translated files; and csv, which returns data from many granules as a single table of data, using Comma Separated Values (csv) format. More information about this is given below.
file
The URL path component to a granule served by Hyrax. This parameter will appear once for each file in the aggregation.
var
A comma-separated list of variables to include in the files returned when using operation equal to netcdf3, netcdf4, ascii, or csv.
bbox
Limit the values returned to those that fall within a bounding box for a given variable. Like var, this applies only to netcdf3, netcdf4, ascii, or csv.

How to use these parameters

The operation and file parameters are the key to the service. By listing multiple files, you can explicitly control which files are accessed and the order of that access. The operation parameter provides a way to choose between a zipped response with many files either in their native format (file) or in one of three well known representations (netcdf3, netcdf4 or ascii).

While a complete request can make use of only the operation and file parameters, adding the variable and value subsetting can provide a much more manageable response. The var and bbox parameters can appear either once or N times, where N is the number of times the file parameter appears. In the first case, the values of the single instances of var and/or bbox are applied to every file/granule listed in the request. In the second case, the value of var1 is used with file1, var2 with file2, and so on up to varN and fileN. The same is true of the bbox parameter. Furthermore, these parameters act independently, so a request can use one value for var and N values for bbox or vice versa.
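The pairing rule described above can be sketched in a few lines of Python. This is only an illustration of how a client might assemble the key-value parameters; the helper name build_request_params and the granule paths are hypothetical, and the sketch assumes that repeated parameters are paired with files by their order of appearance. It builds a form-encoded body and does not contact any server.

```python
from urllib.parse import urlencode


def build_request_params(operation, files, variables=None, bboxes=None):
    """Return the ordered key-value pairs for one aggregation request.

    variables and bboxes may each be None, hold one value (applied to every
    file), or hold exactly one value per file. The two are checked
    independently of each other.
    """
    params = [("operation", operation)]
    params += [("file", path) for path in files]
    for key, values in (("var", variables), ("bbox", bboxes)):
        values = values or []
        if len(values) not in (0, 1, len(files)):
            raise ValueError(
                f"{key} must appear once or {len(files)} times, got {len(values)}"
            )
        params += [(key, value) for value in values]
    return params


# Hypothetical granule paths, for illustration only.
files = ["/data/granule_1.h5", "/data/granule_2.h5"]
params = build_request_params(
    "netcdf3",
    files,
    variables=["Latitude,Longitude,Optical_Depth"],  # one var for all files
    bboxes=['"[49,Latitude,50]"', '"[50,Latitude,51]"'],  # one bbox per file
)
body = urlencode(params)  # form-encoded text suitable for an HTTP POST body
```

Here a single var value is shared by both files while each file gets its own bbox value, which is the "one value for var and N values for bbox" case.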

Response formats

This service will either return a collection of files bundled in a zip archive or it will return a single CSV/text file. When operation is file, netcdf3, netcdf4 or ascii, the service will take each of the files as they are retrieved or built and put them in a zip archive that it streams back to the client. The ZIP64(tm) format extensions are used to overcome the size limitations of the original ZIP format.

For the csv operation, the response is a single CSV/text file.
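As a rough sketch of what a client might do with a zipped response, the following Python lists the entries of an archive held in memory. Python's standard zipfile module reads archives that use the ZIP64 extensions without extra configuration. The archive here is built locally as a stand-in for a real response, and the entry names are made up; the names the service actually uses are not described in this document.

```python
import io
import zipfile


def list_zip_entries(zip_bytes):
    """Return (name, size in bytes) for each entry in a zip archive held in memory."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return [(info.filename, info.file_size) for info in archive.infolist()]


# Stand-in for a service response: a small archive built locally with
# made-up entry names, so the example runs without a server.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("granule_1.nc", b"first")
    archive.writestr("granule_2.nc", b"second!")

entries = list_zip_entries(buffer.getvalue())
```

A request for N files should yield N entries, so comparing the length of this list with the number of file parameters sent is a simple sanity check.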

More about var

The var parameter is a comma-separated list of variables in the files listed in the request. Each of the variables must be named just as it is in the DAP dataset. If you're getting errors from the service that 'No such variable exists in the dataset ...', use a web browser or curl to look at one of the granules and see what the exact name is. For many NASA datasets, these names can be quite long and have several components, separated by dots. One way to test the name is to build a URL to the file and use the getdap tool (part of the libdap software package) like this:

getdap -d <url> -c

If this returns an error, look at the DDS or DMR from the dataset and figure out the correct name. Do that using

getdap -d <url>

or

getdap4 -d <url>

More about bbox

The bbox parameter is probably the most powerful of the parameters in terms of its ability to select specific data values. It has two different modes, one when used with the zip-formatted responses (i.e., operation is netcdf3, netcdf4 or ascii) and another when it's used with operation equal to csv. However, there are some things that are common to both uses of the parameter. In either case, bbox is used to select a range of values for a particular variable or a set of variables. A bbox request has the following form

[ <lower value> , <variable name> , <upper value> ]

for each variable in the subset request. If more than one variable is included, use a series of range requests surrounded by double quotes. An example bbox request looks like

&bbox="[49,Latitude,50][167,Longitude,170]"

which translates to "for the variable Latitude, return only values between 49 and 50 and for the variable Longitude return only values between 167 and 170". Note that the example here uses two variables named Latitude and Longitude, but any variables in the dataset could be used.
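For clients that build bbox values programmatically, a minimal Python sketch of the format described above follows. The helper names format_bbox and parse_bbox are hypothetical, and the parser assumes numeric bounds and variable names that contain no commas or square brackets.

```python
import re

# One range request: [ <lower value> , <variable name> , <upper value> ]
RANGE_REQUEST = re.compile(
    r"\[\s*([^,\]]+?)\s*,\s*([^,\]]+?)\s*,\s*([^,\]]+?)\s*\]"
)


def format_bbox(range_requests):
    """Build a bbox value from (lower, variable name, upper) tuples."""
    parts = [f"[{lower},{name},{upper}]" for lower, name, upper in range_requests]
    return '"' + "".join(parts) + '"'


def parse_bbox(value):
    """Split a bbox value back into (lower, variable name, upper) tuples."""
    return [
        (float(lower), name, float(upper))
        for lower, name, upper in RANGE_REQUEST.findall(value)
    ]


bbox = format_bbox([(49, "Latitude", 50), (167, "Longitude", 170)])
ranges = parse_bbox(bbox)
```

With the inputs shown, bbox holds the same text as the example value above, including the surrounding double quotes.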

The bbox operation is special, however, because the range limitation applies not only to the variable listed, but to any other variables in the request that share dimensions with those variables. Thus, for a dataset that contains Latitude, Longitude and Optical_Depth where all have the shared dimensions x and y, the bbox parameter will choose values of Latitude and Longitude within the given values and then apply the resulting bounding box to those variables and any other variables that use the same named dimensions as those variables. The named (i.e., shared) dimensions form the linkage between the subsetting of the variables named in the bbox value subset operation and the other variables in the list of vars to return.

You can find out if variables in a dataset share named dimensions by looking at the DDS (DAP2) or DMR (DAP4) for the dataset. Note that for DAP4, in the example used in the previous paragraph, Latitude, Longitude and Optical_Depth form a 'coverage' where Latitude and Longitude are the 'domain' and Optical_Depth is the 'range'.

Note that the variables in the bbox range requests must also be listed in the var parameter if you want their values to be returned.

The next two sections describe how the return format (zipped collection of files or CSV table of data) affects the way the bbox subset request is interpreted.

bbox & zip-formatted returns

When the Aggregation Service is asked to provide a zipped collection of files (operation = netcdf3, netcdf4 or ascii), the resulting data is stored as N-dimensional arrays in those kinds of responses. This limits how bbox can form subsets, particularly when the values are in the form of 'swath data.' For this request type, bbox forms a bounding box for each variable in the list of range requests and then forms the union of those bounding boxes. For swath data, this means that some extra values will be returned, both because the data rarely fit perfectly in a box for any given domain variable and because the union of those two (imperfect) subsets usually results in some data that are actually in neither bounding box. The bbox operation (which maps to a Hyrax server function) was designed to be liberal in applying the subset, so as to include all data points that meet the subset criteria at the cost of including some that don't. The alternative would be to exclude some matching data. Similarly, the bounding box for the set of variables is the union for the same reason. Hyrax contains server functions that can form both the union and intersection of several bounding boxes returned by the bbox function.
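The effect of taking the union can be illustrated with a small Python sketch. This is not Hyrax's implementation; it is a plain-Python model of the idea, assuming a bounding box is the smallest rectangle of array indices that contains every matching value. The function names and the 4x4 grids of values are made up for the illustration.

```python
def bounding_box(grid, lower, upper):
    """Smallest index box (row_min, row_max, col_min, col_max) that contains
    every cell of grid whose value lies in [lower, upper]; None if no match."""
    hits = [
        (r, c)
        for r, row in enumerate(grid)
        for c, value in enumerate(row)
        if lower <= value <= upper
    ]
    if not hits:
        return None
    rows = [r for r, _ in hits]
    cols = [c for _, c in hits]
    return (min(rows), max(rows), min(cols), max(cols))


def union(box_a, box_b):
    """Smallest index box that contains both boxes."""
    return (
        min(box_a[0], box_b[0]),
        max(box_a[1], box_b[1]),
        min(box_a[2], box_b[2]),
        max(box_a[3], box_b[3]),
    )


def contains(box, r, c):
    """True if cell (r, c) lies inside the index box."""
    return box[0] <= r <= box[1] and box[2] <= c <= box[3]


# Made-up values standing in for two swath variables on the same x/y grid.
latitude = [
    [45, 46, 47, 48],
    [46, 49.2, 50.5, 51],
    [47, 49.4, 49.8, 52],
    [48, 51, 52, 53],
]
longitude = [
    [160, 161, 162, 163],
    [161, 162, 168, 172],
    [162, 163, 169, 173],
    [163, 164, 169.5, 174],
]

lat_box = bounding_box(latitude, 49, 50)      # rows 1-2, columns 1-2
lon_box = bounding_box(longitude, 167, 170)   # rows 1-3, column 2
combined = union(lat_box, lon_box)            # rows 1-3, columns 1-2
```

In this model the Latitude box already includes one cell (row 1, column 2, value 50.5) that is outside the requested range, and the combined box includes the cell at row 3, column 1, which lies in neither individual box. Both are examples of the extra values described above.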

bbox & the csv response

The

Performance

Performance is linear in terms of the number of granules. The response is streamed as it is built, so even very large responses use only a little memory on the server. Of course, that won't be the case on the client...