Aggregation enhancements: Difference between revisions

From OPeNDAP Documentation
⧼opendap2-jumptonavigation⧽
No edit summary
Line 1: Line 1:
In response to requests from NASA, and with their support, we have added two new kinds of aggregation to Hyrax. Both of these aggregation operations provide a way for client software to specify the granules that will be used to build the aggregate result. While our existing aggregation interface, based on NcML, works well for NASA's level 3 data products, it is all but useless for level 2 'swath data'. These aggregation functions are specifically designed to work with satellite swath data without being limited to just swath data and are explicitly intended for use with search interfaces that have knowledge of the individual files that make up typical satellite data sets (often called a ''dataset inventory'').
In response to requests from NASA, and with their support, we have added two new kinds of aggregation to Hyrax. Both of these aggregation operations provide a way for client software to specify the granules that will be used to build the aggregate result. While our existing aggregation interface, based on NcML, works well for NASA's level 3 data products, it is all but useless for level 2 ''swath data''. These aggregation functions are specifically designed to work with satellite swath data without being limited to just swath data and are explicitly intended for use with search interfaces that have knowledge of the individual files that make up typical satellite data sets (often called a ''dataset inventory'').


== Overview of the new capability ==
== Overview of the new capability ==
Summary: This service provides value-based subsetting for satellite ''swath data''. It's applicable to lots of other kinds of data, but works best with data that meet certain requirements.
Providing search results that include explicit references to hundreds or thousands of discrete files has been the only option for many search interfaces up to this point. This is especially when the datasets holds satellite swath data because swath data are not easily aggregated. For this interface to Hyrax's aggregation software, we provide two kinds of responses: Data in multiple files that are bundled together using an zip archive and data in tabular form. For clients that request the aggregate result in a zip file, given a request for values from N files, there will be N entries in the resulting zip archive. Some of these entries may simply indicate that no data matching the spatial or other constraints were found. While the source data files can be in any format that the Hyrax server can read, the response will be either netCDF3, netCDF4 or ASCII. The netCDF3/4 files returned will conform to CF 1.6 to the extent possible (the underlying data files may lack information CF 1.6 requires). For clients that request data in tabular form, the data from N files will be returned in one ASCII CSV response. These values can be easily assimilated by database systems, Excel and other tools.
Providing search results that include explicit references to hundreds or thousands of discrete files has been the only option for many search interfaces up to this point. This is especially when the datasets holds satellite swath data because swath data are not easily aggregated. For this interface to Hyrax's aggregation software, we provide two kinds of responses: Data in multiple files that are bundled together using an zip archive and data in tabular form. For clients that request the aggregate result in a zip file, given a request for values from N files, there will be N entries in the resulting zip archive. Some of these entries may simply indicate that no data matching the spatial or other constraints were found. While the source data files can be in any format that the Hyrax server can read, the response will be either netCDF3, netCDF4 or ASCII. The netCDF3/4 files returned will conform to CF 1.6 to the extent possible (the underlying data files may lack information CF 1.6 requires). For clients that request data in tabular form, the data from N files will be returned in one ASCII CSV response. These values can be easily assimilated by database systems, Excel and other tools.



Revision as of 16:24, 7 April 2015

In response to requests from NASA, and with their support, we have added two new kinds of aggregation to Hyrax. Both of these aggregation operations provide a way for client software to specify the granules that will be used to build the aggregate result. While our existing aggregation interface, based on NcML, works well for NASA's level 3 data products, it is all but useless for level 2 swath data. These aggregation functions are specifically designed to work with satellite swath data without being limited to just swath data and are explicitly intended for use with search interfaces that have knowledge of the individual files that make up typical satellite data sets (often called a dataset inventory).

Overview of the new capability

Summary: This service provides value-based subsetting for satellite swath data. It's applicable to lots of other kinds of data, but works best with data that meet certain requirements.

Providing search results that include explicit references to hundreds or thousands of discrete files has been the only option for many search interfaces up to this point. This is especially when the datasets holds satellite swath data because swath data are not easily aggregated. For this interface to Hyrax's aggregation software, we provide two kinds of responses: Data in multiple files that are bundled together using an zip archive and data in tabular form. For clients that request the aggregate result in a zip file, given a request for values from N files, there will be N entries in the resulting zip archive. Some of these entries may simply indicate that no data matching the spatial or other constraints were found. While the source data files can be in any format that the Hyrax server can read, the response will be either netCDF3, netCDF4 or ASCII. The netCDF3/4 files returned will conform to CF 1.6 to the extent possible (the underlying data files may lack information CF 1.6 requires). For clients that request data in tabular form, the data from N files will be returned in one ASCII CSV response. These values can be easily assimilated by database systems, Excel and other tools.

Intended audience

This service was originally intended for software developers working data search tools who need to be able to return results that encompass hundreds or thousands of granules. It works best from a programmatic interface, but it's certainly open to end users, see the examples using curl for one way to access the service.

Accessing the Aggregation Services

This 'service' is accessed using HTTP's GET or POST methods. In this documentation I will describe how to use POST to send information, but the same key-value parameters can be sent using the GET method, albeit within the character limits of a URL (which vary depending on implementation).

The service is accessed using the following set of key-value parameters:

operation
Use operation to select from various kinds of responses. The form of the response also determines how the aggregation is built. The current values for this parameter are: version which returns information about the service's version; file returns a collection of files; netcdf3, netcdf4, ascii all translate the underlying granule format to netcdf3, etc., and return that collection of translated files; csv returns data from many granules as a single table of data, using Comma Separated Values (csv) format. More information about this is given below.
file
The URL path component to a granule served by Hyrax. This parameter will appear once for each file in the aggregation.
var
A comma-separated list of variables to include in the files returned when using operation equal to netcdf3, netcdf4, ascii, or csv
bbox
Limit the values returned to those that fall within a bounding box for a given variable. Like var, this applies only to netcdf3, netcdf4, ascii, or csv
How to use these parameters

The operation and file parameters are the key to the service. By listing multiple files, you can explicitly control which files are accessed and the order of that access. The operation parameter provides a way to choose between a zipped response with many files either in their native format (file) or in one of three well known representations (netcdf3, netcdf4 or ascii).

While a complete request can make use of only the operation and file parameters, adding the variable and value subsetting can provide a much more manageable response. The var and bbox parameters can appear either once or N times where N is the number of time the file parameter appears. In the first case, the values of the single instances of var and/or bbox are applied to every file/granule listed in the request. In the second case the value of var1 is used with file1, var2 with file2, and so on up to varN and fileN. The same is true of the bbox parameter. Furthermore, these parameters act independently, so a request can use one value for var and N values for bbox or vice versa.

More about var

The var parameter is a comma-separated list of variables in the files listed in the request. Each of the variables must be named just as it is in the DAP dataset. If you're getting errors from the service that 'No such variable exists in the dataset ...', use a web browser or curl to look at one of the granules and see what the exact name is. For many NASA dataset, these names can be quite long and have several components, separated by dots. One way to test the name is to build a URL to the file and use the getdap (part of the libdap software package) tool like this

getdap -d <url> -c

If this returns an error, look at the DDS or DMR from the dataset and figure out the correct name. Do that using

getdap -d <url> or
getdap4 -d <url>
More about bbox

The bbox parameter is probably the most powerful of all of the parameters in terms of its ability to select specific data values. It has two different modes, one when used with the zip-formatted responses (i.e., operation is netcdf3, netcdf4 or ascii) and another when its used with operation equal to csv. In either case, bbox is used to select a range of values of a particular variable. The format for bbox values is a series of range requests surrounded by double quotes, where each range request has the following form

[ <lower value> , <variable name> , <upper value> ]

A n example looks like

&bbox="[49,Latitude,50][167,Longitude,170]"

When multiple range requests are given, the resulting subset is the union of the bounding boxes that contain the values for each of the variables given. While the notion of a bounding box is of a two-dimensional object, bbox can form bounding boxes for an array or arbitrary rank. However, when a particular bbox values lists more than one range request, the rank of each of the variable must be the same.

The bbox operation is special, however, because the range limitation applies not only to the variable listed, but to any other variables in the request that share dimensions with those variables. Thus, for a dataset that contains 'Latitude, longitude and Optical_Depth where all have the shared dimensions x and y, the bbox parameter will choose values of Latitude and Longitude within the given values and then apply the resulting bounding box to those variables and any other variables. Note that the variables in the bbox range requests must also be listed in the var parameter if you want their values to be returned

Response formats

This service will either return a collection of files bundled in a zip archive or it will return a since CSV/text file. When operation is file, netcdf3, netcdf4 or ascii, the service will take each of the files as they are retrieve or built and put them in a zip archive that it streams back to the client. The ZIP64(tm) format extensions are used to overcome the size limitations of the original ZIP format.

For the csv operation, the response is a single CSV/text file.

Performance

Performance is linear in terms of the number of granules. The response is streamed as it is built, so even very large responses use only a little memory on the server. Of course, that won't be the case on the client...

Requesting collections of files

This returns a Zip file containing a number of resources read/produced by the Hyrax BES. The ZIP64(tm) format extensions are used to overcome the size limitations of the original ZIP format.It can handle a list of resources (typically files) and simply return them, unaltered or translated into netCDF files. In the later case, a constraint expression can be applied to each resource before the transformation takes place, limiting the variables and/or parts of variables in the resulting netCDF file. Note that for the netCDF format response to work, the BES must be able to read the format of the original resource (e.g., HDF4).

Requesting tabular results

Server functions used