Getting a tar ball instead of a CSV response

Point Of Contact: James

Description

Satellite_Swath_2

Goal

NB: This use case is very similar to the Satellite_Swath_1 use case. The differences in the goals are highlighted in bold.

The user wants to get data from several granules of a many-granule data set using a single request. The user is willing to iterate over several response objects/files so long as those files are delivered in a single response and are netCDF files. However, this aggregation operation must work with Swath data from a satellite, which poses special problems. Furthermore, the aggregation should require little or no effort by the data provider to configure.

Summary

Principal actor: A person who wants to access level 2 satellite swath data from a number of granules using a single URL. The person may be an end user who issues a data request or they may be the developer of a system that will build a URL to return to an end user. In either case, an end-user dereferences the URL to get data.

Goal: Eliminate users having to use many URLs to get these data.

The data are Satellite Swath data (e.g., MODIS, level 2) where each granule contains a number of dependent variables that are matched to lat and lon arrays, all of which are 2D (this is the general case for a 2D discrete coverage). The data set is made up of a number of these granules, organized in a hierarchy of directories by date. Each of the granules holds one 'pass' of the satellite from one pole to another. When a user searches for data, they use a spatial and time box to choose values from the whole data set. Currently the response to that search is a list of granules, where each contains some of the data in the user's query. The complete granules, or subset versions, may be downloaded. In either case, the user must access/download a number of files and then combine the values somehow, with the exact mechanism, including reading the files that contain the data, left up to the end user.

In this use case, the user instead uses a single URL to access all of the data that match the selection criteria. Those values will be returned using a collection of netCDF files bundled in a tar.gz file, so the file-decoding burden placed on the user is very limited. The URL will be one that a system developer can easily build in software as well as one that a person could write by hand, at least in many cases.

Actors

A developer building a 'response URL' from a user's data search request
This person must write software that will translate the search request into the Hyrax server's (to-be developed) aggregation request syntax.
A person making a data request to the Hyrax server
This actor will need a similar understanding of the new aggregation mechanism as the 'developer' but will be writing the URLs 'by hand'.
The Hyrax data server
This is an important actor because the server will have to be updated to realize the new software's benefits.
Level 2 satellite swath data, stored in multiple files (file == granule == 'pass')
This use case is limited to a particular kind of data, although the resulting software might be useful for Time Series data too.

Preconditions

The data to be accessed are served using Hyrax
Yup
We may require that the CF option of the HDF4 handler be turned on
This will make the data values much easier to use because access will be as if we're working with a netCDF3 file - no hierarchy to translate, no weird Sequence of Structure variables that correlate to dependent var dimensions.
The Hyrax server has been updated to include the aggregation function
Ditto, but this will require that system admins install the code.
The user's software understands the structure of the returned data values
Again, Yup. In this use case that means the data are returned in netCDF files, all of which are wrapped up in a tar ball that is compressed using gzip.
The data are indexed by a search system
This is the main use case: the scenario where a search system that would normally return a list of URLs instead returns a single URL that calls the aggregation function and returns all of the granules' data in one shot.
The user invoking the aggregation function understands the organization of the data
This is actually a variant of the main use case because in this scenario the user, not a search system, builds the URL that calls the aggregation function.

Triggers

  • A user searches for data and the result indicates that the data they want are spread over a number of granules.
  • A user knows they want data that are (or may be) spread over a number of granules in a dataset.

Basic Flow

A user (the actor that initiates the use case) performs a search using EDSC and the result set contains two or more granules. The search client would normally return a list of URLs to the discrete granules that make up the result set. However, in this use case, the client has been programmed to recognize this situation and will respond by forming a URL that will run the aggregation server-side operation and request that the aggregated data be returned as a collection of netCDF files bundled in a compressed tar file.

The server's internal software will build the subset response for each granule; the server will encode those responses as netCDF files and bundle them into a compressed tar file when that is what the request URL asks for.

To formulate a request, the client will need to provide three pieces of information:

  • The granules to consider when building the aggregation - explicitly enumerated;
  • The (dependent) variables within those granules to include in the aggregation;
  • The space and time bounds (the independent variables) that will be used to constrain the values of the dependent variables; and
  • Assumption: The independent and dependent variables are all present in the granules.

These parameters will be passed to the server using some sort of constraint. Since each granule must be listed separately, POST will be used to make the request. Because we know that the aggregation operation will be working with Level 2 satellite swath (geospatial and temporal) data, some optimizations can be made. We know that latitude, longitude and time are encoded in these granules for each sample point. Thus the cases where time is encoded in an attribute or the granule name don't need to be addressed. (They might be addressed by a future version of the code, however.)

Because of variations in time representation, we may adopt ISO 8601 as the only way to specify time (e.g., 2015-02-05T19:44:00Z).

The web service end point used to access the aggregation will take a Request Document using HTTP POST. The request will contain the list of granules in its body, one granule per line. The remaining parameters will be passed in using the HTTP Query String (i.e., as keyname-value pairs).

Using this web service will look something like
http://host/server/aggregator?d4_func=table(vars)&d4_ce=constraints&return_as=netcdf

where vars and sub-expressions are:

vars
A list of variables as they are named in the granules (e.g., Cloud_Mask_QA, Mass_Concentration_Land)
constraints
A list of strings that denote the space and time bounds that limit the request (e.g., "table|-120 <= Longitude < 20", where Longitude is an independent variable in the dataset).
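As a sketch of what a client might do, the request could be issued as shown below. This assumes Python with the requests library; the endpoint, parameter names and values, and granule paths follow the example above and are illustrative only, not a finalized interface.

  # A sketch only: endpoint, parameters, and granule paths are illustrative.
  import requests

  endpoint = "http://host/server/aggregator"
  params = {
      "d4_func": "table(Mass_Concentration_Land)",   # dependent variable(s) to aggregate
      "d4_ce": "table|-120 <= Longitude < 20",       # space/time bounds on independent variables
      "return_as": "netcdf",                         # ask for netCDF files bundled in a tar.gz
  }

  # One granule per line in the POST body, as described above; these paths are hypothetical.
  granules = "\n".join([
      "/data/modis/2015/036/granule_0000.hdf",
      "/data/modis/2015/036/granule_0005.hdf",
  ])

  response = requests.post(endpoint, params=params, data=granules)
  with open("aggregation.tar.gz", "wb") as out:
      out.write(response.content)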

The response from the request will be a collection of netCDF files where each file in the response will correspond to a granule in the request. Each file will contain the requested variables, subset to contain data covering the lat/lon/time bounds specified in the request. Note that because the variables will be arrays and swath data generally cross latitude and longitude lines at an angle, there may be some 'extra' data values in some of the returned variables. The collection of netCDF files will be bundled in a compressed tar file or a zip file.
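A minimal sketch of how a user's software might unpack such a response, assuming Python with the standard-library tarfile module and the netCDF4 package (the file names are illustrative):

  # A sketch only: unpack the tar.gz bundle and open each netCDF file it contains.
  import tarfile
  import netCDF4

  with tarfile.open("aggregation.tar.gz", "r:gz") as tar:
      tar.extractall(path="aggregation")
      names = [m.name for m in tar.getmembers() if m.isfile()]

  for name in names:
      ds = netCDF4.Dataset("aggregation/" + name)   # one file per granule in the request
      print(name, list(ds.variables.keys()))
      ds.close()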

Aside: The granule list

Because the search tool is used to build the list of granules and it performs 'point-in-box' (or point-in-polygon, which subsumes PIB) selection of granules, this web service will assume that every granule in the request document contains some data that should be included in the response.

Aside: data organization

NB: This is discussed in more detail below in the Notes section.

The independent and dependent vars are all (with one important caveat) two-dimensional arrays. The mapping between a lat/lon/time tuple and a dependent variable's value is made by using the same (i,j) indices to access both the independent and dependent variables.

For example, a granule might contain these arrays:

  independent vars: Lat[5][7]; Lon[5][7]; time[5][7]
  dependent vars: SST[5][7]; Wind_Speed[5][7]

To get the wind speed at a given time, lat and lon, find those values in the independent vars, note the indices and then access the Wind_Speed array at those index values.
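A sketch of that index-matching step, assuming Python with numpy and the netCDF4 package; the file name and the variable names (Lat, Lon, Wind_Speed) follow the example above and are assumptions about the granule's contents.

  # A sketch only: find the (i, j) cell nearest a point of interest and read
  # the dependent variable at the same indices.
  import numpy as np
  import netCDF4

  ds = netCDF4.Dataset("granule.nc")        # hypothetical granule file
  lat = ds.variables["Lat"][:]              # shape (5, 7) in the example
  lon = ds.variables["Lon"][:]
  wind = ds.variables["Wind_Speed"][:]

  target_lat, target_lon = 42.0, -120.5     # the point the user cares about
  dist2 = (lat - target_lat) ** 2 + (lon - target_lon) ** 2
  i, j = np.unravel_index(np.argmin(dist2), dist2.shape)
  print("wind speed:", wind[i, j])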

The caveat is that most of the dependent vars in a L2 MODIS granule have three dimensions where the additional dim is some other independent variable. There are two ways we can accommodate that, but both require some moderately detailed knowledge on the part of the user:

  • We can accommodate dependent vars with an extra dimension like [MODIS_band][][] by using something like "MODIS_Band == 440"
  • We can allow a dependent var to be specified using a partial dimension list and use that as a subset. For example, given Mean_Reflectance_Land[MODIS_Band_Land = 7][Cell_Along_Swath = 203][Cell_Across_Swath = 135] where Cell_Along_Swath and Cell_Across_Swath match the indices of Latitude, Longitude and ...time..., the dependent variable can be given using Mean_Reflectance_Land[0] or in general Mean_Reflectance_Land[projection expression]

To make this work, the value(s) associated with the 'extra dimension' will need to be carried along in the returned files (in the tabular/CSV variant of this operation they become columns in the response table).
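As a sketch of the first option (fixing the extra dimension so the remaining two line up with the 2D lat/lon arrays), again assuming Python with the netCDF4 package and with names taken from the example above:

  # A sketch only: slice away the 'extra' band dimension so the remaining
  # dimensions match the 2D Latitude/Longitude arrays.
  import netCDF4

  ds = netCDF4.Dataset("granule.nc")              # hypothetical granule file
  refl = ds.variables["Mean_Reflectance_Land"]    # [MODIS_Band_Land][Cell_Along_Swath][Cell_Across_Swath]
  band_index = 0                                  # the band a constraint like "MODIS_Band == 440" selects
  refl_2d = refl[band_index, :, :]                # now indexed like Latitude/Longitude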

Regarding time

Level 2 data contain samples at various lat/lon points made over time. Each granule has a start time and end time, as does any temporally-contiguous set of granules. If we think about a request for all data that fall within two points on a time line, then the set of potential files with data can be thought of as having three kinds of elements: the file that contains the starting time, along with zero or more earlier times; the file that contains the ending time, along with zero or more later times; and the 'interior' files where all of the data are within the selection bounds.
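A small sketch of that three-way classification, assuming each granule's start and end times are already known (for example, from its metadata); the function name and its inputs are illustrative only.

  # A sketch only: classify a granule that overlaps the request window.
  from datetime import datetime

  def classify(granule_start, granule_end, request_start, request_end):
      """Return 'start', 'end', or 'interior' for an overlapping granule."""
      if granule_start < request_start:
          return "start"      # holds the starting time plus zero or more earlier samples
      if granule_end > request_end:
          return "end"        # holds the ending time plus zero or more later samples
      return "interior"       # all of its data fall within the selection bounds

  # Example: a granule entirely inside a one-day request window.
  print(classify(datetime(2015, 2, 5, 6), datetime(2015, 2, 5, 7),
                 datetime(2015, 2, 5, 0), datetime(2015, 2, 6, 0)))   # prints 'interior'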

Alternate Flow

There are a number of alternate flows involving errors, all of which involve invalid parameters or granules that fail in some way.

A significant alternate flow is that a user can build the URL that makes the request themselves. Nothing about the request or the response changes, however. The significant difference is that a computer program does not have to figure out how to make the URL. In the main flow of this use-case, it does.

Post Conditions

The client will have the requested data, in a collection of netCDF files.

Activity Diagram

[skipped]

Notes

See Satellite_Swath_Data_Aggregation#Notes for information about the L2 data files.

Resources

In order to support the capabilities described in this Use Case, a set of resources must be available and/or configured. These resources include data and services, and the systems that offer them. This section will call out examples of these resources.


Resource: name
Owner: Organization that owns/manages the resource
Description: Short description of the resource
Availability: How often the resource is available
Source System: Name of the system that provides the resource