BES Aggregation using NcML

From OPeNDAP Documentation
Revision as of 05:04, 31 December 2008 by Jimg (talk | contribs)

There are three main scenarios for aggregation that we have encountered so far, in addition to the simple situation where a group of otherwise discrete variables is combined in a Structure to be manipulated as a single variable. Those cases are tiling, combining parameters held in separate files, and grouping M N-dimensional variables into a single N+1-dimensional variable. A fourth form of aggregation that has emerged is 'tiling in time.' It is essentially tiling, but it is useful to separate tiling in the abstract sense from its two common cases: tiling over latitude and longitude, and tiling over time. Both latitude and longitude are periodic, so tiling needs to take this into account. Time, on the other hand, is generally not periodic (although climatologies could be stored in separate files and tiled along their time dimension).

In all cases several discrete data sets (i.e., URLs) are combined into a single data set. The discrete data sets can still be referenced using their own DAP URLs, but the result of the aggregation is a new DAP URL which references the aggregate. Building such a data set makes it possible to query for particular member data sets using the DAP constraint expression, so that query-selection becomes indistinguishable from a data access operation. In most cases, this single data access replaces a two-step operation in which specific data sets are first chosen and then each is individually sampled and the results combined within a client. Thus aggregation really solves two different problems. First, it removes the need to choose from among many discrete items when those are more appropriately viewed as a logical whole; it does this by mapping the data set selection (i.e., query) problem into a data access problem. Second, it removes the client complexity otherwise needed to manage differing data inventories and to manipulate the resulting responses into a single data object.
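The mapping from a constraint on the aggregate to accesses on the member data sets can be sketched as follows. This is a minimal illustration, not the BES design: the function name `locate`, the member sizes, and the idea that each member contributes a contiguous run of indices along the aggregated dimension are all assumptions made for the example.

```python
def locate(member_sizes, start, stop):
    """Map an inclusive index range on the aggregated dimension onto the
    members that hold it.

    member_sizes: length of each member along the aggregated dimension,
                  in aggregation order (hypothetical input).
    Returns a list of (member_index, local_start, local_stop) tuples,
    with inclusive local indices.
    """
    # offsets[i] is the aggregate index of the first element of member i.
    offsets = [0]
    for size in member_sizes:
        offsets.append(offsets[-1] + size)

    if not 0 <= start <= stop < offsets[-1]:
        raise IndexError("range lies outside the aggregated dimension")

    pieces = []
    for i in range(len(member_sizes)):
        lo = max(start, offsets[i])
        hi = min(stop, offsets[i + 1] - 1)
        if lo <= hi:  # this member overlaps the requested range
            pieces.append((i, lo - offsets[i], hi - offsets[i]))
    return pieces


# Three monthly files holding 31, 28 and 31 daily time steps.
print(locate([31, 28, 31], 25, 62))
# → [(0, 25, 30), (1, 0, 27), (2, 0, 3)]
```

A client that asks the aggregate for time indices 25 through 62 never sees the three members; the selection of files happens inside the server as a side effect of the index range.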

One issue that this page side-steps is using NcML to address the AIS Problem. We have long known that a data set which does not meet a given set of metadata requirements can be fixed by supplying the missing metadata from a data store that the server reads. The server combines the new metadata with the data and returns the result to the client (probably with an annotation that the original contents of the data set have been augmented). NcML has been designed with this in mind, and it seems obvious that if we choose to adopt NcML to specify aggregations, and thus write software to read and process it, then we should complete the picture and code support for its AIS capabilities too. We should, but that work belongs in a separate project unless we need those features to complete aggregation itself.

Increasing dimensionality

The most common example of this is the combination of a large set of satellite images which span some time range and are all regular with respect to one another in latitude and longitude. The images are not generally equally spaced in time, however. The result is an N+1-dimensional Grid variable with a new Grid Map that must be 'synthesized' using date/time information from some source (an attribute stored in the image file or the file name itself).
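A minimal sketch of this case, assuming the date is embedded in each file name as YYYYMMDD (a hypothetical naming scheme) and that each image is a small list of rows. The names `time_from_name` and `join_new` are illustrative only.

```python
import re
from datetime import datetime


def time_from_name(name):
    """Parse a YYYYMMDD date out of a file name (assumed naming scheme)."""
    match = re.search(r"(\d{8})", name)
    if match is None:
        raise ValueError(f"no date found in {name!r}")
    return datetime.strptime(match.group(1), "%Y%m%d")


def join_new(images):
    """Stack 2-D images into a 3-D array with a synthesized time map.

    images: dict mapping file name -> non-empty list of rows (lat x lon).
    Returns (time_map, data) where data[k] is the image at time_map[k].
    """
    ordered = sorted(images, key=time_from_name)

    # All images must share one shape before they can be stacked.
    shape = None
    for name in ordered:
        grid = images[name]
        this_shape = (len(grid), len(grid[0]))
        if shape is None:
            shape = this_shape
        elif this_shape != shape:
            raise ValueError(f"{name!r} has shape {this_shape}, expected {shape}")

    # The new map need not be evenly spaced; it simply records each date.
    time_map = [time_from_name(name) for name in ordered]
    data = [images[name] for name in ordered]
    return time_map, data


images = {
    "sst_20081203.nc": [[1, 2], [3, 4]],
    "sst_20081201.nc": [[5, 6], [7, 8]],
}
time_map, data = join_new(images)
print([t.strftime("%Y-%m-%d") for t in time_map])
```

Sorting by the parsed date rather than by file name keeps the new dimension monotonic even when the names do not sort chronologically.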

Tiling

There are two common cases of tiling: tiling images over latitude and longitude extents, and tiling periodic measurements taken over time. In each case the data are broken down into parts (tiles) for storage and/or access efficiency (although this is less and less compelling a reason given that storage formats like HDF4/5 and NetCDF 4 support internal compression, tiling and updates). In the case of measurements taken over time, it might be that a month's data are stored in a single file while the entire data set spans several years.
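Both kinds of tiling can be sketched briefly. The example below assumes each time tile is a non-empty pair of lists (times and values) and that longitude tiles are equal-width strips starting at 0 degrees; the function names and the tile layout are assumptions for illustration, not part of any existing BES module.

```python
def tile_in_time(tiles):
    """Concatenate per-file (times, values) pairs along the time axis.

    tiles: list of (times, values) pairs, one per file, each non-empty.
    """
    ordered = sorted(tiles, key=lambda tile: tile[0][0])
    times, values = [], []
    for tile_times, tile_values in ordered:
        if len(tile_times) != len(tile_values):
            raise ValueError("a tile's times and values differ in length")
        # Time is not periodic, so tiles must not overlap or repeat.
        if times and tile_times[0] <= times[-1]:
            raise ValueError("tiles overlap in time")
        times.extend(tile_times)
        values.extend(tile_values)
    return times, values


def tiles_for_lon_range(tile_width, west, east):
    """Return indices of the longitude tiles covering west..east.

    Longitude is periodic, so a range such as 350..10 wraps across the
    0/360 seam. Assumes tile_width is an integer divisor of 360 and that
    the range spans less than a full circle.
    """
    if 360 % tile_width != 0:
        raise ValueError("tile_width must divide 360 evenly")
    n = 360 // tile_width
    first = int((west % 360) // tile_width)
    last = int((east % 360) // tile_width)
    count = (last - first) % n + 1
    return [(first + k) % n for k in range(count)]


print(tiles_for_lon_range(90, 350, 10))
# → [3, 0]
```

The modular arithmetic in `tiles_for_lon_range` is the only place where periodicity matters; the time case deliberately rejects anything that would look like wrap-around.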

Combining parameters

In this case each parameter (e.g., salinity) is stored in a separate file. That is, each file is essentially a column in a table and the aggregation operation brings them together so that they can be accessed with their relations made explicit.
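As a rough sketch of the column analogy, assume each file has been read into a dict of equal-length lists keyed by variable name. The `union` function below is illustrative; raising an error when a shared variable disagrees between files is a design choice made for this example, not a documented BES or NcML rule.

```python
def union(datasets):
    """Combine per-file variable dicts into one table-like dict.

    datasets: list of dicts mapping variable name -> list of values.
    Variables that appear in more than one file (for example a shared
    depth coordinate) must be identical everywhere.
    """
    combined = {}
    length = None
    for dataset in datasets:
        for name, values in dataset.items():
            if name in combined:
                if combined[name] != values:
                    raise ValueError(f"variable {name!r} differs between files")
                continue
            # Every column must have the same number of rows.
            if length is None:
                length = len(values)
            elif len(values) != length:
                raise ValueError(f"variable {name!r} has the wrong length")
            combined[name] = values
    return combined


salinity_file = {"depth": [0, 10], "salinity": [35.1, 35.0]}
temperature_file = {"depth": [0, 10], "temperature": [20.5, 18.2]}

table = union([salinity_file, temperature_file])
# Rows make the relation between the parameters explicit.
for row in zip(table["depth"], table["salinity"], table["temperature"]):
    print(row)
```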

Use Cases

  1. Specifying an aggregation using NcML. This use case can be applied to any of the three/four aggregation types. Let's write a generic use case and see if we need to write separate use cases for each.
  2. Making an aggregation visible using the directory browsing features.
  3. Aggregating several URLs where attributes' values change. Here we need to determine whether the attributes should be promoted to a variable, elided, or turned into a vector of attributes.
  4. When the aggregated items fail to meet necessary uniformity criteria. How do we handle the case where someone tries to aggregate a bunch of images into an N+1-dimensional Grid when the images are not co-located in lat/long?
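The last two use cases can be explored with a small sketch. It assumes attributes arrive as one dict per member and Grid Maps as plain lists of coordinates; `merge_attributes`, `check_colocated` and the tolerance value are hypothetical names and choices. Promoting varying attributes to a per-member vector is only one of the options listed above.

```python
def merge_attributes(attr_sets):
    """Split attributes into those shared by all members and those that vary.

    attr_sets: list of dicts, one per aggregated member.
    Returns (shared, promoted): shared maps name -> single value; promoted
    maps name -> list of per-member values (None where a member lacks it).
    """
    names = set().union(*attr_sets)
    shared, promoted = {}, {}
    for name in sorted(names):
        values = [attrs.get(name) for attrs in attr_sets]
        if all(value == values[0] for value in values):
            shared[name] = values[0]
        else:
            promoted[name] = values
    return shared, promoted


def check_colocated(maps, tolerance=1e-6):
    """Raise ValueError unless every member has the same lat/lon maps.

    maps: non-empty list of (lat, lon) pairs, each a list of coordinates.
    """
    ref_lat, ref_lon = maps[0]
    for i, (lat, lon) in enumerate(maps[1:], start=1):
        for ref, got, label in ((ref_lat, lat, "lat"), (ref_lon, lon, "lon")):
            same_length = len(ref) == len(got)
            if not same_length or any(
                abs(a - b) > tolerance for a, b in zip(ref, got)
            ):
                raise ValueError(f"member {i}: {label} map differs from member 0")


shared, promoted = merge_attributes(
    [{"units": "K", "orbit": 101}, {"units": "K", "orbit": 102}]
)
print(shared)    # attributes kept once on the aggregate
print(promoted)  # candidates for promotion to a variable
```

Running the co-location check before building the aggregate lets the server report which member broke the uniformity criteria instead of returning a malformed Grid.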

Definitions

NcML
A markup language used by Unidata's TDS to describe aggregations and to provide features similar to our AIS software.
Aggregation
The combination of the variables in two or more DAP URLs into one or more new variables that can be accessed using a single DAP URL. This definition includes the three types of aggregation described above plus the 'Structure' aggregation already built into the BES, which currently can be used only with netCDF files.
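To make the two definitions concrete, the sketch below builds a small NcML aggregation document with Python's standard `xml.etree.ElementTree`. The element and attribute names (`aggregation`, `type`, `dimName`, `variableAgg`, `netcdf location`) and the aggregation types `joinNew`, `joinExisting` and `union` come from NcML as used by the TDS; the helper `build_ncml`, the file names, and any suggestion of how a BES module would consume the result are assumptions.

```python
import xml.etree.ElementTree as ET

NCML_NS = "http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2"

# Rough correspondence with the cases on this page:
#   joinNew      -> increasing dimensionality
#   joinExisting -> tiling along an existing dimension such as time
#   union        -> combining parameters held in separate files
AGG_TYPES = {"joinNew", "joinExisting", "union"}


def build_ncml(agg_type, locations, dim_name=None, variables=()):
    """Return an NcML document (as a string) describing one aggregation."""
    if agg_type not in AGG_TYPES:
        raise ValueError(f"unsupported aggregation type {agg_type!r}")

    root = ET.Element("netcdf", {"xmlns": NCML_NS})
    attrs = {"type": agg_type}
    if dim_name is not None:
        attrs["dimName"] = dim_name
    agg = ET.SubElement(root, "aggregation", attrs)

    # joinNew needs to know which variables gain the new dimension.
    for name in variables:
        ET.SubElement(agg, "variableAgg", {"name": name})
    for location in locations:
        ET.SubElement(agg, "netcdf", {"location": location})
    return ET.tostring(root, encoding="unicode")


# Hypothetical file names, used only for illustration.
print(build_ncml(
    "joinNew",
    ["sst_20081201.nc", "sst_20081203.nc"],
    dim_name="time",
    variables=["sst"],
))
```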

Background

Aggregation has long been known to be an important feature of remote/distributed access systems.

Deliverables

A BES with the capability to perform the three aggregations discussed above. We might implement this using a new module which can aggregate any DAP data source that can be accessed by the BES, and we might choose to code the aggregations using NcML, which would make our code a drop-in replacement for the TDS.

Period of use

We expect this project to be largely completed within six months of starting. The code should be releasable by the Summer ESIP meeting (June 2009). Once done, the software will be one of the core strengths of the Hyrax server.