NCML Module Aggregation JoinExisting: Difference between revisions

From OPeNDAP Documentation
⧼opendap2-jumptonavigation⧽
Line 164: Line 164:


The metadata for the coordinate variable is pulled from the ''first'' granule dataset.  Modification of coordinate variable metadata is not fully supported yet.
The metadata for the coordinate variable is pulled from the ''first'' granule dataset.  Modification of coordinate variable metadata is not fully supported yet.
== Future Work ==
The most important addition in the next release will be the ability to efficiently load and cache the granule dimension sizes offline to allow for a highly interactive user experience when data is requested.  This means that the ''ncoords'' attribute will no longer be required on the granules, although doing so results in a more efficient aggregation speed. 
We also plan to allow the data and metadata for the join dimension coordinate variable to be overridden and modified.

Revision as of 20:15, 19 October 2011


Join Existing Aggregation

A joinExisting aggregation joins multiple granule datasets by concatenating the specified outer dimensional data from the granules into the output. This results in matrices of the same number of dimensions, but with larger outer dimension cardinality. The outer dimension sizes of the granules may vary across granule, but any inner dimensions for multi-dimensional data still are required to match.

The reader is also directed to a basic tutorial of this NcML aggregation which may be found at http://www.unidata.ucar.edu/software/netcdf/ncml/v2.2/Aggregation.html#joinExisting. Note that version 1.1.0 of the module does not support all features of joinExisting! Future versions will add more features.

PLEASE NOTE that our syntax is slightly different than that of the THREDDS Data Server (TDS), so please refer to this tutorial when using the Hyrax NcML Module! In particular, we do not process the <aggregation> element prior to other elements in the dataset, so in some cases the relative ordering of the <aggregation> and references to variables within the aggregation matters.

Introduction

This page describes the behavior of the initial implementation of joinExisting for version 1.2.x of the NcML Module, bundled with Hyrax 1.8. It is a limited feature set described below. Please see the Limitations section for more information.

In version 1.2.x, a joinExisting aggregation may be specified in three ways:

  • Using explicit lists of netcdf elements with the the ncoords attribute correctly specified for all of them.
  • Leaving off the ncoords attribute for all of the netcdf elements.
  • Using a scan element with ncoords specified and all matching granule datasets having this dimension size

Our example below will clarify this.

Future versions of the module will implement more of the joinExisting feature set.

Examples

Here we give an example that illustrates the functionality offered by the current version of the aggregation. This example may also be found on:

http://test.opendap.org:8090/opendap/ioos/mday_joinExist.ncml

with the data granules located in

http://test.opendap.org:8090/opendap/coverage/mday/

Granules

Assume we have some number of granule datasets with a DDS the same as the following (modulo the dataset name):

Dataset {
    Grid {
      Array:
        Float32 PHssta[time = 1][altitude = 1][lat = 4096][lon = 8192];
      Maps:
        Float64 time[time = 1];
        Float64 altitude[altitude = 1];
        Float64 lat[lat = 4096];
        Float64 lon[lon = 8192];
    } PHssta;
} PH2006001_2006031_ssta.nc;

Explicit Listing of Granules

We see that here time is the outer dimension, which is the only dimension we may join along (it is an error to specify an inner). Given some number of granules with this same shape, consider the following explicit joinExisting aggregation:

<?xml version="1.0" encoding="UTF-8"?>

<netcdf title="joinExisting test on netcdf Grid granules">
  
  <aggregation type="joinExisting" 
	       dimName="time" >

    <!-- Note explicit use of ncoords specifying size of "time" in file required -->
    <netcdf location="/coverage/mday/PH2006001_2006031_ssta.nc"
                   ncoords="1"/>
    <netcdf location="/coverage/mday/PH2006032_2006059_ssta.nc"
                   ncoords="1"/>
    <netcdf location="/coverage/mday/PH2006060_2006090_ssta.nc"
                   ncoords="1"/>

  </aggregation>
  
</netcdf>

First, note that the ncoords attribute should be specified on the individual granules for this version of the module. In many cases the handler will be more efficient if the ncoords attribute is used. Note that we also specify the dimName. Any data array whose outer dimension is called this will be subject to aggregation in the output.

Serving this from Hyrax will result in the following DDS:

Dataset {
    Grid {
      Array:
        Float32 PHssta[time = 3][altitude = 1][lat = 4096][lon = 8192];
      Maps:
        Float64 time[time = 3];
        Float64 altitude[altitude = 1];
        Float64 lat[lat = 4096];
        Float64 lon[lon = 8192];
    } PHssta;
    Float64 time[time = 3];
} mday_joinExist.ncml;

We see that the time dimension is now of size 3 to match that we joined three granule datasets together.

Also notice that the map vector for the joined dimension, time, has been duplicated as a sibling of the dataset. This is done automatically by the aggregation and it is copied into the actual map of the Grid. This copy is to facilitate datasets which have multiple Grid's that are to be joined --- the top-level map vector is used as the canonical template map which is then copied into the maps for all the aggregated Grids. In the case of the joined data being of type Array, this vector would already exist as the coordinate variable for the data matrix. Since this is the source map for all aggregated Grid's, any attribute (metadata) changes should be made explicitly on this top-level coordinate variable so that the metadata is shared among all the aggregated Grid map vectors.

Using the scan element with ncoords extension

Since all of the granules are of uniform size, we may also use the syntactic sugar provided by a Hyrax-specific extension to NcML -- adding the ncoords attribute to a scan element. The behavior of this extension is to set the ncoords for each granule matching the scan to be this value, as if the datasets were each listed explicitly with this value of the attribute. Here's an example of using the syntactic sugar that results in the same exact aggregation as the previous explicit one:

<?xml version="1.0" encoding="UTF-8"?>
<!-- joinExisting test on netcdf granules using scan@ncoords extension-->
<netcdf title="joinExisting test on netcdf Grid granules using scan@ncoords"
	>
  
  <attribute name="Description" type="string"
	     value=" joinExisting test on netcdf Grid granules using scan@ncoords"/>

  <aggregation type="joinExisting" 
	       dimName="time" >

    <!-- Filenames have lexicographic and chronological ordering match -->
    <scan location="/coverage/mday"
	  subdirs="false"
	  suffix=".nc"
	  ncoords="1"
	  />

  </aggregation>
  
</netcdf>

which we see results in the same DDS:

Dataset {
    Grid {
      Array:
        Float32 PHssta[time = 3][altitude = 1][lat = 4096][lon = 8192];
      Maps:
        Float64 time[time = 3];
        Float64 altitude[altitude = 1];
        Float64 lat[lat = 4096];
        Float64 lon[lon = 8192];
    } PHssta;
    Float64 time[time = 3];
} mday_joinExist.ncml;


Limitations

The current version implements only basic functionality. If there is extended functionality that is needed for your use, please send <mailto:support@opendap.org> to let us know!


Join Dimension Sizes Must Be Explicitly Declared

As we have seen, the most important limitation is that the ncoords must be specified for efficiency reasons. Future versions will relax this requirement. The problem is that the size of the output join dimension is dependent on checking the DDS of every granule in the aggregation, which is computationally expensive for large aggregations unless it is computed once and cached. The next feature release of the module will include a caching system for solving this general problem and allowing the ncoords to be loaded automatically rather than being specified.


Source of Data for Agrgegated Coordinate Variable on Join Dimension

This version does not allow the join dimension's data to be declared explicitly in the NcML as the NcML tutorial page describes. This version automatically aggregates all variables with the outer dimension matching the dimName. This includes the coordinate variable (map vector in the case of Grid's) for the join dimension. These data cannot be overridden from those pulled from the files. Currently the TDS lists about 5 ways this data can be specified in addition to pulling them from the granules --- we only can pull them from granules now, which seems the most common use.

Source of Join Dimension Metadata

The metadata for the coordinate variable is pulled from the first granule dataset. Modification of coordinate variable metadata is not fully supported yet.