NCML Module Aggregation JoinNew

From OPeNDAP Documentation
Revision as of 23:27, 8 July 2015 by Jimg (talk | contribs) (→‎Join New Aggregation)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
⧼opendap2-jumptonavigation⧽


Join New Aggregation

A joinNew aggregation joins existing datasets along a new outer Array dimension. Essentially, it adds a new index to the existing variable which points into the values in each member dataset. One useful example of this aggregation is for joining multiple samples of data from different times into one virtual dataset containing all the times. We will first provide a basic introduction to the joinNew aggregation, then demonstrate examples for the various ways to specify the members datasets of an aggregation, the values for the new dimension's coordinate variable (map vector), and ways to specify metadata for this aggregation.

The reader is also directed to a basic tutorial of this NcML aggregation which may be found at http://www.unidata.ucar.edu/software/thredds/current/netcdf-java/ncml/Aggregation.html#joinNew

PLEASE NOTE that our syntax is slightly different than that of the THREDDS Data Server (TDS), so please refer to this tutorial when using the Hyrax NcML Module! In particular, we do not process the <aggregation> element prior to other elements in the dataset, so in some cases the relative ordering of the <aggregation> and references to variables within the aggregation matters.

Introduction

A joinNew aggregation combines a variable with data across n datasets by creating a new outer dimension and placing the data from aggregation member i into the element i of the new outer dimension of size n. By "outer dimension" we mean a slowest varying dimension in a row major order flattening of the data (an example later will clarify this). For example, the array A[day][sample] would have the day dimension as the outer dimension. The data samples all must have the same data syntax; specifically the DDS of the variables must all match. For example, if the aggregation variable has name sample and is a 10x10 Array of float32, then all the member datasets in the aggregation must include a variable named sample which are all also 10x10 Arrays of float32. If there were 100 datasets specified in the aggregation, the resulting DDS would contain a variable named sample that was now of data shape 100x10x10.

In addition, a new coordinate variable specifying data values for the new dimension will be created at the same scope as (a sibling of) the specified aggregation variable. For example, if the new dimension is called "filename" and the new dimension's values are unspecified (the default), then an Array of type String will be created with one element for each member dataset --- the filename of the dataset. Additionally, if the aggregation variable was represented as a DAP Grid, this new dimension coordinate variable will also be added as a new Map vector inside the Grid to maintain the Grid specification.

There are multiple ways to specify the member datasets of a joinNew aggregation:

  • Explicit: Specifying a separate <netcdf> element for each dataset
  • Scan: scan a directory tree for files matching a conjunction of certain criteria:
    • Specific suffix
    • Older than a specific duration
    • Matching a specific regular expression
    • Either in a specific directory or recursively searching subdirectories

Additionally, there are multiple ways to specify the new coordinate variable's (the new outer dimension's associated data variable) data values:

  • Default: An Array of type String containing the filenames of the member datasets
  • Explicit Value Array: Explicit list of values of a specific data type, exactly one per dataset
  • Dynamic Array: a numeric Array variable specified using start and increment values -- one value is generated automatically per dataset
  • Timestamp from Filename: An Array of String with values of ISO 8601 Timestamps extracted from scanned dataset filenames using a specified Java SimpleDataFormat string. (Only works with <scan> element!)

A Simple Self-Contained Example

First, we start with a simple purely virtual (no external datasets) example to give the reader a basic idea of this aggregation. This example will join two one-dimensional Arrays of int's of length 5. The variable they describe will be called V. In this example, we assume we are joining samples of some variable V where each dataset is samples from 5 stations on a single day. We want to join the datasets so the new outer dimension is the day, resulting in a 2x5 array of int values for V.

Here's our NcML, with comments to describe what we are doing:

<?xml version="1.0" encoding="UTF-8"?>

<!-- A simple pure virtual joinNew aggregation of type Array<int>[5][2]  -->

<netcdf title="Sample joinNew Aggregation on Pure NCML Datasets">
  
  <!-- joinNew forming new outer dimension "day" -->
  <aggregation type="joinNew" dimName="day">
    
    <!-- For variables with this name in child datasets -->
    <variableAgg name="V"/>

    <!-- Datasets are one-dimensional Array<int> with cardinality 5. -->
    <netcdf title="Sample Slice 1">
      <!-- Must forward declare the dimension size -->
      <dimension name="station" length="5"/>
      <variable name="V" type="int" shape="station">
	<values>1 3 5 7 9</values>
      </variable>
    </netcdf>

    <!-- Second slice must match shape! -->
    <netcdf title="Sample Slice 2">
      <dimension name="station" length="5"/>
      <variable name="V" type="int" shape="station">
	<values>2 4 6 8 10</values>
      </variable>
    </netcdf>

  </aggregation>

<!-- This will be what the expected output aggregation will look like.
       We can use the named dimensions for the shape here since the aggregation
       comes first and the dimensions will be added to the parent dataset by now -->
  <variable name="V_expected" type="int" shape="day station">
    <!-- Row major values.  Since we create a new outer dimension, the slices are concatenated
        since the outer dimension varies the slowest in row major order.  This gives a 2x5 Array.
	 We use the newline to show the dimension separation for the reader's benefit -->
    <values>
      1 3 5 7 9 
      2 4 6 8 10
    </values>
  </variable>

</netcdf>

Notice we specify the name of the aggregation variable V inside the aggregation using a <variableAgg> element --- this allows to to specify multiple variables in the datasets to join. The new dimension, however, is specified by the attribute dimName of <aggregation>. We do NOT need to specify a <dimension> element for the new dimension (in fact, it would be an error to do so). Its size is calculated based on the number of datasets in the aggregation.

Running this file through the module produces the following DDS:

Dataset {
    Int32 V[day = 2][station = 5];
    Int32 V_expected[day = 2][station = 5];
    String day[day = 2];
} joinNew_virtual.ncml;

Notice how the new dimension caused a coordinate variable to be created with the same name and shape as the new dimension. This array will contain the default values for the new outer dimension's map as we shall see if we ask for the ASCII version of the DODS (data) response:

The data:
Int32 V[day = 2][station = 5] = {{1, 3, 5, 7, 9},{2, 4, 6, 8, 10}};
Int32 V_expected[day = 2][station = 5] = {{1, 3, 5, 7, 9},{2, 4, 6, 8, 10}};
String day[day = 2] = {"Virtual_Dataset_0", "Virtual_Dataset_1"};

We see that the resulting aggregation data matches what we expected to create, specified by our V_expected variable. Also, notice that the values for the coordinate variable are "Virtual_Dataset_i", where i is the number of the dataset. Since the datasets did not have the location attribute set (which would have been used if it was), the module generates unique names for the virtual datasets in the output.

We could also have specified the value for the dataset using the netcdf@coordValue attribute:

<?xml version="1.0" encoding="UTF-8"?>

<netcdf title="Sample joinNew Aggregation on Pure NCML Datasets">
  
    <aggregation type="joinNew" dimName="day">
    <variableAgg name="V"/>

    <netcdf title="Sample Slice 1" coordValue="100">
      <dimension name="station" length="5"/>
      <variable name="V" type="int" shape="station">
	<values>1 3 5 7 9</values>
      </variable>
    </netcdf>

    <netcdf title="Sample Slice 2" coordValue="107">
      <dimension name="station" length="5"/>
      <variable name="V" type="int" shape="station">
	<values>2 4 6 8 10</values>
      </variable>
    </netcdf>

  </aggregation>
</netcdf>

This results in the ASCII DODS of:

The data:
Int32 V[day = 2][station = 5] = {{1, 3, 5, 7, 9},{2, 4, 6, 8, 10}};
Float64 day[day = 2] = {100, 107};

Since the coordValue's could be parsed numerically, the coordinate variable is of type double (Float64). If they could not be parsed numerically, then the variable would be of type String.

Now that the reader has an idea of the basics of the joinNew aggregation, we will create examples for the many different use cases the NcML aggregation author may wish to create.


A Simple Example Using Explicit Dataset Files

Using virtual datasets is not that common. More commonly, the aggregation author wants to specify files for the aggregation. As an introductory example of this, we'll create a simple aggregation explicitly listing the files and giving string coordValue's. Note that this is a contrived example: we are using the same dataset file for each member, but changing the coordValue's. Also notice that we have specified that both the u and v variables be aggregated using the same new dimension name source.

<?xml version="1.0" encoding="UTF-8"?>
<netcdf title="joinNew Aggregation with explicit string coordValue.">
  
  <aggregation type="joinNew" dimName="source">    
    <variableAgg name="u"/>
    <variableAgg name="v"/>

    <!-- Same dataset a few times, but with different coordVal -->
    <netcdf title="Dataset 1" location="data/ncml/fnoc1.nc" coordValue="Station_1"/>
    <netcdf title="Dataset 2" location="data/ncml/fnoc1.nc" coordValue="Station_2"/>
    <netcdf title="Dataset 3" location="data/ncml/fnoc1.nc" coordValue="Station_3"/>

  </aggregation>

</netcdf>

which produces the DDS:

Dataset {
    Int16 u[source = 3][time_a = 16][lat = 17][lon = 21];
    Int16 v[source = 3][time_a = 16][lat = 17][lon = 21];
    Float32 lat[lat = 17];
    Float32 lon[lon = 21];
    Float32 time[time = 16];
    String source[source = 3];
} joinNew_string_coordVal.ncml;


Since there's so much data we only show the new coordinate variable:

String source[source = 3] = {"Station_1", "Station_2", "Station_3"};

Also notice that other coordinate variables (lat, lon, time) already existed in the datasets along with the u and v arrays. Any variable that is not aggregated over (specified as an aggregationVar) is explicitly union aggregated (please see NCML_Module_Aggregation_Union) into the resulting dataset --- the first instance of every variable found in the order the datasets are listed is used.

Now that we've seen simple cases, let's look at more complex examples.


Examples of Explicit Dataset Listings

In this section we will give several examples of joinNew aggregation with a static, explicit list of member datasets. In particular, we will go over examples of:

  • Default values for the new coordinate variable
  • Explicitly setting values of any type on the new coordinate variable
  • Autogenerating uniform numeric values for the new coordinate variable
  • Explicitly setting String or double values using the netcdf@coordValue attribute

There are several ways to specify values for the new coordinate variable of the new outer dimension. If String or double values are sufficient, the author may set the value for each listed dataset using the netcdf@coordValue attribute for each dataset. If another type is required for the new coordinate variable, then the author has a choice of specifying the entire new coordinate variable explicitly (which must match dimensionality of the aggregated dimension) or using the start/increment autogeneration <values> element for numeric, evenly spaced samples.

Please see the turotial: JoinNew Explicit Dataset Tutorial

Adding/Modifying Metadata on Aggregations

It is possible to add or modify metadata on existing or new variables in an aggregation. The syntax for these varies somewhat, so we give examples of the different cases.

We will also give examples of providing metadata:

  • Adding/modifying metadata to the new coordinate variable
  • Adding/modifying metadata to the aggregation variable itself
  • Adding/modifying metadata to existing maps in an aggregated Grid

Please see the Metadata on Aggregations Tutorial.

Dynamic Aggregations Using Directory Scanning

A powerful way to create dynamic aggregations rather than by listing datasets explicitly is by specifying a data directory where aggregation member datasets are stored and some criteria for which files are to be added to the aggregation. These criteria will be combined in a conjunction (an AND operator) to handle various types of searches. The way to specify datasets in an aggregation is by using the <scan> element inside the <aggregation> element.

A key benefit of using the <scan> element is that the NcML file need not change as new datasets are added to the aggregation, say by an automated process which simply writes new data files into a specific directory. By properly specifying the NcML aggregation with a scan, the same NcML will refer to a dynamically changing aggregation, staying up to date with current data, without the need for modifications to the NcML file itself. If the filenames have a timestamp encoded in them, the use of the dateFormatMark allows for automatic creation of the new coordinate variable data values as well, as shown below.

The scan element may be used to search a directory to find files that match the following criteria:

  • Suffix : the aggregated files end in a specific suffix, indicating the file type
  • Subdirectories: any subdirectories of the given location are to be searched and all regular files tested against the criteria
  • Older Than: the aggregated files must have been modified longer than some duration ago (to exclude files that may be currently being written)
  • Reg Exp: the aggregated file pathnames must match a specific regular expression
  • Date Format Mark: this highly useful criterion, useful in conjunction with others, allows the specification of a pattern in the filename which encodes a timestamp. The timestamp is extracted from the filenames using the pattern and is used to create ISO 8601 date elements for the new dimension's coordinate variable.

We will give examples of each of these criteria in use in our tutorial. Again, if more than one is specified, then ALL must match for the file to be included in the aggregation.

Please see the Dynamic Aggregation Tutorial for a more detailed discussion