NCML Module Aggregation JoinNew

From OPeNDAP Documentation


Join New Aggregation

A joinNew aggregation joins existing datasets along a new outer Array dimension. Essentially, it adds a new index to the existing variable which points into the values in each member dataset. One useful example of this aggregation is joining multiple samples of data from different times into one virtual dataset containing all the times. We will first provide a basic introduction to the joinNew aggregation, then demonstrate examples of the various ways to specify the member datasets of an aggregation, the values for the new dimension's coordinate variable (map vector), and the metadata for this aggregation.

The reader is also directed to a basic tutorial of this NcML aggregation which may be found at http://www.unidata.ucar.edu/software/netcdf/ncml/v2.2/Aggregation.html#joinNew.

PLEASE NOTE that our syntax is slightly different from that of the THREDDS Data Server (TDS), so please refer to this tutorial when using the Hyrax NcML Module!

Introduction

A joinNew aggregation combines a variable with data across n datasets by creating a new outer dimension and placing the data from aggregation member i into element i of the new outer dimension. By "outer dimension" we mean the slowest varying dimension in a row-major flattening of the data (an example later will clarify this). For example, the array A[day][sample] would have the day dimension as the outer dimension. The member datasets must all have the same data syntax for the aggregated variable; specifically, the DDS of the variable must match in every member. For example, if the aggregation variable is named sample and is a 10x10 Array of float32, then every member dataset in the aggregation must include a variable named sample that is also a 10x10 Array of float32. If 100 datasets were specified in the aggregation, the resulting DDS would contain a variable named sample with shape 100x10x10.
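
To make the row-major description concrete, here is a minimal Python sketch of what a joinNew does to the data. It works on plain nested lists and is only an illustration, not the module's implementation; the helper names shape_of, join_new, and flatten are hypothetical.

```python
def shape_of(array):
    """Return the shape of a rectangular nested list, e.g. [[1, 2], [3, 4]] -> (2, 2)."""
    shape = []
    while isinstance(array, list):
        shape.append(len(array))
        if not array:
            break
        array = array[0]
    return tuple(shape)


def join_new(members):
    """Stack equally shaped member arrays along a new outer dimension."""
    shapes = {shape_of(member) for member in members}
    if len(shapes) != 1:
        raise ValueError("all members must have the same shape, got %s" % sorted(shapes))
    # Element i of the new outer dimension is simply member i, unchanged.
    return list(members)


def flatten(array):
    """Row-major flattening: the outermost index varies slowest."""
    if not isinstance(array, list):
        return [array]
    flat = []
    for item in array:
        flat.extend(flatten(item))
    return flat


day_1 = [1, 3, 5, 7, 9]
day_2 = [2, 4, 6, 8, 10]
v = join_new([day_1, day_2])
print(shape_of(v))  # (2, 5)
print(flatten(v))   # [1, 3, 5, 7, 9, 2, 4, 6, 8, 10]
```

Because the new dimension is the outermost one, the flattened result is just the member arrays concatenated in the order they are listed.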

In addition, a new variable specifying data values for the new dimension will be created at the same scope as (a sibling of) the specified aggregation variable. For example, if the new dimension is called "filename" and the new dimension's values are unspecified (the default), then an Array of type String will be created with one element for each member dataset --- the filename of the dataset. Additionally, if the aggregation variable was a DAP Grid, this new dimension data variable will also be added as a new Map vector inside the Grid to maintain the Grid specification.

There are multiple ways to specify the member datasets of a joinNew aggregation:

  • Explicit: using <netcdf> elements
  • Scan: scan a directory tree for files matching a conjunction of certain criteria:
    • Specific suffix
    • Older than a specific duration
    • Matching a specific regular expression
    • Either in a specific directory or recursively searching subdirectories

Additionally, there are multiple ways to specify the new coordinate variable's (the new outer dimension's associated data variable) data values:

  • Default: An Array of type String containing the filenames of the member datasets
  • Explicit Value Array: Explicit list of values of a specific data type, exactly one per dataset
  • Dynamic Array: a numeric Array variable specified using start and increment values -- one value is generated automatically per dataset
  • Timestamp from Filename: An Array of String with values of ISO 8601 timestamps extracted from scanned dataset filenames using a specified Java SimpleDateFormat string. (Only works with the <scan> element!)
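
As a rough illustration of the last option, the following Python sketch pulls a timestamp out of a filename and renders it as ISO 8601. It is an analogy only: the module itself takes a Java SimpleDateFormat string, whereas this sketch uses Python's strptime format codes. The function name is hypothetical, and the filename layout ("f", two-digit year, day of year, hour, minute, second) is an assumption made for the example, not something the text above specifies.

```python
import re
from datetime import datetime


def timestamp_from_filename(filename, pattern, time_format):
    """Extract the timestamp digits from `filename` and return an ISO 8601 string.

    `pattern` is a regular expression with one capture group around the digits;
    `time_format` is a strptime format describing those digits.
    """
    match = re.search(pattern, filename)
    if match is None:
        raise ValueError("no timestamp found in %r" % filename)
    return datetime.strptime(match.group(1), time_format).isoformat()


# Assumed layout: "f" + 2-digit year + day of year + hour + minute + second.
stamp = timestamp_from_filename("f97182070958.hdf", r"f(\d{11})\.hdf$", "%y%j%H%M%S")
print(stamp)  # 1997-07-01T07:09:58
```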

A Simple Self-Contained Example

First, we start with a simple purely virtual (no external datasets) example to give the reader a basic idea of this aggregation. This example will join two one-dimensional Arrays of int's of length 5. The variable they describe will be called V. In this example, we assume we are joining samples of some variable V where each dataset is samples from 5 stations on a single day. We want to join the datasets so the new outer dimension is the day, resulting in a 2x5 array of int values for V.

Here's our NcML, with comments to describe what we are doing:

<?xml version="1.0" encoding="UTF-8"?>

<!-- A simple pure virtual joinNew aggregation of type Array<int>[2][5]  -->

<netcdf title="Sample joinNew Aggregation on Pure NCML Datasets">
  
  <!-- joinNew forming new outer dimension "day" -->
  <aggregation type="joinNew" dimName="day">
    
    <!-- For variables with this name in child datasets -->
    <variableAgg name="V"/>

    <!-- Datasets are one-dimensional Array<int> with cardinality 5. -->
    <netcdf title="Sample Slice 1">
      <!-- Must forward declare the dimension size -->
      <dimension name="station" length="5"/>
      <variable name="V" type="int" shape="station">
	<values>1 3 5 7 9</values>
      </variable>
    </netcdf>

    <!-- Second slice must match shape! -->
    <netcdf title="Sample Slice 2">
      <dimension name="station" length="5"/>
      <variable name="V" type="int" shape="station">
	<values>2 4 6 8 10</values>
      </variable>
    </netcdf>

  </aggregation>

<!-- This will be what the expected output aggregation will look like.
       We can use the named dimensions for the shape here since the aggregation
       comes first and the dimensions will be added to the parent dataset by now -->
  <variable name="V_expected" type="int" shape="day station">
    <!-- Row major values.  Since we create a new outer dimension, the slices are concatenated
        since the outer dimension varies the slowest in row major order.  This gives a 2x5 Array.
	 We use the newline to show the dimension separation for the reader's benefit -->
    <values>
      1 3 5 7 9 
      2 4 6 8 10
    </values>
  </variable>

</netcdf>

Notice we specify the name of the aggregation variable V inside the aggregation using a <variableAgg> element; this allows us to specify multiple variables in the datasets to join. The new dimension, however, is specified by the attribute dimName of <aggregation>. We do NOT need to specify a <dimension> element for the new dimension (in fact, it would be an error to do so). Its size is calculated based on the number of datasets in the aggregation.

Running this file through the module produces the following DDS:

Dataset {
    Int32 V[day = 2][station = 5];
    Int32 V_expected[day = 2][station = 5];
    String day[day = 2];
} joinNew_virtual.ncml;

Notice how the new dimension caused a coordinate variable to be created with the same name and shape as the new dimension. This array will contain the default values for the new outer dimension's map as we shall see if we ask for the ASCII version of the DODS (data) response:

The data:
Int32 V[day = 2][station = 5] = {{1, 3, 5, 7, 9},{2, 4, 6, 8, 10}};
Int32 V_expected[day = 2][station = 5] = {{1, 3, 5, 7, 9},{2, 4, 6, 8, 10}};
String day[day = 2] = {"Virtual_Dataset_0", "Virtual_Dataset_1"};

We see that the resulting aggregation data matches what we expected to create, as specified by our V_expected variable. Also, notice that the values for the coordinate variable are "Virtual_Dataset_i", where i is the number of the dataset. Had the datasets set the location attribute, its value would have been used; since they did not, the module generates unique names for the virtual datasets in the output.
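
The defaulting rule just described can be sketched in a few lines of Python. This is a simplified model, not the module's code: the function name default_coord_values is hypothetical, and a missing location attribute is represented by None.

```python
def default_coord_values(locations):
    """Return the default String coordinate values for the member datasets.

    `locations` holds one entry per member dataset, in listed order:
    the location attribute if it was set, otherwise None.
    """
    return [
        location if location is not None else "Virtual_Dataset_%d" % index
        for index, location in enumerate(locations)
    ]


print(default_coord_values([None, None]))
# ['Virtual_Dataset_0', 'Virtual_Dataset_1']
```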

We could also have specified the value for the dataset using the netcdf@coordValue attribute:

<?xml version="1.0" encoding="UTF-8"?>

<netcdf title="Sample joinNew Aggregation on Pure NCML Datasets">
  
    <aggregation type="joinNew" dimName="day">
    <variableAgg name="V"/>

    <netcdf title="Sample Slice 1" coordValue="100">
      <dimension name="station" length="5"/>
      <variable name="V" type="int" shape="station">
	<values>1 3 5 7 9</values>
      </variable>
    </netcdf>

    <netcdf title="Sample Slice 2" coordValue="107">
      <dimension name="station" length="5"/>
      <variable name="V" type="int" shape="station">
	<values>2 4 6 8 10</values>
      </variable>
    </netcdf>

  </aggregation>
</netcdf>

This results in the ASCII DODS of:

The data:
Int32 V[day = 2][station = 5] = {{1, 3, 5, 7, 9},{2, 4, 6, 8, 10}};
Float64 day[day = 2] = {100, 107};

Since the coordValue's could be parsed numerically, the coordinate variable is of type double (Float64). If they could not be parsed numerically, then the variable would be of type String.

Now that the reader has an idea of the basics of the joinNew aggregation, we will create examples for the many different use cases the NcML aggregation author may wish to create.


A Simple Example Using Explicit Dataset Files

Using virtual datasets is not that common. More commonly, the aggregation author wants to specify files for the aggregation. As an introductory example of this, we'll create a simple aggregation explicitly listing the files and giving string coordValue's. Note that this is a contrived example: we are using the same dataset file for each member, but changing the coordValue's. Also notice that we have specified that both the u and v variables be aggregated using the same new dimension name source.

<?xml version="1.0" encoding="UTF-8"?>
<netcdf title="joinNew Aggregation with explicit string coordValue.">
  
  <aggregation type="joinNew" dimName="source">    
    <variableAgg name="u"/>
    <variableAgg name="v"/>

    <!-- Same dataset a few times, but with different coordVal -->
    <netcdf title="Dataset 1" location="data/ncml/fnoc1.nc" coordValue="Station_1"/>
    <netcdf title="Dataset 2" location="data/ncml/fnoc1.nc" coordValue="Station_2"/>
    <netcdf title="Dataset 3" location="data/ncml/fnoc1.nc" coordValue="Station_3"/>

  </aggregation>

</netcdf>

which produces the DDS:

Dataset {
    Int16 u[source = 3][time_a = 16][lat = 17][lon = 21];
    Int16 v[source = 3][time_a = 16][lat = 17][lon = 21];
    Float32 lat[lat = 17];
    Float32 lon[lon = 21];
    Float32 time[time = 16];
    String source[source = 3];
} joinNew_string_coordVal.ncml;


Since there's so much data we only show the new coordinate variable:

String source[source = 3] = {"Station_1", "Station_2", "Station_3"};

Also notice that other coordinate variables (lat, lon, time) already existed in the datasets along with the u and v arrays. Any variable that is not aggregated over (that is, not named in a <variableAgg> element) is union aggregated (please see NCML_Module_Aggregation_Union) into the resulting dataset: the first instance of every variable found in the order the datasets are listed is used.
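
The "first instance wins" behavior for the non-aggregated variables can be modeled with a short Python sketch. Here each dataset is represented as a plain dictionary from variable name to some value; this representation and the function name union_variables are assumptions made for illustration only.

```python
def union_variables(datasets, aggregated_names):
    """Collect the non-aggregated variables, keeping the first instance of each name.

    `datasets` is a list of dicts (variable name -> variable) in listed order;
    `aggregated_names` is the set of names handled by the joinNew itself.
    """
    result = {}
    for dataset in datasets:
        for name, variable in dataset.items():
            if name not in aggregated_names and name not in result:
                result[name] = variable
    return result


members = [
    {"u": "u from dataset 1", "lat": "lat from dataset 1"},
    {"u": "u from dataset 2", "lat": "lat from dataset 2", "lon": "lon from dataset 2"},
]
print(union_variables(members, {"u"}))
# {'lat': 'lat from dataset 1', 'lon': 'lon from dataset 2'}
```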

Now that we've seen simple cases, let's look at more complex examples.


Examples of Explicit Dataset Listings

In this section we will give several examples of joinNew aggregation with a static, explicit list of member datasets. In particular, we will go over examples of:

  • Default values for the new coordinate variable
  • Explicitly setting values of any type on the new coordinate variable
  • Autogenerating uniform numeric values for the new coordinate variable
  • Explicitly setting String or double values using the netcdf@coordValue attribute

There are several ways to specify values for the new coordinate variable of the new outer dimension. If String or double values are sufficient, the author may set the value for each listed dataset using the netcdf@coordValue attribute for each dataset. If another type is required for the new coordinate variable, then the author has a choice of specifying the entire new coordinate variable explicitly (which must match dimensionality of the aggregated dimension) or using the start/increment autogeneration <values> element for numeric, evenly spaced samples.

Default Values for the New Coordinate Variable (on a Grid)

The default for the new coordinate variable is to be of type String with the location of the dataset as the value. For example, the following NcML file:

<?xml version="1.0" encoding="UTF-8"?>
<netcdf title="Simple test of joinNew Grid aggregation">
  
  <aggregation type="joinNew" dimName="filename">
    <variableAgg name="dsp_band_1"/> 
    <netcdf location="data/ncml/agg/grids/f97182070958.hdf"/> 
    <netcdf location="data/ncml/agg/grids/f97182183448.hdf"/> 
    <netcdf location="data/ncml/agg/grids/f97183065853.hdf"/>  
    <netcdf location="data/ncml/agg/grids/f97183182355.hdf"/> 
  </aggregation> 
  
</netcdf>

specifies an aggregation on a Grid variable dsp_band_1 sampled in four HDF4 datasets listed explicitly.

First, the data structure (DDS) is:

Dataset {
    Grid {
      Array:
        UInt32 dsp_band_1[filename = 4][lat = 1024][lon = 1024];
      Maps:
        String filename[filename = 4];
        Float64 lat[1024];
        Float64 lon[1024];
    } dsp_band_1;
    String filename[filename = 4];
} joinNew_grid.ncml;

We see the aggregated variable dsp_band_1 has the new outer dimension filename. A coordinate variable filename[filename] was created as a sibling of the aggregated variable (the top level Grid we specified) and was also copied into the aggregated Grid as a new map vector.

The ASCII data response for just the new coordinate variable filename[filename] is:

String filename[filename = 4] = {"data/ncml/agg/grids/f97182070958.hdf", 
"data/ncml/agg/grids/f97182183448.hdf", 
"data/ncml/agg/grids/f97183065853.hdf", 
"data/ncml/agg/grids/f97183182355.hdf"};

We see that the location we specified for each dataset, as a String, is the value of the corresponding element of the new coordinate variable.

The newly added map dsp_band_1.filename contains a copy of this data.

Explicitly Specifying the New Coordinate Variable

If the author wishes to have the new coordinate variable be of a specific data type with non-uniform values, then they must specify the new coordinate variable explicitly.

Array Virtual Dataset

Here's an example using a contrived pure virtual dataset:

<?xml version="1.0" encoding="UTF-8"?>
<netcdf title="JoinNew on Array with Explicit Map">

  <!-- joinNew and form new outer dimension "day" -->
  <aggregation type="joinNew" dimName="day">
    <variableAgg name="V"/>

    <netcdf title="Slice 1">
      <dimension name="sensors" length="3"/>
      <variable name="V" type="int" shape="sensors">
	<values>1 2 3</values>
      </variable>
    </netcdf>

    <netcdf title="Slice 2">
      <dimension name="sensors" length="3"/>
      <variable name="V" type="int" shape="sensors">
	<values>4 5 6</values>
      </variable>
    </netcdf>

  </aggregation>

  <!-- This is recognized as the definition of the new coordinate variable, 
       since it has the form day[day] where day is the dimName for the aggregation. 
       It MUST be specified after the aggregation, so that the dimension size of day
      has been calculated.
  -->
  <variable name="day" type="int" shape="day">
    <!-- Note: metadata may be added here as normal! -->
    <attribute name="units" type="string">Days since 01/01/2010</attribute>
    <values>1 30</values>
  </variable>
	     
</netcdf>

The resulting DDS:

Dataset {
    Int32 V[day = 2][sensors = 3];
    Int32 day[day = 2];
} joinNew_with_explicit_map.ncml;

and the ASCII data:

Int32 V[day = 2][sensors = 3] = {{1, 2, 3},{4, 5, 6}};
Int32 day[day = 2] = {1, 30};

Note that the values we have explicitly given are used here, as is the specified NcML type, int, which is mapped to a DAP Int32.

If metadata is desired on the new coordinate variable, it may be added just as in a normal new variable declaration. We'll give more examples of this later.

Grid with Explicit Map

Let's give one more example using a Grid to demonstrate the recognition of the coordinate variable as it is added to the Grid as the map vector for the new dimension:

<?xml version="1.0" encoding="UTF-8"?>
<netcdf title="joinNew Grid aggregation with explicit map">
  
  <aggregation type="joinNew" dimName="sample_time">
    <variableAgg name="dsp_band_1"/> 
    <netcdf location="data/ncml/agg/grids/f97182070958.hdf"/> 
    <netcdf location="data/ncml/agg/grids/f97182183448.hdf"/> 
    <netcdf location="data/ncml/agg/grids/f97183065853.hdf"/>  
    <netcdf location="data/ncml/agg/grids/f97183182355.hdf"/> 
  </aggregation> 
  
  <!-- Note: values are contrived -->
  <variable name="sample_time" shape="sample_time" type="float">
    <!-- Metadata here will also show up in the Grid map -->
    <attribute name="units" type="string">Days since 01/01/2010</attribute>
    <values>100 200 400 1000</values>
  </variable>

</netcdf>

This produces the DDS:

Dataset {
    Grid {
      Array:
        UInt32 dsp_band_1[sample_time = 4][lat = 1024][lon = 1024];
      Maps:
        Float32 sample_time[sample_time = 4];
        Float64 lat[1024];
        Float64 lon[1024];
    } dsp_band_1;
    Float32 sample_time[sample_time = 4];
} joinNew_grid_explicit_map.ncml;

You can see the explicit coordinate variable sample_time was found as the sibling of the aggregated Grid and was added as the new map vector for the Grid.

The values for the projected coordinate variables are as expected:

Float32 sample_time[sample_time = 4] = {100, 200, 400, 1000};

Errors

It is a Parse Error to:

  • Give a different number of values for the explicit coordinate variable than there are specified datasets
  • Specify the new coordinate variable prior to the <aggregation> element since the dimension size is not yet known


Autogenerated Uniform Numeric Values

If the number of datasets might vary (for example, if a <scan> element, described later, is used), but the values are uniform, the start/increment version of the <values> element may be used to generate the values for the new coordinate variable. For example:

<?xml version="1.0" encoding="UTF-8"?>
<netcdf title="JoinNew on Array with Explicit Autogenerated Map">

  <aggregation type="joinNew" dimName="day">
    <variableAgg name="V"/>

    <netcdf title="Slice 1">
      <dimension name="sensors" length="3"/>
      <variable name="V" type="int" shape="sensors">
	<values>1 2 3</values>
      </variable>
    </netcdf>

    <netcdf title="Slice 2">
      <dimension name="sensors" length="3"/>
      <variable name="V" type="int" shape="sensors">
	<values>4 5 6</values>
      </variable>
    </netcdf>

  </aggregation>

  <!-- Explicit coordinate variable definition -->
  <variable name="day" type="int" shape="day">
    <attribute name="units" type="string" value="days since 2000-1-01 00:00"/>
    <!-- We sample once a week... -->
    <values start="1" increment="7"/>
  </variable>
	     
</netcdf>

The DDS is the same as before, and the coordinate variable is generated as expected, with one value per member dataset:

Int32 day[day = 2] = {1, 8};

Note that this form is useful for uniform sampled datasets (or if only a numeric index is desired) where the variable need not be changed as datasets are added. It is especially useful for a <scan> element that refers to a dynamic number of files that can be described with a uniformly varying index.
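
The start/increment form amounts to a simple arithmetic sequence with one value per member dataset. The Python sketch below shows the idea; the function name autogenerate is hypothetical and this is not the module's implementation.

```python
def autogenerate(start, increment, dataset_count):
    """Generate one coordinate value per member dataset: start, start + increment, ..."""
    return [start + index * increment for index in range(dataset_count)]


print(autogenerate(1, 7, 2))  # [1, 8]
print(autogenerate(1, 7, 4))  # [1, 8, 15, 22]
```

Since the values depend only on the number of datasets, the same <values start="1" increment="7"/> declaration keeps working as datasets are added.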

Explicitly Using coordValue Attribute of <netcdf>

The netcdf@coordValue attribute may be used to specify the value for the given dataset right where the dataset is declared. This attribute will cause a coordinate variable to be automatically generated with the given values for each dataset filled in. The new coordinate variable will be of type double if the coordValue's can all be parsed as numbers; otherwise it will be of type String.

String coordValue Example

<?xml version="1.0" encoding="UTF-8"?>
<netcdf title="joinNew Aggregation with explicit string coordValue">
  
  <aggregation type="joinNew" dimName="source">
    <variableAgg name="u"/>
    <variableAgg name="v"/>

    <!-- Same dataset a few times, but with different coordVal -->
    <netcdf title="Dataset 1" location="data/ncml/fnoc1.nc" coordValue="Station_1"/>
    <netcdf title="Dataset 2" location="data/ncml/fnoc1.nc" coordValue="Station_2"/>
    <netcdf title="Dataset 3" location="data/ncml/fnoc1.nc" coordValue="Station_3"/>
  </aggregation>

</netcdf>

This results in the following DDS:

Dataset {
    Int16 u[source = 3][time_a = 16][lat = 17][lon = 21];
    Int16 v[source = 3][time_a = 16][lat = 17][lon = 21];
    Float32 lat[lat = 17];
    Float32 lon[lon = 21];
    Float32 time[time = 16];
    String source[source = 3];
} joinNew_string_coordVal.ncml;

and ASCII data response of the projected coordinate variable is:

String source[source = 3] = {"Station_1", "Station_2", "Station_3"};

as we specified.

Numeric (double) Use of coordValue

If the first coordValue can be successfully parsed as a double numeric type, then a coordinate variable of type double (Float64) is created and all remaining coordValue specifications must be parsable as a double or a Parse Error is thrown.

Using the same example but with numbers instead:

<?xml version="1.0" encoding="UTF-8"?>
<netcdf title="joinNew Aggregation with numeric coordValue">
  
  <aggregation type="joinNew" dimName="source">
    <variableAgg name="u"/>
    <variableAgg name="v"/>

    <!-- Same dataset a few times, but with different coordVal -->
    <netcdf title="Dataset 1" location="data/ncml/fnoc1.nc" coordValue="1.2"/>
    <netcdf title="Dataset 2" location="data/ncml/fnoc1.nc" coordValue="3.4"/>
    <netcdf title="Dataset 3" location="data/ncml/fnoc1.nc" coordValue="5.6"/>

  </aggregation>
</netcdf>

This time we see that a Float64 array is created:

Dataset {
    Int16 u[source = 3][time_a = 16][lat = 17][lon = 21];
    Int16 v[source = 3][time_a = 16][lat = 17][lon = 21];
    Float32 lat[lat = 17];
    Float32 lon[lon = 21];
    Float32 time[time = 16];
    Float64 source[source = 3];
} joinNew_numeric_coordValue.ncml;

The values we specified are in the coordinate variable ASCII data:

Float64 source[source = 3] = {1.2, 3.4, 5.6};
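
The typing rule for coordValue described in this section can be sketched in Python as follows. This is a simplified model under stated assumptions: the function name is hypothetical, Python's float() stands in for the module's numeric parser (the two may not accept exactly the same strings), and a Parse Error is modeled as a ValueError.

```python
def coord_variable_from_coord_values(coord_values):
    """Return (dap_type, values) for the coordValue strings of the member datasets."""

    def as_double(text):
        try:
            return float(text)
        except ValueError:
            return None

    if not coord_values:
        raise ValueError("need at least one coordValue")

    # The first coordValue decides the type of the coordinate variable.
    if as_double(coord_values[0]) is None:
        return "String", list(coord_values)

    numbers = []
    for text in coord_values:
        number = as_double(text)
        if number is None:
            # Once the variable is numeric, every remaining value must parse.
            raise ValueError("Parse Error: coordValue %r is not a double" % text)
        numbers.append(number)
    return "Float64", numbers


print(coord_variable_from_coord_values(["1.2", "3.4", "5.6"]))
# ('Float64', [1.2, 3.4, 5.6])
print(coord_variable_from_coord_values(["Station_1", "Station_2", "Station_3"]))
# ('String', ['Station_1', 'Station_2', 'Station_3'])
```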


Metadata on Aggregations

It is possible to add or modify metadata on existing or new variables in an aggregation. The syntax for these varies somewhat, so we give examples of the different cases.

We will also give examples of providing metadata:

  • Adding/modifying metadata to the new coordinate variable
  • Adding/modifying metadata to the aggregation variable itself
  • Adding/modifying metadata to existing maps in an aggregated Grid

Metadata Specification on the New Coordinate Variable

We can add metadata to the new coordinate variable in two ways:

  • Adding it to the <variable> element directly in the case where the new coordinate variable and its values are defined explicitly
  • Adding the metadata to an automatically created coordinate variable by leaving the <values> element out

The first case we have already seen, but we will show it again explicitly. The second case is a little different and we'll cover it separately.

Adding Metadata to the Explicit New Coordinate Variable

We have already seen examples of explicitly defining the new coordinate variable and giving its values. In these cases, the metadata is added to the new coordinate variable exactly like any other variable. Let's see the example again:

<?xml version="1.0" encoding="UTF-8"?>
<netcdf title="joinNew Grid aggregation with explicit map">
  
  <aggregation type="joinNew" dimName="sample_time">
    <variableAgg name="dsp_band_1"/> 
    <netcdf location="data/ncml/agg/grids/f97182070958.hdf"/> 
    <netcdf location="data/ncml/agg/grids/f97182183448.hdf"/> 
    <netcdf location="data/ncml/agg/grids/f97183065853.hdf"/>  
    <netcdf location="data/ncml/agg/grids/f97183182355.hdf"/> 
  </aggregation> 
  
  <variable name="sample_time" shape="sample_time" type="float">
    <!-- Metadata here will also show up in the Grid map -->
    <attribute name="units" type="string">Days since 01/01/2010</attribute>
    <values>100 200 400 1000</values>
  </variable>

</netcdf>

We see that the units attribute for the new coordinate variable has been specified. This subset of the DAS (we don't show the extensive global metadata) shows this:

   dsp_band_1 {
        Byte dsp_PixelType 1;
        Byte dsp_PixelSize 2;
        UInt16 dsp_Flag 0;
        UInt16 dsp_nBits 16;
        Int32 dsp_LineSize 0;
        String dsp_cal_name "Temperature";
        String units "Temp";
        UInt16 dsp_cal_eqnNumber 2;
        UInt16 dsp_cal_CoeffsLength 8;
        Float32 dsp_cal_coeffs 0.125, -4;
        Float32 scale_factor 0.125;
        Float32 add_off -4;
        sample_time {
 --->           String units "Days since 01/01/2010";
        }
        dsp_band_1 {
        }
        lat {
            String name "lat";
            String long_name "latitude";
        }
        lon {
            String name "lon";
            String long_name "longitude";
        }
    }
    sample_time {
--->        String units "Days since 01/01/2010";
    }

We show the new metadata with the "--->" marker. Note that the metadata for the coordinate variable is also copied into the new map vector of the aggregated Grid.

Metadata can be specified in this way for any case where the new coordinate variable is listed explicitly.

Adding Metadata to An Autogenerated Coordinate Variable

If we expect the coordinate variable to be automatically added, we can also specify its metadata by referring to the variable without setting its values. This is useful in the case of using netcdf@coordValue and we will also see it is very useful when using a <scan> element for dynamic aggregations.

Here's a trivial example using the default case of the filename:

<?xml version="1.0" encoding="UTF-8"?>
<netcdf title="Test of adding metadata to the new map vector in a joinNew Grid aggregation">
 
  <aggregation type="joinNew" dimName="filename">
    <variableAgg name="dsp_band_1"/> 
    <netcdf location="data/ncml/agg/grids/f97182070958.hdf"/> 
  </aggregation> 

  <!-- 
       Add metadata to the created new outer dimension variable after
       the aggregation is defined by using a placeholder variable
       whose values will be defined automatically by the aggregation.
  -->  
  <variable type="string" name="filename">
    <attribute name="units" type="string">Filename of the dataset</attribute>
  </variable>

</netcdf>

Note here that we just neglected to add a <values> element since we want the values to be generated automatically by the aggregation. Note also that this is almost the same way we'd modify an existing variable's metadata. The only difference is we need to "declare" the type of the variable here since technically the variable specified here is a placeholder for the generated coordinate variable. So after the aggregation is specified, we are simply modifying the created variable's metadata, in this case the newly generated map vector.

Here is the DAS portion with just the aggregated Grid and the new coordinate variable:

   dsp_band_1 {
        Byte dsp_PixelType 1;
        Byte dsp_PixelSize 2;
        UInt16 dsp_Flag 0;
        UInt16 dsp_nBits 16;
        Int32 dsp_LineSize 0;
        String dsp_cal_name "Temperature";
        String units "Temp";
        UInt16 dsp_cal_eqnNumber 2;
        UInt16 dsp_cal_CoeffsLength 8;
        Float32 dsp_cal_coeffs 0.125, -4;
        Float32 scale_factor 0.125;
        Float32 add_off -4;
        filename {
            String units "Filename of the dataset";
        }
        dsp_band_1 {
        }
        lat {
            String name "lat";
            String long_name "latitude";
        }
        lon {
            String name "lon";
            String long_name "longitude";
        }
    }
    filename {
        String units "Filename of the dataset";
    }

Here also the map vector gets a copy of the coordinate variable's metadata.

We can also use this syntax in the case that netcdf@coordValue was used to autogenerate the coordinate variable:

<?xml version="1.0" encoding="UTF-8"?>
<netcdf title="joinNew Grid aggregation with coordValue and metadata">
  
  <aggregation type="joinNew" dimName="sample_time">
    <variableAgg name="dsp_band_1"/> 
    <netcdf location="data/ncml/agg/grids/f97182070958.hdf" coordValue="1"/> 
    <netcdf location="data/ncml/agg/grids/f97182183448.hdf" coordValue="10"/> 
    <netcdf location="data/ncml/agg/grids/f97183065853.hdf" coordValue="15"/>  
    <netcdf location="data/ncml/agg/grids/f97183182355.hdf" coordValue="25"/> 
  </aggregation> 
  
  <!-- Note: values are contrived -->
  <variable name="sample_time" shape="sample_time" type="double">
    <attribute name="units" type="string">Days since 01/01/2010</attribute>
  </variable>

</netcdf>

Here we see the metadata added to the new coordinate variable and associated map vector:

Attributes {
   dsp_band_1 {
        Byte dsp_PixelType 1;
        Byte dsp_PixelSize 2;
        UInt16 dsp_Flag 0;
        UInt16 dsp_nBits 16;
        Int32 dsp_LineSize 0;
        String dsp_cal_name "Temperature";
        String units "Temp";
        UInt16 dsp_cal_eqnNumber 2;
        UInt16 dsp_cal_CoeffsLength 8;
        Float32 dsp_cal_coeffs 0.125, -4;
        Float32 scale_factor 0.125;
        Float32 add_off -4;
        sample_time {
 --->           String units "Days since 01/01/2010";
        }
        dsp_band_1 {
        }
        lat {
            String name "lat";
            String long_name "latitude";
        }
        lon {
            String name "lon";
            String long_name "longitude";
        }
    }
    sample_time {
--->        String units "Days since 01/01/2010";
    }
}

Parse Errors

Since the processing of the aggregation takes a few steps, care must be taken in specifying the coordinate variable in the cases of autogenerated variables.

In particular, it is a Parse Error:

  • To specify the shape of the autogenerated coordinate variable if <values> are not set
  • To leave out the type or to use a type that does not match the autogenerated type

The second can be somewhat tricky to remember, since for existing variables the type can be safely left out and the variable will be "found". Since aggregations get processed fully when the <netcdf> element containing them is closed, the specified coordinate variables in these cases are placeholders for the automatically generated variables, so they must match the name and type, but not specify a shape, since the shape (the size of the new aggregation dimension) is not known until this occurs.


Metadata Specification on the Aggregation Variable Itself

It is also possible to add or modify the attributes on the aggregation variable itself. If it is a Grid, metadata can be modified on the contained array or maps as well. Note that the aggregated variable begins with the metadata from the first dataset specified in the aggregation just like in a union aggregation.

We will use a Grid as our primary example since other datatypes are similar and simpler and this case will cover those as well.

An Aggregated Grid example

Let's start from this example aggregation:

<?xml version="1.0" encoding="UTF-8"?>
<netcdf> 
  <aggregation type="joinNew" dimName="filename">
    <variableAgg name="dsp_band_1"/> 
    <netcdf location="data/ncml/agg/grids/f97182070958.hdf"/> 
    <netcdf location="data/ncml/agg/grids/f97182183448.hdf"/> 
    <netcdf location="data/ncml/agg/grids/f97183065853.hdf"/>  
    <netcdf location="data/ncml/agg/grids/f97183182355.hdf"/> 
  </aggregation> 
</netcdf>

Here is the DAS for this unmodified aggregated Grid (with the global dataset metadata removed):

Attributes {
   dsp_band_1 {
        Byte dsp_PixelType 1;
        Byte dsp_PixelSize 2;
        UInt16 dsp_Flag 0;
        UInt16 dsp_nBits 16;
        Int32 dsp_LineSize 0;
        String dsp_cal_name "Temperature";
        String units "Temp";
        UInt16 dsp_cal_eqnNumber 2;
        UInt16 dsp_cal_CoeffsLength 8;
        Float32 dsp_cal_coeffs 0.125, -4;
        Float32 scale_factor 0.125;
        Float32 add_off -4;
        filename {
        }
        dsp_band_1 {
        }
        lat {
            String name "lat";
            String long_name "latitude";
        }
        lon {
            String name "lon";
            String long_name "longitude";
        }
    }
    filename {
    }
}

We will now add attributes to all the existing parts of the Grid:

  • The Grid Structure itself
  • The Array of data within the Grid
  • Both existing map vectors (lat and lon)

We have already seen how to add data to the new coordinate variable as well.

Here's the NcML we will use. Note we have added units data to the subparts of the Grid, and also added some metadata to the grid itself.

<?xml version="1.0" encoding="UTF-8"?>
<netcdf title="Showing how to add metadata to all parts of an aggregated grid">
  
  <aggregation type="joinNew" dimName="filename">
    <variableAgg name="dsp_band_1"/> 
    <netcdf location="data/ncml/agg/grids/f97182070958.hdf"/> 
    <netcdf location="data/ncml/agg/grids/f97182183448.hdf"/> 
    <netcdf location="data/ncml/agg/grids/f97183065853.hdf"/>  
    <netcdf location="data/ncml/agg/grids/f97183182355.hdf"/> 
  </aggregation> 

  <variable name="dsp_band_1" type="Structure"> <!-- Enter the Grid level scope -->
    
1)  <attribute name="Info" type="String">This is metadata on the Grid itself.</attribute>
    
    <variable name="dsp_band_1"> <!-- Enter the scope of the Array dsp_band_1 -->
2)    <attribute name="units" type="String">Temp (packed)</attribute> <!-- Units of the array -->
    </variable> <!-- dsp_band_1.dsp_band_1 -->
    
    <variable name="lat"> <!-- dsp_band_1.lat map -->
3)    <attribute name="units" type="String">degrees_north</attribute>
    </variable> 
    
    <variable name="lon"> <!-- dsp_band_1.lon map -->
4)    <attribute name="units" type="String">degrees_east</attribute>
    </variable> <!-- dsp_band_1.lon map -->    
  </variable> <!-- dsp_band_1 Grid -->

  <!-- Note well: this is a new coordinate variable, so it requires the correct type.
  Also note that it falls outside of the actual Grid, since we must specify it
  as a sibling coordinate variable; it will be made into a Grid when the netcdf is closed.
  -->
  <variable name="filename" type="String">
5)  <attribute name="Info" type="String">Filename with timestamp</attribute>
  </variable> <!-- filename -->
 
</netcdf>

Here we show metadata being injected in several ways, denoted by the 1) -- 5) notations.

1) We are inside the scope of the top-level Grid variable, so this metadata will show up in the attribute table inside the Grid Structure.
2) This is the actual data Array of the Grid, dsp_band_1.dsp_band_1. We specify that the units are a packed temperature.
3) Here we are in the scope of a map variable, dsp_band_1.lat. We add the units specification to this map.
4) Likewise, we add units to the lon map vector.
5) Finally, we must close the actual Grid and specify the metadata for the NEW coordinate variable as a sibling of the Grid, since this will be used as the canonical prototype to be added to all Grids which are to be aggregated on the new dimension. Note that in this case (unlike previous cases) the type of the new coordinate variable is required, since we are specifying a "placeholder" variable for the new map until the Grid is actually processed once its containing <netcdf> is closed (i.e. all data is available to it).

The resulting DAS (with global dataset metadata removed for clarity):

Attributes {
... global data clipped ...
  dsp_band_1 {
        Byte dsp_PixelType 1;
        Byte dsp_PixelSize 2;
        UInt16 dsp_Flag 0;
        UInt16 dsp_nBits 16;
        Int32 dsp_LineSize 0;
        String dsp_cal_name "Temperature";
        String units "Temp";
        UInt16 dsp_cal_eqnNumber 2;
        UInt16 dsp_cal_CoeffsLength 8;
        Float32 dsp_cal_coeffs 0.125, -4;
        Float32 scale_factor 0.125;
        Float32 add_off -4;
 1)   String Info "This is metadata on the Grid itself.";
        filename {
 5)       String Info "Filename with timestamp";
        }
        dsp_band_1 {
2)        String units "Temp (packed)";
        }
        lat {
            String name "lat";
            String long_name "latitude";
3)        String units "degrees_north";
        }
        lon {
            String name "lon";
            String long_name "longitude";
4)        String units "degrees_east";
        }
    }
    filename {
5)    String Info "Filename with timestamp";
    }
}

We have annotated the DAS with numbers representing which lines in the NcML above correspond to the injected metadata.
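To make the scoping rules above concrete, the following Python sketch walks a trimmed copy of the NcML (the aggregation element and the 1) -- 5) markers are omitted) and prints the dotted scope in which each attribute is declared. It is only an illustration of how nested <variable> elements map to names such as dsp_band_1.lat; the function name attribute_scopes is hypothetical, and this is not how the NcML module itself is implemented.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the NcML above: only the metadata-bearing elements are kept.
NCML = """<netcdf>
  <variable name="dsp_band_1" type="Structure">
    <attribute name="Info" type="String">This is metadata on the Grid itself.</attribute>
    <variable name="dsp_band_1">
      <attribute name="units" type="String">Temp (packed)</attribute>
    </variable>
    <variable name="lat">
      <attribute name="units" type="String">degrees_north</attribute>
    </variable>
    <variable name="lon">
      <attribute name="units" type="String">degrees_east</attribute>
    </variable>
  </variable>
  <variable name="filename" type="String">
    <attribute name="Info" type="String">Filename with timestamp</attribute>
  </variable>
</netcdf>"""


def attribute_scopes(element, scope=()):
    """Yield (dotted scope, attribute name, value) for every <attribute> element."""
    for child in element:
        if child.tag == "attribute":
            yield ".".join(scope), child.get("name"), (child.text or "").strip()
        elif child.tag == "variable":
            # Entering a <variable> extends the current scope by its name.
            yield from attribute_scopes(child, scope + (child.get("name"),))


scoped = list(attribute_scopes(ET.fromstring(NCML)))
for where, name, value in scoped:
    print(f"{where}: {name} = {value!r}")
```

The five tuples are produced in document order, matching annotations 1) through 5).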

Dynamic Aggregations Using Directory Scanning

Rather than listing datasets explicitly, a powerful way to create a dynamic aggregation is to specify a data directory where the aggregation member datasets are stored, together with criteria for which files are to be added to the aggregation. These criteria are combined in a conjunction (an AND operator) to handle various types of searches. This is done by using the <scan> element inside the <aggregation> element.

A key benefit of using the <scan> element is that the NcML file need not change as new datasets are added to the aggregation, say by an automated process which simply writes new data files into a specific directory. By properly specifying the NcML aggregation with a scan, the same NcML will refer to a dynamically changing aggregation, staying up to date with current data, without the need for modifications to the NcML file itself. If the filenames have a timestamp encoded in them, the use of the dateFormatMark allows for automatic creation of the new coordinate variable data values as well, as shown below.

The scan element may be used to search a directory to find files that match the following criteria:

  • Suffix : the aggregated files end in a specific suffix, indicating the file type
  • Subdirectories: any subdirectories of the given location are to be searched and all regular files tested against the criteria
  • Older Than: the aggregated files must have last been modified more than a given duration ago (to exclude files that may currently be being written)
  • Reg Exp: the aggregated file pathnames must match a specific regular expression
  • Date Format Mark: this criterion, which is especially useful in conjunction with the others, allows the specification of a pattern in the filename which encodes a timestamp. The timestamp is extracted from the filenames using the pattern and is used to create ISO 8601 date elements for the new dimension's coordinate variable.
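The conjunction of criteria can be pictured with a small Python sketch. This is a simplified model, not the Hyrax implementation: the function matches_scan and its parameters are hypothetical, the regular expression is assumed to be applied as a search over the full pathname, and the "older than" test is assumed to be a strict comparison of file age in seconds.

```python
import re


def matches_scan(path, age_seconds, suffix=None, reg_exp=None, older_than_seconds=None):
    """Return True only if the file passes every criterion that was given (logical AND).

    Criteria left as None are treated as not specified and are skipped.
    """
    if suffix is not None and not path.endswith(suffix):
        return False
    # Assumption: the pattern may match anywhere in the pathname.
    if reg_exp is not None and re.search(reg_exp, path) is None:
        return False
    # Assumption: the file must be strictly older than the given duration.
    if older_than_seconds is not None and age_seconds <= older_than_seconds:
        return False
    return True


hdf_file = "data/ncml/agg/grids/f97182070958.hdf"
print(matches_scan(hdf_file, 3600, suffix=".hdf", older_than_seconds=600))  # all criteria pass
print(matches_scan(hdf_file, 60, suffix=".hdf", older_than_seconds=600))    # too recently modified
```

A file is rejected as soon as any one criterion fails, which is what combining the criteria with AND means in practice.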

We will give examples of each of these criteria in use below. First, we discuss the location specification.

Location (Location Location...)

The most important attribute of the scan element is scan@location, which specifies the top-level search directory for the scan, relative to the BES data root directory specified in the BES configuration.

IMPORTANT: ALL locations are interpreted relative to the BES root directory and NOT relative to the location of the NcML file itself! This means that all data to be aggregated must be in a subdirectory of the BES root data directory and that these directories must be specified fully, not relative to the NcML file.

For example, if the BES root data dir is "/usr/local/share/hyrax", let ${BES_DATA_ROOT} refer to this location. If the NcML aggregation file is in "${BES_DATA_ROOT}/data/ncml/myAgg.ncml" and the aggregation member datasets are in "${BES_DATA_ROOT}/data/hdf4/myAggDatasets", then the location in the NcML file for the aggregation data directory would be:

<scan location="data/hdf4/myAggDatasets"/>

which specifies the data directory relative to the BES data root as required.

Again, for security reasons, the data is always searched for under the BES data root. Trying to specify an absolute filesystem path, such as:

<scan location="/usr/local/share/data"/>

will NOT work. This path will also be interpreted as a subdirectory of ${BES_DATA_ROOT}, regardless of the leading "/" character.
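The rule can be modeled in a few lines of Python. This is a sketch of the behavior described above, not BES code: resolve_scan_location is a hypothetical helper, and a real server may perform additional checks that are not modeled here.

```python
import posixpath

# Example BES data root taken from the text above.
BES_DATA_ROOT = "/usr/local/share/hyrax"


def resolve_scan_location(location, data_root=BES_DATA_ROOT):
    """Interpret a scan location relative to the data root, even if it starts with '/'."""
    # Dropping leading slashes keeps an "absolute" location underneath the data root.
    relative = location.lstrip("/")
    return posixpath.join(data_root, relative)


print(resolve_scan_location("data/hdf4/myAggDatasets"))
print(resolve_scan_location("/usr/local/share/data"))
```

Under this model the second call yields a path below the data root, not the absolute path the author may have intended, which is why absolute locations do not work.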


Suffix Criterion

The simplest criterion is to match only files of a certain datatype in a given directory. This is useful for filtering out text files and other files that may exist in the directory but which do not form part of the aggregation data.

Here's a simple example:

<?xml version="1.0" encoding="UTF-8"?>
<netcdf title="Example of joinNew Grid aggregation using the scan element.">
  
  <aggregation type="joinNew" dimName="filename">
    <variableAgg name="dsp_band_1"/> 
    <scan location="data/ncml/agg/grids" suffix=".hdf"/>
  </aggregation> 
  
</netcdf>

Assuming that the specified location "data/ncml/agg/grids" contains no subdirectories, this NcML will aggregate all files in that directory that end in ".hdf", in alphanumerical order. In the case of our installed example data, there are four HDF4 files in that directory:

data/ncml/agg/grids/f97182070958.hdf
data/ncml/agg/grids/f97182183448.hdf
data/ncml/agg/grids/f97183065853.hdf
data/ncml/agg/grids/f97183182355.hdf 

These will be included in alphanumerical order, so the scan element will in effect be equivalent to the following list of <netcdf> elements:

<netcdf location="data/ncml/agg/grids/f97182070958.hdf"/> 
<netcdf location="data/ncml/agg/grids/f97182183448.hdf"/> 
<netcdf location="data/ncml/agg/grids/f97183065853.hdf"/>  
<netcdf location="data/ncml/agg/grids/f97183182355.hdf"/> 

By default, scan will search subdirectories, which is why we mentioned "grids has no subdirectories". We discuss this in the next section.


Subdirectory Searching (The Default!)

If the author specifies the scan@subdirs attribute to the value "true" (which is the default!), then the criteria will be applied recursively to any subdirectories of the scan@location base scan directory as well as to any regular files in the base directory.

For example, continuing our previous example, but giving a higher level location:

<?xml version="1.0" encoding="UTF-8"?>
<netcdf title="joinNew Grid aggregation using the scan element.">
  
  <aggregation type="joinNew" dimName="filename">
    <variableAgg name="dsp_band_1"/> 
    <!-- This will recurse into the "grids" subdir and grab all *.hdf files there -->
    <scan location="data/ncml/agg/" suffix=".hdf" subdirs="true"/>
  </aggregation> 
  
</netcdf>

Assuming that only the "grids" subdir of "data/ncml/agg" contains HDF4 files with that extension, the same aggregation as before will be created, i.e. an aggregation isomorphic to:

<netcdf location="data/ncml/agg/grids/f97182070958.hdf"/> 
<netcdf location="data/ncml/agg/grids/f97182183448.hdf"/> 
<netcdf location="data/ncml/agg/grids/f97183065853.hdf"/>  
<netcdf location="data/ncml/agg/grids/f97183182355.hdf"/> 

The scan@subdirs attribute is much more useful for turning off the default recursion. For example, if recursion is NOT desired, and only files with the given suffix in the given directory are required, the following will do that:

<?xml version="1.0" encoding="UTF-8"?>
<netcdf title="joinNew Grid aggregation using the scan element.">
  
  <aggregation type="joinNew" dimName="filename">
    <variableAgg name="dsp_band_1"/> 
    <!-- Find *.hdf files ONLY in the given location and NOT in subdirs -->
    <scan location="data/ncml/agg/grids" suffix=".hdf" subdirs="false"/>
  </aggregation> 
  
</netcdf>

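The effect of scan@subdirs can be sketched in Python over an in-memory file listing. This is a simplified model under stated assumptions: the scan function is hypothetical, the two non-HDF entries in FILES are invented for illustration, and the result is sorted with plain string comparison.

```python
import posixpath


def scan(listing, location, suffix, subdirs=True):
    """Select files under location that end in suffix, optionally recursing into subdirs."""
    base = location.rstrip("/")
    selected = []
    for path in listing:
        directory = posixpath.dirname(path)
        in_base = directory == base
        in_subdir = directory.startswith(base + "/")
        # Files in subdirectories are considered only when recursion is enabled.
        if not (in_base or (subdirs and in_subdir)):
            continue
        if path.endswith(suffix):
            selected.append(path)
    return sorted(selected)


FILES = [
    "data/ncml/agg/grids/f97182183448.hdf",
    "data/ncml/agg/grids/f97182070958.hdf",
    "data/ncml/agg/grids/README.txt",  # hypothetical non-data file
    "data/ncml/agg/joinNew.ncml",      # hypothetical file in the parent directory
]

recursive = scan(FILES, "data/ncml/agg/", ".hdf")               # default: recurse
flat = scan(FILES, "data/ncml/agg/", ".hdf", subdirs=False)     # base directory only
print(recursive)
print(flat)
```

With recursion the two HDF4 files in "grids" are found from the higher-level location; with subdirs set to false nothing is selected there, because the base directory itself holds no ".hdf" files.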

Order of Inclusion

In cases where a dateFormatMark is not specified, the member datasets are added to the aggregation in alphabetical order on the full pathname. This is important in the case of subdirectories since the path of the subdirectory is taken into account in the sort.

In cases where a dateFormatMark is specified, the extracted ISO 8601 timestamp is used as the sorting criterion, with older files being added before newer files.
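The two orderings can be compared with a short Python sketch. It assumes plain lexicographic string comparison for the pathname sort, which may differ from the collation the server actually uses. The function order_members, the file paths, and the timestamps are all hypothetical.

```python
def order_members(paths, timestamps=None):
    """Order aggregation members by full pathname, or by timestamp when one is available."""
    if timestamps is None:
        # No dateFormatMark: sort on the full pathname, subdirectories included.
        return sorted(paths)
    # dateFormatMark given: older files first. ISO 8601 strings in one fixed
    # format sort chronologically when compared as plain strings.
    return sorted(paths, key=lambda path: timestamps[path])


# Hypothetical members: one file in the base directory and two in subdirectories.
paths = ["data/agg/b/run1.hdf", "data/agg/a/run2.hdf", "data/agg/run0.hdf"]

# Hypothetical timestamps, as if extracted from the filenames.
timestamps = {
    "data/agg/run0.hdf": "2010-01-01T00:00:00Z",
    "data/agg/b/run1.hdf": "2010-01-02T00:00:00Z",
    "data/agg/a/run2.hdf": "2010-01-03T00:00:00Z",
}

by_path = order_members(paths)
by_time = order_members(paths, timestamps)
print(by_path)
print(by_time)
```

Note that in the pathname ordering run0.hdf comes last, because the subdirectory names "a" and "b" sort before the filename "run0.hdf"; in the timestamp ordering it comes first.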