AIS Using NcML

From OPeNDAP Documentation
Revision as of 21:39, 9 September 2009 by Mjohnson (talk | contribs)


This and the BES Aggregation using NcML page go hand-in-hand. The essential idea is to use NcML as a syntax to describe both aggregations of data sets (e.g., HDF4 files) and ancillary information that should be added to a data set. The motivation for using NcML is to not invent a new syntax and instead build on an accepted one, maybe adding new features where we need them.

NOTE: Version 0.9.0 of the NcML Module in the svn tree implements this design, plus some other features (such as the creation of new variables). Please see the Wiki docs for the handler at: BES_-_Modules_-_NcML_Module

1 Use Cases

  1. Add the NcML handler to the BES
  2. Add attributes to a single data set
  3. Adding one or more attributes to a group of data sets. This use case is not complete, since the scan element is not defined outside of an aggregation element.
  4. Using the NcML Handler to get information

2 Definitions

AIS
Ancillary Information Service
Hyrax
Hyrax is the next generation server from OPeNDAP. It utilizes a modular design that employs a lightweight Java servlet (aka OLFS) to provide the publicly accessible client interface, and a back-end daemon, the BES, to handle the heavy lifting.
BES
OPeNDAP Back-End Server (BES) is a high-performance back-end server software framework that allows data providers more flexibility in providing end users views of their data. The current OPeNDAP data objects (DAS, DDS, and DataDDS) are still supported, but now data providers can add new data views, provide new functionality, and new features to their end users through the BES modular design. Providers can add new data handlers, new data objects/views, the ability to define views with constraints and aggregation, the ability to add reporting mechanisms, initialization hooks, and more.
OLFS
The OPeNDAP Lightweight Frontend Servlet (OLFS) provides the public-accessible client interface for Hyrax. The OLFS communicates with the Back End Server (BES) to provide data and catalog services to clients. The OLFS implements the DAP2 protocol and supports some of the new DAP4 features.
Aggregation
A single data set (i.e., something referenced by a single DAP URL) that is actually made up of two or more discrete things, each of which (potentially at least) has its own DAP URL.
Data set
Anything that can be referenced by a DAP URL and that will return the DAP responses when requested.
NcML
NetCDF Markup Language: an XML syntax for ancillary data (attributes and variables) and aggregations, used by the THREDDS Data Server (TDS).
NetCDF
NetCDF (network Common Data Form) is a set of software libraries and machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data.
HDF
Hierarchical Data Format (HDF) is provided by The HDF Group. The HDF Group provides a unique suite of technologies and supporting services that make possible the management of large and complex data collections. Its mission is to advance and support HDF technologies and ensure long-term access to HDF data.
WCS
The OpenGIS® Web Coverage Service Interface Standard (WCS) defines a standard interface and operations that enables interoperable access to geospatial "coverages" [1]. The term "grid coverages" typically refers to content such as satellite images, digital aerial photos, digital elevation data, and other phenomena represented by values at each measurement point.

3 Background

This new BES handler will be used to introduce new attributes into data sets for the IOOS/WCS project and for the REAP project. In the first case, the augmented DDX response generated by the handler will be filtered through XSLT to produce a WCS response of one form or another. In the second case, the DDX will be filtered to produce an EML document. So, this handler and the collection(s) of XML/NcML/? documents will be an important part of several projects we're working on. Beyond these two projects, this handler will provide important features to Hyrax.

3.1 Hyrax & BES Documentation

3.2 NcML Information

Here are links that describe NcML 2.2:

Notes:

  1. NcML 2.2 is based on the Common Data Model (CDM) and thus includes Groups and shared dimensions, which DAP 3.2 does not support. We will want to elide those features until DAP 4 is done and well supported.

4 Design

Consider a Hyrax server with a single URL, such as http://test.opendap.org/dap/data/nc/fnoc1.nc, from which you can get the usual set of DAP data products (DAS, DDS, DataDDS, ASCII, HTML form and Info). Adding the NcML Handler to that server's BES and writing a suitable NcML file (e.g., /data/ncml/fnoc_improved.ncml) would give that server a second URL, which to a DAP client would look like http://test.opendap.org/dap/data/ncml/fnoc_improved.ncml. A DAP client could get the usual cast of suspects for this URL, too. Let's assume that the /data/ncml/fnoc_improved.ncml file adds some attributes to the /data/nc/fnoc1.nc data file. The file /data/ncml/fnoc_improved.ncml would then look something like:

<netcdf location="/data/nc/fnoc1.nc">
    <variable name="u">
        <attribute type="int32" name="max" value="2000"/>
        <attribute type="int32" name="min" value="0"/>
    </variable>
</netcdf>

The BES would trigger the NcML handler to read this file. The handler would see that the value of location is '/data/nc/fnoc1.nc', so it would invoke the BES *within which it's running* to get the needed DAP object. The BES would sort out how to do that and just go do it, returning the right thing to the NcML handler, which would then parse the rest of the NcML file, stuff the additional info into the DAS/DDS/DDX, and return the end result.

The NcML handler will use the NcML document to find a 'source' data file and read a DAP object from it and then augment that DAP object using information in the NcML file. Because the NcML handler will use the BES to get the DAP objects, it will be able to add information to any file served by the BES, including those that are served by custom or 'one-off' handlers. This will make the NcML handler very flexible.
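The first step described above, reading the NcML file to find the source location and the attributes to add, can be sketched in a few lines. This is only an illustration in Python using the standard library; the actual handler runs inside the BES, and the function name parse_ncml is hypothetical. The sketch assumes the un-namespaced NcML shown in the example above; a document that declares the NcML XML namespace would need namespace-aware lookups.

```python
import xml.etree.ElementTree as ET

# The example NcML from this section.
NCML = """\
<netcdf location="/data/nc/fnoc1.nc">
    <variable name="u">
        <attribute type="int32" name="max" value="2000"/>
        <attribute type="int32" name="min" value="0"/>
    </variable>
</netcdf>
"""


def parse_ncml(text):
    """Return (location, additions) for an un-namespaced NcML document.

    additions maps each variable name to a list of
    (attribute name, NcML type, value) tuples, in document order.
    """
    root = ET.fromstring(text)
    location = root.get("location")
    additions = {}
    for var in root.findall("variable"):
        additions[var.get("name")] = [
            (a.get("name"), a.get("type"), a.get("value"))
            for a in var.findall("attribute")
        ]
    return location, additions


location, additions = parse_ncml(NCML)
print(location)
print(additions)
```

The location value is what the handler would pass back to the BES to obtain the source DAP object; the additions are what it would later merge into that object.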

4.1 Example responses

Suppose the fnoc1.nc data set returns the following DDX:

<?xml version="1.0" encoding="UTF-8"?>
<Dataset name="fnoc1.nc"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://xml.opendap.org/ns/DAP2"
xsi:schemaLocation="http://xml.opendap.org/ns/DAP2  http://xml.opendap.org/dap/dap2.xsd">

    <Array name="u">
        <Attribute name="units" type="String">
            <value>meter per second</value>
        </Attribute>
        <Attribute name="long_name" type="String">
            <value>Vector wind eastward component</value>
        </Attribute>
        <Attribute name="missing_value" type="String">
            <value>-32767</value>
        </Attribute>
        <Attribute name="scale_factor" type="String">
            <value>0.005</value>
        </Attribute>
        <Int16/>
        <dimension name="time_a" size="16"/>
        <dimension name="lat" size="17"/>
        <dimension name="lon" size="21"/>
    </Array>

    <!-- Other variables, if any, omitted -->
</Dataset>

Here's the DDX that would be returned when accessing the fnoc_improved.ncml data set. (I emphasize 'data set' because the NcML file essentially defines a new data set, and the 'old' data set, i.e., fnoc1.nc, is still available using its URL.) [mjohnson: is the dataset name in the modified ddx correct as referring still to the underlying file, or do we want it to be "fnoc_improved.ncml"?]

<?xml version="1.0" encoding="UTF-8"?>
<Dataset name="fnoc1.nc"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://xml.opendap.org/ns/DAP2"
xsi:schemaLocation="http://xml.opendap.org/ns/DAP2  http://xml.opendap.org/dap/dap2.xsd">

    <Array name="u">
        <Attribute name="units" type="String">
            <value>meter per second</value>
        </Attribute>
        <Attribute name="long_name" type="String">
            <value>Vector wind eastward component</value>
        </Attribute>
        <Attribute name="missing_value" type="String">
            <value>-32767</value>
        </Attribute>
        <Attribute name="scale_factor" type="String">
            <value>0.005</value>
        </Attribute>

        <!-- Here is the added stuff -->

        <Attribute name="max" type="Int32">
            <value>2000</value>
        </Attribute>
        <Attribute name="min" type="Int32">
            <value>0</value>
        </Attribute>

        <!-- End of the added stuff -->

        <Int16/>
        <dimension name="time_a" size="16"/>
        <dimension name="lat" size="17"/>
        <dimension name="lon" size="21"/>
    </Array>

    <!-- Other variables, if any, omitted -->
</Dataset>
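The difference between the two responses above can be reproduced with a short sketch. This is an illustration in Python, not the handler's implementation: the real handler manipulates C++ DAP objects rather than XML text. The function name add_attributes is hypothetical, the DDX is abbreviated, and the rule that the NcML type "int32" becomes the DAP type "Int32" by capitalizing the first letter is an assumption drawn only from the example in this section.

```python
import xml.etree.ElementTree as ET

DAP2 = "http://xml.opendap.org/ns/DAP2"
ATTRIBUTE = f"{{{DAP2}}}Attribute"
VALUE = f"{{{DAP2}}}value"
ET.register_namespace("", DAP2)  # serialize without an ns0: prefix

# Abbreviated version of the fnoc1.nc DDX shown above.
DDX = """\
<Dataset name="fnoc1.nc" xmlns="http://xml.opendap.org/ns/DAP2">
    <Array name="u">
        <Attribute name="units" type="String">
            <value>meter per second</value>
        </Attribute>
        <Int16/>
        <dimension name="time_a" size="16"/>
    </Array>
</Dataset>
"""


def add_attributes(ddx_text, var_name, new_attrs):
    """Insert new Attribute elements after the existing ones of var_name."""
    root = ET.fromstring(ddx_text)
    for var in root:
        if var.get("name") != var_name:
            continue
        # Find the slot just past the last existing Attribute child.
        index = 0
        for i, child in enumerate(var):
            if child.tag == ATTRIBUTE:
                index = i + 1
        for name, ncml_type, value in new_attrs:
            # Assumed type mapping: "int32" -> "Int32".
            attr = ET.Element(ATTRIBUTE,
                              {"name": name, "type": ncml_type.capitalize()})
            ET.SubElement(attr, VALUE).text = value
            var.insert(index, attr)
            index += 1
    return root


augmented = add_attributes(
    DDX, "u", [("max", "int32", "2000"), ("min", "int32", "0")])
print(ET.tostring(augmented, encoding="unicode"))
```

Inserting after the last existing Attribute keeps the new attributes ahead of the type and dimension elements, matching the layout of the augmented DDX above.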

4.2 Detailed Design

Control Flow in the NcML AIS Handler

The overall design of the NcML AIS handler is shown to the right in a UML Activity diagram. First, the handler receives a request for a certain response given a specific container. In general a handler can get a request that involves several containers, but not this handler, at least not in the initial versions. The request is then classified as one for metadata (a DDS, DAS or DDX) or one for data (DataDDS). In the latter case, the NcML is parsed only to determine the value of the netcdf@location attribute; that data source's DataDDS is accessed using the BES, and that response is returned by this handler. In the case of a metadata request, the DDX response is sought for the data source named in the @location attribute and then augmented with information in the NcML file. The result is used to return one of the three DAP2/3/4 metadata responses.

Here is a high resolution version of the activity diagram shown to the right.
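The control flow described above can be summarized as a small dispatch function. This Python sketch is only a model of the branching: fake_bes_fetch is a hypothetical stand-in for asking the BES for a response, the DDX is reduced to a dictionary of attributes per variable, and the final conversion of the augmented DDX into a DAS or DDS is represented by simply tagging the result with the requested response type.

```python
import xml.etree.ElementTree as ET

NCML = """\
<netcdf location="/data/nc/fnoc1.nc">
    <variable name="u">
        <attribute type="int32" name="max" value="2000"/>
        <attribute type="int32" name="min" value="0"/>
    </variable>
</netcdf>
"""

METADATA_RESPONSES = ("DAS", "DDS", "DDX")


def fake_bes_fetch(location, response_type):
    """Hypothetical stand-in for asking the BES for a response."""
    if response_type == "DataDDS":
        return f"<DataDDS for {location}>"
    # A toy DDX: variable name -> {attribute name: value}.
    return {"u": {"units": "meter per second"}}


def augment(ddx, ncml_root):
    """Copy the attributes named in the NcML into the toy DDX."""
    for var in ncml_root.findall("variable"):
        target = ddx.setdefault(var.get("name"), {})
        for attr in var.findall("attribute"):
            target[attr.get("name")] = attr.get("value")
    return ddx


def handle_request(response_type, ncml_text, fetch):
    ncml_root = ET.fromstring(ncml_text)
    location = ncml_root.get("location")
    if response_type == "DataDDS":
        # Data request: the NcML is used only to find the source.
        return fetch(location, "DataDDS")
    if response_type in METADATA_RESPONSES:
        # Metadata request: start from the source DDX, then augment it.
        ddx = augment(fetch(location, "DDX"), ncml_root)
        return (response_type, ddx)
    raise ValueError(f"unsupported response type: {response_type}")


print(handle_request("DataDDS", NCML, fake_bes_fetch))
print(handle_request("DAS", NCML, fake_bes_fetch))
```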

Important points for this design:

4.2.1 NcML AIS: Build a BES Handler

This describes how to build a basic DAP handler for the BES

4.2.2 Parse NcML

Since NcML was designed for the netCDF data model and that does not match exactly the DAP data model, how should various parts of NcML be used by this handler?

4.2.3 Get the Response from the BES

How do you get a response object 'within the BES'? In other words, one BES typically has a number of data handlers installed for a variety of data formats (netcdf, hdf4, etc.), and it also has response handlers that transmit standard DAP responses, like the ASCII module (which takes a DataDDS response and transmits it in an ASCII text/plain type MIME document). In the case of the NCML module, we need the ability to take the different datasets represented in the ncml file and hand them off to their respective data/request handlers. For each dataset in the ncml file we need to determine which data/request handler handles that particular dataset, given the type of data in the dataset (nc for netcdf, h5 for hdf5, etc.). Once we find the data/request handler for that dataset, we hand off the request to that handler to fill in the response object (DAS, DDS, DataDDS).
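Choosing a handler from the type of data in a dataset can be pictured as a lookup from file name pattern to handler name. The Python sketch below is an assumption-laden illustration: the pattern table and the function find_handler are hypothetical, and the patterns are examples rather than the configuration of any real server.

```python
import re

# Hypothetical table: handler name -> pattern on the dataset location.
HANDLER_PATTERNS = [
    ("nc", re.compile(r".*\.nc$")),
    ("h4", re.compile(r".*\.hdf$")),
    ("h5", re.compile(r".*\.h5$")),
]


def find_handler(location):
    """Return the name of the first handler whose pattern matches."""
    for name, pattern in HANDLER_PATTERNS:
        if pattern.match(location):
            return name
    raise LookupError(f"no handler registered for {location}")


print(find_handler("/data/nc/fnoc1.nc"))
```

A first-match table keeps the ordering explicit, which matters if two patterns could match the same file name.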

To do this we will do the following. For each location (dataset) in the ncml file (let's assume local datasets for now), create a BESContainer object in the catalog BESContainerStorage.

  1. Find the "catalog" container storage from the BESContainerStorageList::TheList().
  2. Call add_container given the symbolic name of the dataset (which could be constructed given the symbolic name of the ncml container and the dataset basename) and the location of the dataset (relative to the document root specified in the bes.conf file).
  3. Call look_for on that container storage to get the container just created.
  4. Hang on to the container currently in the dhi. We'll need to set it back when we're done.
  5. Set the container in the dhi to the container just created.
  6. Call BESRequestHandlerList::TheList()->execute_current( dhi ) and the BES framework will take care of the rest
  7. Set the dhi.container back to the ncml container we saved off, and move on to the next location (could probably do this outside of the loop iterating over locations)
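The steps above follow a save, swap, execute, restore pattern, which can be modeled as follows. This Python sketch is not BES code: ContainerStorage, DataHandlerInterface and execute_current are hypothetical stand-ins for BESContainerStorage, the dhi and BESRequestHandlerList::TheList()->execute_current( dhi ), and the argument list of add_container is an assumption. It restores the ncml container once, after the loop, as step 7 suggests is possible.

```python
import posixpath


class ContainerStorage:
    """Stand-in for the 'catalog' container storage."""

    def __init__(self):
        self._containers = {}

    def add_container(self, sym_name, real_name, data_type):
        self._containers[sym_name] = {
            "name": sym_name, "path": real_name, "type": data_type}

    def look_for(self, sym_name):
        return self._containers[sym_name]


class DataHandlerInterface:
    """Stand-in for the dhi: holds the current container and responses."""

    def __init__(self, container):
        self.container = container
        self.responses = []


def execute_current(dhi):
    """Stand-in for dispatching to the handler of the current container."""
    dhi.responses.append(f"response for {dhi.container['path']}")


def load_locations(dhi, storage, locations):
    """Run each (location, data type) pair through the 'BES'."""
    ncml_container = dhi.container                      # step 4: save
    try:
        for location, data_type in locations:
            sym = f"{ncml_container['name']}_{posixpath.basename(location)}"
            storage.add_container(sym, location, data_type)  # step 2
            dhi.container = storage.look_for(sym)            # steps 3 and 5
            execute_current(dhi)                             # step 6
    finally:
        dhi.container = ncml_container                  # step 7: restore


storage = ContainerStorage()
ncml = {"name": "fnoc_improved", "path": "/data/ncml/fnoc_improved.ncml",
        "type": "ncml"}
dhi = DataHandlerInterface(ncml)
load_locations(dhi, storage, [("/data/nc/fnoc1.nc", "nc")])
print(dhi.responses)
```

Restoring in a finally clause means the dhi points back at the ncml container even if one of the source datasets fails to load.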

Please refer to NCMLRequestHandler in http://scm.opendap.org/trac/browser/tags/ncml_module/initial, which is a test module that I created to do this work.

Of course, as mentioned above, this assumes that all of the datasets are local. There could be datasets that are handled on a different machine. For this, we could hand off those files to a BES running on the remote machine. Also, if there are many files in the ncml we could split them up and hand off some of the datasets to a different BES process running on this machine. We could do this in the NCML module, or we could work to create a general solution for this within the BES framework. The latter is appealing to me (Patrick).

4.2.4 Augment the Response

Once the handler has the DDX response, what does it need to do to insert new information in the C++ object(s)? Another question is, what if there are multiple containers? And what if those datasets are aggregated? We would need to add the new attributes to the resulting aggregated dap object.

5 Deliverables

  1. The NcML handler. It will run in the BES.
  2. Instructions on how to use said handler.

6 Period of use

This will be used for the remainder of the IOOS and REAP projects and hopefully for a long time thereafter.