OPULS: NOAA S3 Data Access: Difference between revisions

Revision as of 17:23, 29 August 2014

Overview

With NOAA funding OPeNDAP collaborated with Deirdre Byrne, Jeff Ogata , and John Relph. This NOAA team had created a publicly accessible S3 bucket of data files, along with a directed graph catalog built of XML files. We wrote software for Hyrax that allowed Hyrax to provide THREDDS (with XSLT) catalogs of the holdings along with DAP data access. This was accomplished by writing a special version of the OLFS that could traverse the S3 held catalogs and could utilize the BES gateway function to retrieve (and cache) data files from S3 and provide DAP access to them.

What follows is a description of how one might obtain, configure, and run the software against an appropriately populated S3 bucket provided by OPeNDAP.

Get

You will need to get and install Hyrax. The you'll need to compile and install a new java component.

Hyrax

Get and install the latest version of Hyrax 1.9.x or later. That is easiest since you can get binaries. If you must build from source use the Hyrax 1.9 release branch

Verify that Hyrax is working correctly before moving on.

S3 Service

The S3 DAP Service is really an alternate OLFS. To use it, you will need to check out the OLFS project from our subversion repository: https://scm.opendap.org/svn/branch/olfs/release/hyrax-1.9/

Build

Build the S3 DAP service:

ant -f s3-build.xml server

Install

Install the S3 DAP Service into the same Tomcat server into which you install the OLFS/Hyrax by copying the s3.war file that you just built into Tomcat's webapps directory.

cp build/dist/s3.war $CATALINA_HOME/webapps

Restart Tomcat. This will case the S3 DAP service to start and install it's default configuration into $CATALINA_HOME/content/s3

Configure

The configuration for the S3 DAP Service is located in the file $CATALINA_HOME/content/s3/s3.xml

<S3Config>
    <DapService context="/dap" />
    <CatalogService context="/catalog" />
    <CatalogCache>/Users/ndp/scratch/s3Test/catalogCache</CatalogCache>
    <Bucket name="opendap.test" context="/test" loadCatalogOnStart="true" />
    <!-- Bucket name="foo-s3cmd.nodc.noaa.gov" context="/foo-s3cmd" loadCatalogOnStart="true" / -->
</S3Config>

Run

Theory of Operation

The S3 Bucket

In which we encode catalog path data into the unique key values of each file in our bucket.

While an S3 bucket is truly "flat" in that each item in it is identified by a unique and opaque "key". However we utilize a convention in which we make the key value a file path with the data file's basename at the end.

This allows us to easily make a hierarchical directed graph style catalog where each node names its parent, leaves, and child nodes

Files

An S3 bucket must be constructed that contains a collection of files. When the files are added to the bucket the S3 bucket key value must be set to the catalog path of the file followed by it's base name.

<?xml version="1.0" encoding="UTF-8"?> <?xml-stylesheet type='text/xsl' href='/opendap.test//index.xsl'?> <index xmlns="http://nodc.noaa.gov/s3/catalog/1.0" base="https://s3.amazonaws.com/opendap.test" path="/data" name="opendap.test" delimiter="/" encoding="UTF-8">

 <folder name="nc" size="231402720" count="1"/>

</index>

The Software

The S3 DAP Service is a sibling to the OLFS and utilizes much of the code found there. It requires a running BES with all the appropriate data handlers and the gateway module installed. At startup the service reads it's configuration file which contains the context names for the DAP and Catalog services, the local filesystem location to be used for a catalog cache, and a collection of S3 bucket names to be used as source data collections. Collection names and bucket names are specified independently so that the S3 bucket need not be a path component in the URLs formed by the S3 DAP Service. Once configured, the catalogs are loaded by the server at startup where they are used to present THREDDS catalogs (with XSLT) for browser traversal.

So for example, an S3 bucket mapped by the configuration to the context "test" would have a top-level THREDDS catalog URL of http://localhost:8080/s3/catalog/test/catalog.xml and might return this THREDDS catalog:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type='text/xsl' href='/opendap.test//index.xsl'?>
<index xmlns="http://nodc.noaa.gov/s3/catalog/1.0" base="https://s3.amazonaws.com/opendap.test" path="" name="opendap.test" delimiter="/" encoding="UTF-8">
  <folder name="data" size="231402720" count="1"/>
</index>

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="/s3/xsl/threddsPresentation.xsl"?>
<catalog xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0" xmlns:xlink="http://www.w3.org/1999/xlink">
  <service name="dap" serviceType="OPeNDAP" base="/s3/dap" />
  <dataset name="/" ID="/s3/catalog/">
    <catalogRef name="data" xlink:title="data" xlink:type="simple" ID="/data/" xlink:href="/s3/catalog/test/data/catalog.xml" />
  </dataset>
</catalog>

Note: The returned XML contains a style sheet reference that allows browsers to render the THREDDS catalog as HTML

In support of the DAP services stack, the S3 DAP Service software maintains lists of the S3 resource ID's gleaned from the catalog and uses these to construct BES requests for DAP services by access the BES Gateway Module which can retrieve/cache HTTP resources and provide DAP services for them.

NOAA generated /index.xml files

These files are read by the Hyrax S3 DAP service and once ingested are used to generate data access URL's, and THREDDS catalogs (with XSLT) which allow users to navigate the holdings described in the /index.xml files.

Why name them "/index.xml"?: "The reason for the doubled delimiter // is to prevent confusion of the generated S3 index documents with documents uploaded to the bucket in the normal manner. As i'm sure you know, while // is equivalent to / in a UNIX path, in Amazon S3, bucket keys are simply strings following the convention that / is a directory delimiter, but with no interpretation or canonicalization happening at the S3 end. So in S3 it is possible to store a file named /index.xml as a distinct document from one named index.xml, and this is the strategy the script is employing to keep the index documents out of the namespace of the uploaded content that the indexes are providing a listing of." - Jeff Ogata; "When I recently created our test bucket I discovered that I needed to re-inperpret Jeff's statement. It is now the case, that while the keys are still just keys (opaque ID's), AWS has changed the GUI for S3 so that it shows the content as a directed graph of things informed by the collection of "/" separated strings in the keys. But AWS interprets the occurrence of "//" in the key as an "nameless" subdirectory. So by naming the catalog files "/index.xml" we are essentially creating nameless directory/folder (unsupported by most OS's) and dumping the index.xml file into it. Just another twist on the concept. " - ndp (talk)

Examples

These are simple files and rather than explain them in great detail I will just provide some minimal examples.

The following examples are taken from the S3 bucket "opendap.test".

The well known entry point into the catalog is the key "/index.xml"

/index.xml

key: /index.xml
URL: http://s3.amazonaws.com/opendap.test//index.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type='text/xsl' href='/opendap.test//index.xsl'?>
<index xmlns="http://nodc.noaa.gov/s3/catalog/1.0" base="https://s3.amazonaws.com/opendap.test" path="" name="opendap.test" delimiter="/" encoding="UTF-8">
  <folder name="data" size="231402720" count="1"/>
</index>

data//index.xml

key: data//index.xml
URL: http://s3.amazonaws.com/opendap.test/data//index.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type='text/xsl' href='/opendap.test//index.xsl'?>
<index xmlns="http://nodc.noaa.gov/s3/catalog/1.0" base="https://s3.amazonaws.com/opendap.test" path="/data" name="opendap.test" delimiter="/" encoding="UTF-8">
  <folder name="nc" size="231402720" count="1"/>
</index>

data/nc//index.xml

key: data/nc//index.xml
URL: http://s3.amazonaws.com/opendap.test/data/nc//index.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type='text/xsl' href='/opendap.test//index.xsl'?>
<index xmlns="http://nodc.noaa.gov/s3/catalog/1.0" base="https://s3.amazonaws.com/opendap.test" path="/data/nc" name="opendap.test" delimiter="/" encoding="UTF-8">
  <file name="coads_climatology.nc" last-modified="2008-05-29T16:31:52.000Z" size="3114044"/>
  <file name="fnoc1.nc" last-modified="2008-05-28T14:28:52.000Z" size="24230"/>
  <file name="sst.mnmean.nc" last-modified="2013-09-16T17:00:52.000Z" size="59547208"/>
  <file name="200803061600_HFRadar_USEGC_6km_rtv_SIO.nc" last-modified="2013-07-02T20:00:52.000Z" size="2590804"/>
  <file name="AG2006001_2006003_ssta.nc" last-modified="2013-07-02T20:00:52.000Z" size="21647220"/>
  <file name="MB2006001_2006001_chla.nc" last-modified="2013-07-02T20:00:52.000Z" size="140904652"/>
  <file name="a21160601.nc" last-modified="2013-07-02T20:00:52.000Z" size="3574848"/>
</index>

Additional Thing's That Could Be Done

Integrate with the Glacier Async Service so that glacier products are unpacked into a target S3 bucket with the appropriate key and from there served by the Hyrax S3 service. This would provide a tiered storage architecture where the most expensive (and fastest) storage is only used as local cache by each instance of the server.
In addition to reading the NOAA /index.xml files for the catalog this code could (should!) be extended to read THREDDS catalogs cached in S3 along side the data and provide the catalog service that way (or even both ways). This could probably be done by subclassing the openda.threddsHandler.StaticCatalogDispatch class (and surely other stuff too)

Questions

Ask us questions at support@opendap.org and we'll help as best we can.

@@ Line 31: / Line 31: @@
 == Configure ==
+The configuration for the S3 DAP Service is located in the file <code>$CATALINA_HOME/content/s3/s3.xml</code>
 <source lang="xml">