OPULS: NOAA S3 Data Access

Revision as of 17:45, 28 August 2014

Overview

With NOAA funding, OPeNDAP collaborated with Deirdre Byrne, Jeff Ogata, and John Relph. This NOAA team had created a publicly accessible S3 bucket of data files, along with a directed-graph catalog built of XML files. We wrote software for Hyrax that allows it to provide HTML and THREDDS catalog pages along with DAP data access. This was accomplished by writing a special version of the OLFS that can traverse the S3-held catalogs and use the BES gateway function to retrieve (and cache) data files from S3 and provide DAP access to them.

What follows is a description of how one might obtain, configure, and run the software against an appropriately populated S3 bucket provided by OPeNDAP.

Get

You can get the software from our Subversion repository. You will need

Build

Install

Configure

Run

Theory of Operation

NOAA generated /index.xml files

Why name them "/index.xml"?

"The reason for the doubled delimiter // is to prevent confusion of the generated S3 index documents with documents uploaded to the bucket in the normal manner. As I'm sure you know, while // is equivalent to / in a UNIX path, in Amazon S3, bucket keys are simply strings following the convention that / is a directory delimiter, but with no interpretation or canonicalization happening at the S3 end. So it is possible to store //index.xml as a distinct document from /index.xml, and this is the strategy the script is employing to keep the index documents out of the namespace of the uploaded content that the indexes are providing a listing of." - Jeff Ogata

When I recently created our test bucket, I discovered that I needed to reinterpret Jeff's statement. The keys are still just keys (opaque IDs), but AWS has changed the S3 GUI so that it shows the content as a directed graph informed by the "/"-separated strings in the keys. AWS interprets an occurrence of "//" in a key as a "nameless" subdirectory. So by naming the catalog files "/index.xml" we are essentially creating a nameless directory/folder (unsupported by most operating systems) and putting the index.xml file into it. Just another twist on the concept.
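The key naming described above can be illustrated with a minimal Python sketch. It is not part of the Hyrax/OLFS code; the helper names index_key and object_url are hypothetical, and the rule they encode is inferred only from the three example keys shown on this page. The last line shows how a POSIX-style path normalizer would collapse the doubled delimiter, which is exactly the canonicalization that S3 does not perform.

```python
import posixpath

# Name of the index document as it appears in the example keys on this page.
INDEX_NAME = "/index.xml"


def index_key(folder_path: str) -> str:
    """Return the S3 key of the index document for a catalog folder.

    folder_path is the value of the index's "path" attribute,
    e.g. "" for the bucket root or "/data/nc" for a subfolder.
    (Hypothetical helper; the rule is inferred from the examples.)
    """
    folder = folder_path.strip("/")
    prefix = folder + "/" if folder else ""
    return prefix + INDEX_NAME


def object_url(base: str, key: str) -> str:
    """Join the bucket base URL and a key without any canonicalization."""
    return base + "/" + key


if __name__ == "__main__":
    base = "https://s3.amazonaws.com/opendap.test"
    for path in ("", "/data", "/data/nc"):
        key = index_key(path)
        print(repr(key), object_url(base, key))

    # A UNIX-style normalizer treats "//" as "/", so it would map the index
    # key onto the ordinary key "data/index.xml". S3 keeps the two distinct.
    print(posixpath.normpath("data//index.xml"))
```

Because the keys are compared as plain strings, "data//index.xml" and "data/index.xml" can both exist in the same bucket, so any client code that builds these URLs should avoid path-normalizing helpers.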

Examples

These are simple files, so rather than explain them in great detail I will just provide some minimal examples.

The following examples are taken from the S3 bucket "opendap.test".

The well-known entry point into the catalog is the key "/index.xml".

/index.xml

key: /index.xml
URL: http://s3.amazonaws.com/opendap.test//index.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type='text/xsl' href='/opendap.test//index.xsl'?>
<index xmlns="http://nodc.noaa.gov/s3/catalog/1.0" base="https://s3.amazonaws.com/opendap.test" path="" name="opendap.test" delimiter="/" encoding="UTF-8">
  <folder name="data" size="231402720" count="1"/>
</index>

data//index.xml

key: data//index.xml
URL: http://s3.amazonaws.com/opendap.test/data//index.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type='text/xsl' href='/opendap.test//index.xsl'?>
<index xmlns="http://nodc.noaa.gov/s3/catalog/1.0" base="https://s3.amazonaws.com/opendap.test" path="/data" name="opendap.test" delimiter="/" encoding="UTF-8">
  <folder name="nc" size="231402720" count="1"/>
</index>

data/nc//index.xml

key: data/nc//index.xml
URL: http://s3.amazonaws.com/opendap.test/data/nc//index.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type='text/xsl' href='/opendap.test//index.xsl'?>
<index xmlns="http://nodc.noaa.gov/s3/catalog/1.0" base="https://s3.amazonaws.com/opendap.test" path="/data/nc" name="opendap.test" delimiter="/" encoding="UTF-8">
  <file name="coads_climatology.nc" last-modified="2008-05-29T16:31:52.000Z" size="3114044"/>
  <file name="fnoc1.nc" last-modified="2008-05-28T14:28:52.000Z" size="24230"/>
  <file name="sst.mnmean.nc" last-modified="2013-09-16T17:00:52.000Z" size="59547208"/>
  <file name="200803061600_HFRadar_USEGC_6km_rtv_SIO.nc" last-modified="2013-07-02T20:00:52.000Z" size="2590804"/>
  <file name="AG2006001_2006003_ssta.nc" last-modified="2013-07-02T20:00:52.000Z" size="21647220"/>
  <file name="MB2006001_2006001_chla.nc" last-modified="2013-07-02T20:00:52.000Z" size="140904652"/>
  <file name="a21160601.nc" last-modified="2013-07-02T20:00:52.000Z" size="3574848"/>
</index>
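
To show how a client might traverse these index documents, here is a minimal Python sketch. It is not the OLFS implementation. The functions parse_index and walk, the in-memory DOCS table (abbreviated copies of the three documents above, standing in for HTTP requests), and the assumption that a data file's URL is base + path + "/" + name are all illustrative choices rather than anything confirmed by the NOAA catalog definition.

```python
import xml.etree.ElementTree as ET

# Namespace used by the example index documents on this page.
NS = {"c": "http://nodc.noaa.gov/s3/catalog/1.0"}
_HEAD = ('<index xmlns="http://nodc.noaa.gov/s3/catalog/1.0" '
         'base="https://s3.amazonaws.com/opendap.test" '
         'name="opendap.test" delimiter="/" ')

# Abbreviated stand-ins for the three index documents, keyed by S3 key.
DOCS = {
    "/index.xml": _HEAD + 'path=""><folder name="data"/></index>',
    "data//index.xml": _HEAD + 'path="/data"><folder name="nc"/></index>',
    "data/nc//index.xml": _HEAD + 'path="/data/nc">'
        '<file name="coads_climatology.nc" size="3114044"/>'
        '<file name="fnoc1.nc" size="24230"/></index>',
}


def parse_index(xml_text):
    """Return (base, folder names, [(file name, size), ...]) for one index."""
    root = ET.fromstring(xml_text)
    folders = [f.get("name") for f in root.findall("c:folder", NS)]
    files = [(f.get("name"), int(f.get("size")))
             for f in root.findall("c:file", NS)]
    return root.get("base"), folders, files


def walk(fetch, path=""):
    """Yield (url, size) for every file reachable from the index at path.

    fetch is any callable that maps an S3 key to the index XML text.
    """
    folder = path.strip("/")
    key = (folder + "/" if folder else "") + "/index.xml"
    base, folders, files = parse_index(fetch(key))
    for name, size in files:
        # Assumption: data files live at base + path + "/" + name.
        yield f"{base}{path}/{name}", size
    for name in folders:
        yield from walk(fetch, f"{path}/{name}")


if __name__ == "__main__":
    for url, size in walk(DOCS.__getitem__):
        print(size, url)
```

Passing the fetch function in as a parameter keeps the traversal logic separate from the transport, so the same walk could be pointed at a real bucket by supplying a callable that performs the HTTP GET.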

Additional Things That Could Be Done

Integrate with the Glacier Async Service so that Glacier products are unpacked into a target S3 bucket with the appropriate key and served from there by the Hyrax S3 service. This would provide a tiered storage architecture in which the most expensive (and fastest) storage is used only as a local cache by each instance of the server.
In addition to reading the NOAA /index.xml files for the catalog, this code could (should!) be extended to read THREDDS catalogs cached in S3 alongside the data and provide the catalog service that way (or even both ways). This could probably be done by subclassing the opendap.threddsHandler.StaticCatalogDispatch class (and surely other things as well).

Questions

Ask us questions at support@opendap.org and we'll help as best we can.
