DAP4: Async Service for AWS Glacier

From OPeNDAP Documentation
⧼opendap2-jumptonavigation⧽

Background

As part of the DAP4 specification we have defined a mechanism through which clients and servers can negotiate access to data resources that are "near-line". These data resources may be held in a tape drive system or some other system that takes an intermediate amount of time (a few hours to a few days) to extract a stored resource. The Amazon Web Services Glacier system is an excellent example of such a near-line system and we started developing a prototype service to provide DAP4 client access to Glacier holdings.

Overview

Here are some Glacier axioms (things which we hold to be true, even if they make life hard):

  1. The Glacier service is organized into "vaults" (a named container similar to the AWS S3 bucket) into which things (data resources, or in Glacier speak "archives" which are similar to the AWS S3 "object"s) are put
  2. The Glacier service typically takes about 4 hours to respond to a request for an "archive".
  3. The Glacier service builds an "inventory" of each vault's contents every 24 hours.
  4. Requests for the current vault inventory also take 4 hours to complete.
  5. When an archive is uploaded to Glacier the name (aka archive-id) under which it is stored is determined by Glacier and returned when the upload is finished. So if you ever want to get that archive back you either have to remember the archive-id locally or be prepared to find it in the not so easy to get inventory.
  6. When you upload an archive you can provide a description of it. The AWS documentation suggests that you use your own resourceId (for example a local path/file name string) so that if you do need to rebuild your local representation of the archive you can do so from a Glacier inventory (which contains both the archive-ids and the descriptions).
  7. Glacier will happily allow you to upload the same resource over and over again. Each time you upload the resource you will get a new archive-id for it. The previously uploaded copies will remain in the vault associated with their previously assigned archive-ids unless you actively delete them.
  8. Archives can be deleted.
  9. Vaults can be deleted.
  10. Uploading archives, deleting archives, and deleting vault operations do not require asynchronous behavior on the part of the client. In other words these operations happen in "real time".
  11. Retrieving archives and vault inventories require asynchronous behavior capabilities on the part of the client

(And the Real Numbers are defined with only 16 axioms)

Design Issues

In order to utilize Glacier as part of a DAP4 service several things need to be in place: A catalog system, some kind of database that associates glacier holdings with resources identified in the catalog, and a system for holding retrieved glacier archives/resources for use by the service.

Catalog

catalog
A directed graph that embodies a hierarchical representation of a collection of data resources.

Some abstract representation of the Glacier holdings will be required. While catalogs are not part of the DAP4 spec we know that in general DAP servers return catalogs in the form of THREDDS catalogs or as HTML pages that represent catalog nodes and that contain links to data resources and other catalog nodes. While a Glacier vault might simply be represented by a list of archive-ids this is not a representation that users of data resources would find very useful, especially when coupled with the delay in discovering anything about any single resource.

The catalog should be close to the server and quickly accessible.

Archive Records

ArchiveId
The string which is returned by Glacier to identify an uploaded archive and through which said archive may be retrieved from Glacier. These appear to be GUIDs. Example: XIfGXl0qWkfDVbKL3Ibb6XQ8OBC44YwOBGl9ds-yA18HN8zDOBi1A2ZugWprFHgwL-8LRQLSfz8BqoJLaMdaflPgM-J-_D0gHQwS_CDHXp5R0vHUADuZKd49oP4eir2qw508EJ57kQ
ResourceId
A string that is used by a DAP server to identify a particular (data) resource. Typically a path/file name, but in practice this would appear as a local path fragment in a DAP URL after the domain-port-service part of the URL. For example in the url http://test.opendap.org:8080/opendap/hyrax/data/nc/fnoc1.nc the resourceId would be data/nc/fnoc1.nc
Archive Record
A record (as in a file or database entry) that contains at minimum a resource's resourceId and archiveId

If all that was required was to associate the resourceId and archiveId it might seem reasonable to just include this small set of information in the persisted catalog for the Glacier vault. However because retrieving anything from Glacier costs time (and money) it is probably important to capture some kind of information about the each archived resource locally and in such a way that users can inspect resource metadata without requesting that the archived resource be retrieved just to find out that the resource is of little or no interest.

Suggestion: At minimum store the dap4:DMR (or dap2:DDX) document in the local Archive Record. Consider also storing the names and values of any MAP arrays found in the dataset. Maybe use the XML data response to generate this content as part of the uploading to Glacier process.

Example: Glacier Archive Record

Resource Cache

Resources that have been retrieved from Glacier must be put somewhere so that the DAP (and possibly other) services can utilize them. I call this a cache because I think it's implied that the volume of things in a Glacier vault may be much larger than any reasonable machine running this service may be able to hold. Therefore:

  1. The retrieved Glacier resources must be cached.
  2. Cached resources should be retrievable by utilizing their resourceId.
  3. The cache must have a size limit.
  4. There must be software that manages the cache contents by applying a metric to determine which resources should be purged from the cache and when.

Implementation Issues

All access to Glacier must be authenticated and this argues for using one of Amazon's APIs instead of 'direct' access with HTTP GET, PUT, etc. Amazon provides Java and Python APIs (as well as some others) but notably not C/C++.