REAP Cataloging and Searching

Summary

Use Cases

  1. Add information about a data set to the catalog
  2. Search the catalog (needs significant work)
  3. Specific data sets for the REAP Ocean Use Case (empty)

Definitions

Background

For systems like the ones which use DAP, the biggest data location problems are getting valid metadata that is sufficiently uniform and making a smooth transition from the initial process of locating a data set to selecting individual parts of it.

The problem of heterogeneous configuration

The first problem is really the problem of finding data sets in the vernacular sense of 'finding a data set', while the second is the problem of taking a number of ad hoc, heterogeneous storage and organizational configurations and mapping them onto a (mostly) uniform access mechanism. While both problems present a significant challenge, the second can be effectively addressed by aggregating the discrete elements of the actual data set's configuration (e.g., 20,000 images in a file system where different years and months are each in nested directories) into a single logical entity, so that the different discrete parts are accessed in one operation.

An example will make this clearer: Imagine that the collection of 20,000 images stores its data in directories named 1999, 2000, ..., that within each of those there are sub-directories named 01, 02, ..., 12, and that within those there are files named following the pattern YYYYMMDDD, where DDD is the day number. Let's also assume that this data set is served by Hyrax and thus each of these files can be accessed via its own URL. If a person wants to read data from the fourteen files covering Jan 25th to Feb 7th, they have to know this structure and apply it to the data set to figure out which URLs to use with their client, then get the data from those URLs and probably assemble it into a three-dimensional data structure. If all collections were organized like this, clients could be built to perform those operations and the problem would be solved. But the variability among data set storage patterns is actually very high: a handful of data sets are stored as described by this example, but most are stored in other ways, and there are enough of those 'other ways', with new ones emerging every day, to make customizing clients impracticable.

A solution is to aggregate the different URLs into a single three-dimensional data object and provide a data server that can operate on it without revealing its true composition. In this scenario, the person who wants data from Jan 25th to Feb 7th asks for it by accessing the data set using the uniform data access operations supported by every data server within the system. For 100 different data sets made up of images taken over time, each can be represented as a single three-dimensional data set and each can be accessed using the same operators (and thus operations), even though they are actually stored using 100 different configurations of files and databases. This is how the process of aggregation can be used to solve the data location problems that arise from heterogeneous configurations of data sets.
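As a sketch of what this uniform access looks like in DAP terms, the fourteen-day slice from the example above could be requested with a single constraint expression against the aggregated data set; the host, path, variable name, and index ranges here are all hypothetical:

  http://example.opendap.org/opendap/image_collection.ncml.dods?sst[24:1:37][0:479][0:639]

The client sees one three-dimensional variable indexed by time, latitude, and longitude; the server maps time indexes 24 through 37 back onto the fourteen underlying files.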

The THREDDS Data Server (TDS) supports both DAP and aggregation using NcML, and clearly shows that this approach works well in a wide variety of cases.
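For illustration, a minimal NcML aggregation of the image collection described above might look like the following; the location, suffix, and variable name are assumptions made for the sketch, not part of any actual configuration:

  <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
    <!-- present the whole collection as one variable with a new time dimension -->
    <aggregation dimName="time" type="joinNew">
      <variableAgg name="sst"/>
      <!-- walk the nested YYYY/MM directories; each file's time coordinate
           is parsed from its YYYYMMDDD file name -->
      <scan location="/data/images/" suffix=".nc" subdirs="true"
            dateFormatMark="#yyyyMMDDD"/>
    </aggregation>
  </netcdf>

With this in place, the fourteen files from Jan 25th to Feb 7th become fourteen adjacent slices along the time dimension of a single logical data set.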

The problem of finding data sets

This is the problem of matching a group of data sets to a schema that uniformly describes their variability. The entries that conform to this schema (aka records) cover both taxonomies and values for sets of predefined parameters. The problem here is one of heterogeneity, too. Clients are optimized to search for information specific to a fixed set of problem areas (e.g., ocean data from satellites; biological ocean data from fixed locations) and, because of this, are built to work with specific search information. Each data set is described by a record and the collection of records is stored in a database that can be searched. There's nothing particularly hard about this, and it works well with one caveat: the form and content of the records tend to be very specific to a problem domain. Of course, information about multiple domains can be included in the same record; it's just as easy for a database to hold the expanded records as the focused ones.

However, the problem lies in the making of those records. If the information they contain is very tightly focused so that only one problem domain is described, they are easiest to write but useless when searching within other domains. If the information is universal, the records are essentially impossible to write. The middle ground is always a compromise between breadth of coverage and manageability. In systems built using DAP servers, there's never a requirement to add specific metadata before starting to serve data, so the caliber of information useful for searching is highly variable. We chose this limitation because it was the best way to get the most data served (and because there are many cases where network access to data is desirable within a group where many metadata parameters are already well known).
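To make 'record' concrete: in EML, the record type used by Kepler, the part of a record that supports geo-spatial and temporal search is the coverage element. The structure below follows the EML 2.x schema; the values are made up:

  <coverage>
    <geographicCoverage>
      <geographicDescription>Northeast Pacific study region</geographicDescription>
      <boundingCoordinates>
        <westBoundingCoordinate>-130.0</westBoundingCoordinate>
        <eastBoundingCoordinate>-120.0</eastBoundingCoordinate>
        <northBoundingCoordinate>48.0</northBoundingCoordinate>
        <southBoundingCoordinate>40.0</southBoundingCoordinate>
      </boundingCoordinates>
    </geographicCoverage>
    <temporalCoverage>
      <rangeOfDates>
        <beginDate><calendarDate>1999-01-25</calendarDate></beginDate>
        <endDate><calendarDate>1999-02-07</calendarDate></endDate>
      </rangeOfDates>
    </temporalCoverage>
  </coverage>

A fragment like this is meaningful to any client that searches by space and time, which makes it a natural candidate for one of the generic 'building blocks' discussed below.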

The problem of metadata development needs to be separated from the production of the data themselves because, as data age, their potential audience tends to widen. At first only those most closely connected to the data are interested in them, but over time interest often broadens until people who initially know very little about the data become potential users. As this widening of scope takes place, the need for more general metadata increases. So too does the variety of clients which might be used to operate on the data. In other words, as the pool of users widens, so do the kinds of uses, and as that happens, the problem domains where the data may be used widen as well.

The challenge for a good searching system is to accommodate these changes over time. The system must support building additional metadata into the network presentation of data as needs evolve. To do this effectively, the system needs to build the different types of records from more generic 'building blocks' that can be shared among the record types used by different types of clients.
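One way to realize these building blocks (and the motivation for deliverable 5 below) is to embed well-formed XML fragments directly in a data set's DAP attributes using the OtherXML attribute type, so that a fragment like the EML coverage element above travels with the data set itself. A hedged sketch of how such a fragment might appear in a DDX:

  <Attribute name="coverage" type="OtherXML">
    <coverage>
      <temporalCoverage>
        <rangeOfDates>
          <beginDate><calendarDate>1999-01-25</calendarDate></beginDate>
          <endDate><calendarDate>1999-02-07</calendarDate></endDate>
        </rangeOfDates>
      </temporalCoverage>
    </coverage>
  </Attribute>

A record-generating tool (deliverable 6) can then pull these fragments out with XSLT and assemble them into a complete EML record, or into some other record type, without the fragments being tied to any one client.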

The system described here, based on the AIS (Ancillary Information Service) and on aggregation, is designed to solve both of these problems.

Design

Deliverables

  1. The Guide: A manual for the DP (data provider) that provides information about which fragments to add to a data set to support a specific metadata record type. Initially this will support EML records that can be used in Kepler.
  2. A modification of Kepler so that it will search for geo-spatial data.
  3. The NcML-AIS (implemented as a module for the BES); a sketch of the kind of NcML involved follows this list.
  4. The NcML-Aggregator (implemented as a module for the BES, likely combined with the AIS).
  5. A modification to the DAP (implemented in libdap++) that provides a way to insert XML into the DAP variable attributes. (Done 3/2/2009)
  6. A tool that can produce a metadata record of type X for a given data set (DAP URL). Likely based on XSLT.
  7. A tool, bundled with the Guide, to test the acceptability of the metadata generated by the system (i.e., is the resulting EML document not only valid EML but will it work in the Kepler client?).
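For deliverable 3, a minimal sketch of the kind of NcML the BES module might accept, assuming it follows the Unidata NcML conventions; the data set URL points at OPeNDAP's public test server, and the attribute names and values are illustrative only:

  <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2"
          location="http://test.opendap.org/dap/data/nc/coads_climatology.nc">
    <!-- add a global attribute without touching the underlying file -->
    <attribute name="Conventions" value="CF-1.0"/>
    <variable name="SST">
      <!-- add or correct a variable-level attribute -->
      <attribute name="long_name" value="Sea Surface Temperature"/>
    </variable>
  </netcdf>

This is the AIS role in miniature: metadata can be added or corrected after the fact, by someone other than the data provider, which is what lets a data set's records grow as its audience widens.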

Period of use

This will initially be developed for use by the REAP project, which has a period of performance that ends in May 2010.

Some elements will become part of a production version of the BES (the NcML-AIS/Aggregator), while other parts are primarily proof-of-concept and will not be useful beyond the REAP project (the Guide sections regarding EML). Hopefully the XSLT will survive and prove useful in other applications.