REAP Cataloging and Searching

Summary

The Kepler workflow client used in the REAP project can access data using DAP servers but has no way to find those servers. Data search systems built for DAP servers don't have a very good track record, often because such systems fail to address two needs at once: working with a fluid (rapidly changing) pool of data servers and data sets, and meeting the basic requirement of DAP-based systems - that the impact on a data provider be absolutely minimal.

In order for the impact of hosting a DAP server to be minimal, the typical data documentation (i.e., metadata) required by most searching systems is not required for data served using DAP. As a result, interfacing DAP servers to such systems is a daunting task involving a great deal of manual metadata entry. This effort is frustrated not only by the often baroque nature of metadata standards (e.g., FGDC) but also because the data sources being described frequently move from place to place, something that begs for automated discovery and cataloging - exactly the opposite of what hand-written metadata records provide.

Complicating this typical scenario is the nature of most searching systems: they tend to be tailored to a specific client system. Data providers who were not deterred by the complexity and the frequent manual updates of metadata often are deterred when they see that the additional metadata will satisfy the needs of only one client. It is virtually impossible to get data providers to write these metadata records, since they would have to write differently formatted information for each such client. This would normally be remedied by adopting a standard and then modifying all clients to use it. However, standards with enough breadth to satisfy the semantic needs of a wide spectrum of clients are, as previously described, very complex.

Since it's unlikely that a 'magic' standard will appear anytime soon or that clients will drop many of their requirements for metadata vis-a-vis searching, we need a solution that will provide a way to build a uniform set of metadata at the servers which can then be assembled by different clients according to their diverse needs. It might be that some desired information is missing from some servers or some superfluous information is present at others, but the clients can build their choice of metadata records using what is present.

The companion technologies of XML and XSLT combined with a system to provide ancillary information for data sets served using DAP suggest one solution to this problem.

The system described here will use Kepler as an example client. It will build metadata records using EML, where most of the really important metadata is actually a series of XML micro documents described by ISO 19115 and other standards. The system will use a server-side solution (technology leveraged from other projects and thus more likely to be in use) to augment data sets with this information. The EML documents will be built using XSLT; EML won't be returned directly by the DAP servers, and the same geo-spatial information can be reused for other purposes (e.g., WCS 1.x). The EML will be harvested by Metacat, which Kepler already knows how to use (with some caveats). Metacat doesn't know how to crawl DAP servers, but it can be fed URLs, and we may employ TPAC's crawler to feed Metacat with URLs or DDX/EML objects.
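The following is a minimal sketch of the DDX-to-EML step, not the project's actual tooling: it fetches a dataset's DDX from a DAP server and applies an XSLT stylesheet that assembles the embedded ISO 19115 micro documents into an EML record for Metacat to harvest. The URL and the stylesheet file name are hypothetical placeholders.

    # Sketch only: the DDX URL and stylesheet name below are invented for illustration.
    from urllib.request import urlopen

    from lxml import etree

    DDX_URL = "http://example.org/opendap/sst/2000/02/2000037.hdf.ddx"  # hypothetical
    STYLESHEET = "ddx_to_eml.xsl"                                       # hypothetical

    ddx = etree.parse(urlopen(DDX_URL))              # the DDX response is plain XML
    transform = etree.XSLT(etree.parse(STYLESHEET))  # stylesheet maps micro documents to EML
    eml = transform(ddx)                             # EML document ready for harvesting

    print(etree.tostring(eml, pretty_print=True).decode())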

This solution is not ideal. Data providers will still initially have to write metadata records, albeit smaller, more concise ones. However, automated crawling of servers and automated harvesting of the discovered URLs means that data set and server movement can be accommodated more effectively than with designs based on static documents.

A problem with this design is that data providers will need to know the collection of 'micro documents' needed to support one or more different client systems. We could mitigate this risk by surveying clients to find out how diverse their needs really are - not in terms of formats but in terms of content. We know from current experience with XSLT and related technologies that we can transform information stored in XML fairly easily, so supporting different textual formats is not nearly as much of a concern as differing content requirements.

Use Cases

Add information about a data set to the catalog

Search the catalog

Specific data sets for the REAP Ocean Use

Definitions

Background

For systems like the ones which use DAP, the biggest data location problems are getting valid metadata that is sufficiently uniform and making a smooth transition from the initial process of locating a data set to the selection of individual parts of a chosen data set.

The problem of heterogeneous configuration

The first problem is really the problem of finding data sets in the vernacular sense of locating a data set, while the second is the problem of taking a number of ad hoc, heterogeneous storage and organizational configurations and mapping them into a (mostly) uniform access mechanism. While both problems present a significant challenge, the second can be effectively addressed by aggregating the discrete elements of the actual data set's configuration (e.g., 20,000 images in a file system where different years and months are each in nested directories) into a single logical entity, so that the different discrete parts are accessed in one operation.

An example will make this clearer. Imagine that the collection of 20,000 images stores data in directories named 1999, 2000, ...; within each of those there are sub-directories named 01, 02, ..., 12; and within those there are files named following the pattern YYYYMMDDD, where DDD is the day number. Let's also assume that this data set is served by Hyrax, so that each of these files can be accessed by a single URL. If a person wants to read data from the fourteen files covering Jan 25th to Feb 7th, they have to know this structure and apply it to the data set to figure out which URLs to use with their client, then get the data from those URLs and probably assemble it into a three-dimensional data structure. If all collections were organized like this, clients could be built to perform those operations and the problem would be solved. But the variability among data set storage patterns is actually very high: a handful of data sets are stored as described by this example, but most are stored in other ways, and there are enough of those 'other ways', with new ones emerging every day, to make customizing clients impracticable.
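As a sketch of what a client must do today under the layout in this example, the following turns a date range into the set of per-file URLs. The server base URL and the ".hdf" extension are assumptions made for illustration; only the YYYY/MM/YYYYMMDDD layout comes from the example above.

    # Sketch only: base URL and file extension are hypothetical.
    from datetime import date, timedelta

    BASE = "http://example.org/opendap/images"  # hypothetical server and collection

    def urls_for_range(start: date, end: date):
        """Yield one URL per daily file between start and end, inclusive."""
        day = start
        while day <= end:
            name = f"{day.year:04d}{day.month:02d}{day.timetuple().tm_yday:03d}"  # YYYYMMDDD
            yield f"{BASE}/{day.year:04d}/{day.month:02d}/{name}.hdf"
            day += timedelta(days=1)

    # The fourteen files from Jan 25th to Feb 7th in the example:
    for url in urls_for_range(date(2000, 1, 25), date(2000, 2, 7)):
        print(url)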

A solution is to aggregate the different URLs into a single three-dimensional data object and provide a data server that can operate on it without revealing its true composition. In this scenario, the person who wants data from Jan 25th to Feb 7th asks for it by accessing the data set using the uniform data access operations supported by every data server within the system. For 100 different data sets made up of images taken over time, each can be represented as a single three-dimensional data set and each can be accessed using the same operators (and thus operations), even though they are actually stored using 100 different configurations of files and databases. This is how the process of aggregation can be used to solve the data location problems that arise from heterogeneous configurations of data sets.
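A minimal sketch of what access looks like once the aggregation is in place, assuming a hypothetical aggregated dataset URL and a hypothetical variable name 'sst': the same Jan 25th to Feb 7th request becomes a single slice of one logical data set instead of fourteen separate file URLs. Any generic DAP client would do; pydap is used here only as an example.

    # Sketch only: URL and variable name are invented for illustration.
    from pydap.client import open_url

    AGG_URL = "http://example.org/opendap/images_aggregated"  # hypothetical

    dataset = open_url(AGG_URL)
    sst = dataset["sst"]               # assumed shape: (time, lat, lon)
    jan25_to_feb7 = sst[24:38, :, :]   # time indices 24..37 correspond to days 25..38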

The problem of finding data sets

This is the problem of matching a group of data sets to a schema that uniformly describes their variability. The entries that conform to this schema (aka records) cover both taxonomies and values for sets of predefined parameters. The problem here is one of heterogeneity too. Clients are optimized to search for information specific to a fixed set of problem areas (e.g., ocean data from satellites; biological ocean data from fixed locations) and, because of this, are built to work with specific search information. Each data set is described by a record and the collection of records is stored in a database that can be searched. There's nothing particularly hard about this, and it works well with one caveat: the form and content of the records tends to be very specific to a problem domain.

Of course, information about multiple domains can be included in the same record; it's just as easy for a database to hold the expanded records as the focused ones. However, the problem lies in the making of those records. If the information they contain is very tightly focused, so that only one problem domain is described, the records are easiest to write but useless when searching within other domains. If the information is universal, the records are essentially impossible to write. The middle ground is always a compromise between breadth of coverage and manageability. In systems built using DAP servers, there's never a requirement to add specific metadata to start serving data, so the caliber of information useful for searching is highly variable. We chose this limitation because it was the way to get the most data served (and because there are many cases where network access to data is desirable within a group where many metadata parameters are well known).
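A small illustration of this tradeoff, with invented field names: a record written in one problem domain's vocabulary is easy to produce but is invisible to searches framed in another domain's vocabulary.

    # Sketch only: the records and field names are hypothetical.
    satellite_sst_record = {          # focused on remote-sensing vocabulary
        "sensor": "AVHRR",
        "parameter": "sea_surface_temperature",
        "spatial_coverage": {"north": 50.0, "south": 20.0, "east": -110.0, "west": -140.0},
    }

    mooring_biology_record = {        # focused on in-situ biology vocabulary
        "station_id": "M1",
        "taxon": "Calanus pacificus",
        "depth_m": 10,
    }

    def search(records, **criteria):
        """Return records whose fields match every criterion; the rest drop out."""
        return [r for r in records if all(r.get(k) == v for k, v in criteria.items())]

    # A biology-oriented query misses the satellite record (and vice versa), even if
    # both data sets cover the same region and time period.
    print(search([satellite_sst_record, mooring_biology_record], taxon="Calanus pacificus"))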

The problem of metadata development needs to be separated from the production of the data themselves because, as the data age, their potential audience tends to widen. At first only those most closely connected to the data are interested in them, but over time interest often broadens until people who initially know very little about the data become potential users. As this widening of scope takes place, the need for more general metadata increases. So too does the variety of clients which might be used to operate on the data. In other words, as the pool of users widens, so do the kinds of uses, and as that happens the range of problem domains where the data may be used widens as well.

The challenge for a good searching system is to accommodate these changes over time. The system must support building additional metadata into the network presentation of data over time. To do this effectively, the system needs to build different types of records using more generic 'building blocks' which can be shared between several different types of records used by different types of clients.
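A sketch of the 'building blocks' idea, with invented field names: the same shared micro documents (here a spatial extent and a temporal extent) are assembled into differently shaped records for different clients, so a provider writes each block once instead of one complete record per client format.

    # Sketch only: record shapes and field names are hypothetical.
    spatial_block = {"west": -140.0, "east": -110.0, "south": 20.0, "north": 50.0}
    temporal_block = {"begin": "2000-01-25", "end": "2000-02-07"}

    def as_eml_like(title, spatial, temporal):
        """Record shaped for an EML-consuming client such as Metacat/Kepler."""
        return {
            "title": title,
            "coverage": {
                "geographicCoverage": {"boundingCoordinates": spatial},
                "temporalCoverage": {"rangeOfDates": temporal},
            },
        }

    def as_wcs_like(identifier, spatial, temporal):
        """Summary shaped for a WCS-style client, reusing the very same blocks."""
        return {
            "CoverageId": identifier,
            "BoundingBox": [spatial["west"], spatial["south"], spatial["east"], spatial["north"]],
            "TimePeriod": (temporal["begin"], temporal["end"]),
        }

    eml_record = as_eml_like("AVHRR SST images", spatial_block, temporal_block)
    wcs_record = as_wcs_like("avhrr_sst", spatial_block, temporal_block)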

The AIS- and Aggregation-based system described here is designed to solve both of these problems.

Deliverables

Period of use