REAP Cataloging and Searching

== Summary ==

The Kepler workflow client used in the REAP project can access data through DAP servers but has no way to find those servers. Data search systems built for DAP servers do not have a good track record, often because they fail to address two competing needs: working with a fluid (rapidly changing) pool of data servers and data sets, and honoring the basic requirement of DAP-based systems - that the impact on a data provider be absolutely minimal.

Because the impact of hosting a DAP server must be minimal, the documentation (i.e., metadata) that most search systems require is not required for data served using DAP. As a result, interfacing DAP servers to such systems is a daunting task involving a great deal of manual metadata entry. The effort is frustrated not only by the often baroque nature of metadata standards (e.g., FGDC) but also because the data sources being described move from place to place frequently, something that begs for automated discovery and cataloging - exactly the opposite of what hand-written metadata records provide.

Complicating this scenario is the nature of most search systems: they tend to be tailored to a specific client. Data providers who are not deterred by the complexity and frequent manual updating of metadata often are deterred when they see that the additional metadata will satisfy the needs of only one client. It is virtually impossible to get providers to write these records when each client demands differently formatted information. The usual remedy would be to adopt a standard and modify all clients to use it; however, standards with enough breadth to satisfy the semantic needs of a wide spectrum of clients are, as noted above, very complex.

Since it is unlikely that a 'magic' standard will appear anytime soon, or that clients will drop many of their metadata requirements for searching, we need a way to build a uniform set of metadata at the servers which different clients can then assemble according to their diverse needs. Some desired information may be missing from some servers, and superfluous information may be present at others, but each client can build its choice of metadata records from what is present, as the sketch below illustrates.
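
A minimal sketch of that "assemble what is present" idea, in Python. The field names and values here are invented for illustration; the point is only that a client selects the fields it cares about from whatever a server exposes, ignores extras, and tolerates gaps.

<syntaxhighlight lang="python">
def assemble_record(server_metadata, wanted_fields):
    """Build a client-specific record from whatever the server provides."""
    record, missing = {}, []
    for field in wanted_fields:
        if field in server_metadata:
            record[field] = server_metadata[field]
        else:
            missing.append(field)  # desired by the client but absent at this server
    return record, missing

# Hypothetical metadata harvested from one server.
server_metadata = {
    "title": "COADS climatology",
    "spatial_extent": "-90..90, -180..180",
    "provider_notes": "superfluous for this client",  # ignored below
}

record, missing = assemble_record(
    server_metadata, ["title", "spatial_extent", "time_coverage"])
print(record)   # the fields this client could use
print(missing)  # ['time_coverage'] - absent here; the client decides how to cope
</syntaxhighlight>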

The companion technologies of XML and XSLT, combined with a system for providing ancillary information about data sets served using DAP, suggest one solution to this problem.

The system described here uses Kepler as an example client. It builds metadata records using EML, where most of the really important metadata is actually a series of XML micro-documents described by ISO 19115 and other standards. The system uses a server-side solution (technology leveraged from other projects and thus more likely to be in use) to augment data sets with this information. The EML documents are built using XSLT: EML is not returned directly by the DAP servers, so the same geospatial information can be reused for other purposes (e.g., WCS 1.x). The EML is harvested by Metacat, which Kepler already knows how to use (with some caveats). Metacat does not know how to crawl DAP servers, but it can be fed URLs, so we may employ TPAC's crawler to feed Metacat with URLs or DDX/EML objects.
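
The transformation step of that pipeline might look like the following sketch, which fetches a DDX (the XML description of a DAP data set) and applies an XSLT stylesheet to produce an EML record. The dataset URL and the stylesheet name 'ddx2eml.xsl' are placeholder examples, not part of the project's actual deliverables.

<syntaxhighlight lang="python">
from urllib.request import urlopen
from lxml import etree

# Fetch the DDX for a data set. The server and dataset path here are
# hypothetical examples of a DAP server's .ddx response.
ddx_url = "http://test.opendap.org/dap/data/nc/coads_climatology.nc.ddx"
with urlopen(ddx_url) as response:
    ddx = etree.parse(response)

# Apply an XSLT stylesheet that maps DDX attributes (plus any ancillary
# ISO 19115 micro-documents) onto an EML record. 'ddx2eml.xsl' is a
# placeholder name for such a stylesheet.
transform = etree.XSLT(etree.parse("ddx2eml.xsl"))
eml_record = transform(ddx)

# The resulting EML document is what a crawler/harvester would hand to Metacat.
print(etree.tostring(eml_record, pretty_print=True).decode())
</syntaxhighlight>

Keeping the EML generation in a stylesheet rather than in the server means the same DDX output can be transformed into other formats for other clients, which is the design point the paragraph above makes about WCS 1.x.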

This solution is not ideal: data providers will still have to write metadata records initially, albeit smaller, more concise ones. However, automated crawling of servers and automated harvesting of the discovered URLs mean that data set and server movement can be accommodated far more effectively than with designs based on static documents.

One problem with this design is that data providers will need to know which collection of 'micro-documents' is needed to support one or more client systems. We could mitigate this risk by surveying clients to find out how diverse their needs really are - not in terms of formats but in terms of content. Current experience with XSLT and related technologies shows that information stored in XML can be transformed fairly easily, so supporting different textual formats is much less of a concern than differing content requirements.

== Use Cases ==

====[[Add information about a data set to the catalog]]====
====[[Search the catalog]]====
====[[Use a data set found using the search system]]====
====[[Specific data sets for the REAP Ocean Use]]====

== Definitions ==

== Background ==

== Deliverables ==

== Period of use ==