REAP Cataloging and Searching

Summary

The Kepler workflow client used in the REAP project can access data from DAP servers but has no way to find those servers. Data search systems built for DAP servers don't have a good track record, often because such systems fail to address two needs at once: working with a fluid (rapidly changing) pool of data servers and data sets, and honoring the basic requirement of DAP-based systems - that the impact on a data provider be absolutely minimal.

In order for the impact of hosting a DAP server to be minimal, the typical data documentation (i.e., metadata) required by most searching systems is not required for data served using DAP. As a result, interfacing DAP servers to such systems is a daunting task involving lots of manual metadata entry. This effort is frustrated not only by the often baroque nature of metadata standards (e.g., FGDC) but also because the data sources being described move from place to place frequently, something that begs for automated discovery and cataloging - exactly the opposite of what hand-written metadata records provide.

Complicating this typical scenario is the nature of most searching systems: they tend to be tailored to a specific client system. Data providers who were not deterred by the complexity and the frequent manual updating of metadata often are deterred when they see that the additional metadata will satisfy the needs of only one client. It is virtually impossible to get providers to write these metadata records, since they would have to write differently formatted information for each such client. This would normally be remedied by adopting a standard and then modifying all clients to use it. However, standards with enough breadth to satisfy the semantic needs of a wide spectrum of clients are, as noted above, very complex and thus hard to use.

Since it's unlikely that a 'magic' standard will appear anytime soon, or that clients will drop many of their metadata requirements for searching, we need a solution that builds a uniform set of metadata at the servers which different clients can then assemble according to their diverse needs. Some desired information may be missing from some servers, and superfluous information may be present at others, but each client can build its choice of metadata records using what is present.
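
To make that assembly step concrete - this is only a sketch, and every field name in it is hypothetical - a client could simply pull the fields its own record format needs out of whatever a server supplies, skipping anything that is absent:

 # Two clients assemble their own records from the same uniform,
 # server-side metadata. All field names here are invented for illustration.
 def assemble(record_fields, server_metadata):
     """Keep the requested fields that the server actually supplies."""
     return {f: server_metadata[f] for f in record_fields if f in server_metadata}

 # One server's metadata: it lacks a 'contact' field and carries a
 # 'platform' field that some clients will simply ignore.
 server_metadata = {
     "title": "Sea Surface Temperature, AVHRR",
     "north": 50.0, "south": 20.0, "east": -110.0, "west": -150.0,
     "platform": "NOAA-17",
 }

 # Each client asks only for the fields its own record format needs.
 kepler_record = assemble(["title", "north", "south", "east", "west"], server_metadata)
 sparse_record = assemble(["title", "contact"], server_metadata)  # 'contact' is absent, so it is skipped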

The companion technologies of XML and XSLT, combined with a system that provides ancillary information for data sets served using DAP, suggest one solution to this problem.
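
As a sketch of that idea (the file names dataset.ddx and ddx2eml.xsl are invented for illustration), a small Python program using lxml can apply an XSLT stylesheet to the XML description (DDX) of a DAP data set and produce a metadata record on demand, rather than storing hand-written records:

 # A minimal sketch of the XML + XSLT approach; both input files are
 # hypothetical stand-ins for a real DDX and a real stylesheet.
 from lxml import etree

 ddx = etree.parse("dataset.ddx")          # XML description of a DAP data set
 stylesheet = etree.parse("ddx2eml.xsl")   # rules mapping DDX content to EML
 to_eml = etree.XSLT(stylesheet)

 eml = to_eml(ddx)                         # the EML record, built on demand
 print(etree.tostring(eml, pretty_print=True).decode())

Because the transformation runs whenever it is needed, a different stylesheet can build a differently formatted record from the same server-side information.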

The system described here will use Kepler as an example client. It will build metadata records using EML, where most of the really important metadata is actually a series of XML micro-documents described by ISO 19115. The system will use a server-side solution (technology leveraged from other projects and thus more likely to remain in wide use over time) to augment data sets with this information. The EML documents will be built using XSLT - EML won't be returned directly by the DAP servers, and the same geospatial information can be used for other things (e.g., WCS 1.x). The EML will be scavenged by Metacat, which Kepler already knows how to use (with some caveats). Metacat doesn't know how to crawl DAP servers, but it can be fed URLs, and we may employ TPAC's crawler to feed Metacat with URLs or DDX/EML objects. TBD.
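
The pipeline might look roughly like the following sketch. Both ends are placeholders: send_to_metacat() is hypothetical because the actual ingestion route into Metacat is left TBD above, the crawler is reduced to a hand-supplied URL list, and the '.ddx' suffix assumes a server (such as Hyrax) that returns a data set's XML description that way:

 # End-to-end sketch: fetch each data set's DDX, build EML with the same
 # (hypothetical) stylesheet as above, and hand the result off for cataloging.
 import urllib.request
 from lxml import etree

 to_eml = etree.XSLT(etree.parse("ddx2eml.xsl"))  # hypothetical stylesheet

 def send_to_metacat(eml_doc):
     # Placeholder: the real Metacat ingestion mechanism is TBD.
     print(etree.tostring(eml_doc, pretty_print=True).decode())

 def harvest(dataset_urls):
     """Fetch each DDX, transform it to EML, and pass it to Metacat."""
     for url in dataset_urls:
         with urllib.request.urlopen(url + ".ddx") as response:
             ddx = etree.parse(response)
         send_to_metacat(to_eml(ddx))

 # The URL list stands in for whatever crawler eventually feeds the system.
 harvest(["http://example.org/opendap/data/sst.nc"])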

Use Cases

Add information about a data set to the catalog

Search the catalog

Use a data set found using the search system

Definitions

Background

Deliverables

Period of use