Difference between revisions of "Hyrax - THREDDS Configuration"

From OPeNDAP Documentation
@TODO: '''Revise this page to improve clarity and usability'''
+
== Overview ==
 +
Hyrax now uses its own implementation of the THREDDS catalog services and supports most of the THREDDS catalog service stack. The implementation relies on two DispatchHandlers in the OLFS and utilizes XSLT to provide HTML versions (presentation views) for human consumption.
  
This release of Hyrax supports the complete THREDDS catalog service stack. THREDDS catalogs are controlled by a ''catalog.xml'' file located in the (persistent) content directory for the OLFS (More on that here). Rather than provide an exhaustive explanation of the THREDDS catalog functionality and configuration I will appeal to the existing documents provided by our fine colleagues at [http://www.unidata.ucar.edu/projects/THREDDS/ UNIDATA]:
+
# Dynamic THREDDS catalogs for holdings provided by the BES are provided by the opendap.bes.BESThreddsDispatchHandler.  
 
+
# Static THREDDS catalogs are provided by the opendap.threddsHandler.StaticCatalogDispatch. The static catalogs allow catalog "graphs" to be decoupled from the filesystem "graph" of the data holdings, thus allowing data providers the ability to present and organize data collections independently of how they are organized in the underlying filesystem.
* [http://www.unidata.ucar.edu/projects/THREDDS/tech/index.html#catalog Catalog Basics]
 
* [http://www.unidata.ucar.edu/projects/THREDDS/tech/catalog/InvCatalogSpec.html Catalog Specification]
 
* [http://www.unidata.ucar.edu/projects/THREDDS/tech/catalog/Primer.html Catalog Primer]
 
* [http://www.unidata.ucar.edu/projects/THREDDS/tech/cataloggen/devel/datasetScanElement.html ''datasetScan'' Element]
 
* [http://www.unidata.ucar.edu/projects/THREDDS/tech/catalog/InvCatalogSpec.html#dataset ''dataset'' Element]
 
 
 
Did you read all that? Excellent!
 
  
 +
Static THREDDS catalogs are "rooted" in a master catalog file, ''catalog.xml'',  located in the (persistent) content directory for the OLFS (Typically $CATALINA_HOME/content/opendap). The default ''catalog.xml'' that comes with Hyrax contains a simple catalogRef element that points to the dynamic THREDDS catalogs generated from the BES holdings. The default catalog example also contains a (commented out) datasetScan element that provides (if enabled) a simple demonstration of the datasetScan capabilities. Additional catalog components may be added to the ''catalog.xml'' file to build (potentially large) static catalogs.
  
----
+
* THREDDS datasetScan elements are now fully supported and can be used as a tool for altering the catalog presentation of any part of the BES catalog. These alterations include (but are not limited to) renaming, auto proxy generation, filtering, and metadata injection.
==Configuration Instructions==
 
 
 
*The THREDDS catalog configuration is stored in the file '''$CATALINA_HOME/content/opendap/catalog.xml'''<br /><br />
 
*Each item that appears in the top level directory of the BES (BES.Catalog.catalog.RootDirectory and BES.Data.RootDirectory)  should have a corresponding element as a child of the top level ''<catalog>'' element in the '''catalog.xml''' file. Collections (aka directories) are represented by a ''<datasetScan>'' element. Granules (files) are represented as ''<dataset>'' elements. It is not possible to map the top level directory of the BES (BES.Catalog.catalog.RootDirectory and BES.Data.RootDirectory) to a single <datasetScan> element in the THREDDS catalog.<br /><br />
 
 
 
=== Representing Collections (directories): The '''''<datasetScan>'''''  element ===
 
 
 
# For each collection that appears in the top level directory of the BES (BES.Catalog.catalog.RootDirectory and BES.Data.RootDirectory)  you '''SHOULD''' create a corresponding ''<datasetScan>'' in the '''catalog.xml''' file.<br /><br />''The THREDDS catalog views will NOT include top level collections for which this is not done!'' <br /><br />
 
# The ''serviceName'' attribute in the <''datasetScan''> element must be set to "''OPeNDAP-Hyrax''" corresponding to the ''<service>'' element at the top of the file whose name element has the same value. <br /><br />
 
#* '''''serviceName="OPeNDAP-Hyrax"'''''<br /><br />
 
# Each ''<datasetScan>'' element has three crucial attributes that must be set to correspond to the collection that is meant to be traversed: '''''location''''', '''''path''''', and '''''name'''''. These attributes should be set as follows:<br /><br />
 
#* '''''location="/bes/collectionName"''''' <br />&nbsp;&nbsp;&nbsp;&nbsp;Where collectionName is the name of the top level collection in the BES that is to be traversed. The prefix ''/bes/'' is required.<br /><br />
 
#* '''''path="collectionName"'''''<br />&nbsp;&nbsp;&nbsp;&nbsp;Where collectionName is the same value as used in the ''location'' attribute. The collectionName '''MUST NOT''' start with a "/" character.<br /><br />
 
#* '''''name="collectionName"'''''<br /> &nbsp;&nbsp;&nbsp;&nbsp;Where collectionName is the same value as used in the ''path'' attribute. The ''name'' attribute is used for presentation and could be set differently from the '''''collectionName''''', but doing so will most likely lead to confusion for people navigating the OPeNDAP contents.html view of the server and the THREDDS catalog view of the server.<br /><br />
 
# In each <''datasetScan''> element that you create you '''MUST''' include the following element: <''crawlableDatasetImpl className="opendap.bes.BESCrawlableDataset" ''/> This is the ''CrawlableDataset'' implementation that allows the THREDDS implementation to work with the BES.<br /><br />
 
# You should apply a filter to the data that coincides with the value of the "''BES.Catalog.catalog.TypeMatch''" for the data types being served. I suggest that you make the filter expose ALL of the data types served by the BES. See the [http://www.unidata.ucar.edu/projects/THREDDS/tech/cataloggen/devel/datasetScanElement.html THREDDS pages on the DatasetScan Element] for filter details. The point of this is to remove files from the catalog view that are NOT being served as OPeNDAP data, for example README files.<br /><br />
 
# The ''<datasetScan>'' is allowed to contain a THREDDS ''<metadata>'' element. The details of its use can be found [http://www.unidata.ucar.edu/projects/THREDDS/tech/cataloggen/devel/datasetScanElement.html HERE]<br /><br />
 
 
 
=== Representing Granules (files): The '''''<dataset>''''' element ===
 
# For each granule (file)  that appears in the top level directory of the BES (BES.Catalog.catalog.RootDirectory and BES.Data.RootDirectory)  you '''SHOULD''' create a corresponding ''<dataset>'' in the '''catalog.xml''' file.<br /><br />''The THREDDS catalog views will NOT include top level granules for which this is not done!'' <br /><br />
 
# Each ''<dataset>'' element has three crucial attributes that must be set to correspond to the granule that is meant to be included: '''''name''''', '''''urlPath''''', and '''''ID'''''. These attributes should be set as follows:<br /><br />
 
#* '''''name="granuleName"''''' <br />&nbsp;&nbsp;&nbsp;&nbsp;Where granuleName is the name of the top level granule (file) in the BES that is to be included in the catalog.<br /><br />
 
#* '''''ID="granuleName"''''' <br />&nbsp;&nbsp;&nbsp;&nbsp;Where granuleName is the name of the top level granule (file) in the BES that is to be included in the catalog.<br /><br />
 
#* '''''urlPath="granuleName"''''' <br />&nbsp;&nbsp;&nbsp;&nbsp;Where granuleName is the name of the top level granule (file) in the BES that is to be included in the catalog.<br /><br />
 
# In each <''dataset''> element that you create you '''MUST''' include the following element:<br /><br />&nbsp;&nbsp;&nbsp;&nbsp;'''''<serviceName>OPeNDAP-Hyrax</serviceName>''''' <br /><br />Where the text value of the ''<serviceName>'' element is equal to the value of the ''name'' attribute of the corresponding ''<service>'' element at the top of the document.<br /><br />
 
# You can find more about the THREDDS ''<dataset>'' element [http://www.unidata.ucar.edu/projects/THREDDS/tech/catalog/InvCatalogSpec.html#dataset HERE]
 
  
 +
[[THREDDS_using_XSLT|More details about the handlers, their configuration options, and other information can be found here.]]
  
  
 +
=== THREDDS Catalog Documentation ===
  
 +
Rather than provide an exhaustive explanation of the THREDDS catalog functionality and configuration I will appeal to the existing documents provided by our fine colleagues at [http://www.unidata.ucar.edu/projects/THREDDS/ UNIDATA]:
  
 +
* [http://www.unidata.ucar.edu/projects/THREDDS/tech/TDS.html#Catalogs Catalog Basics]
 +
* [http://www.unidata.ucar.edu/projects/THREDDS/tech/catalog/InvCatalogSpec.html Client Catalog Specification] Describes what THREDDS catalog components should be produced by servers.
 +
* [http://www.unidata.ucar.edu/software/thredds/current/tds/tutorial/CatalogPrimer.html Catalog Primer]
 +
* [http://www.unidata.ucar.edu/software/thredds/v4.6/tds/catalog/InvCatalogServerSpec.html#datasetScan_Element Server catalog specification] can help you understand the rules for constructing proper datasetScan elements in your catalog.
 +
* [http://www.unidata.ucar.edu/projects/THREDDS/tech/catalog/InvCatalogSpec.html#dataset ''dataset'' Element]
 +
* [http://www.unidata.ucar.edu/software/thredds/v4.6/tds/reference/DatasetScan.html ''datasetScan'' configuration] (applies to Hyrax as well).
 +
  
 
----
 
----
  
=Reinitializing THREDDS=
+
==Configuration Instructions==
The THREDDS catalog is read when Tomcat is started. Hyrax will check the last modified date of the catalog.xml file prior to responding to a THREDDS catalog request. If the last modified date has changed since Tomcat started, then Hyrax will reload all of the THREDDS catalog information.
 
 
 
So if you make changes to ANY of the THREDDS catalog files in the ''$CATALINA_HOME/content/opendap'' directory tree, then there are two ways for you to get Hyrax to update:
 
 
 
# Change the last modified date of the file: ''$CATALINA_HOME/content/opendap/catalog.xml''<br /><br /> This can be accomplished with the Unix "touch" command: <code>touch $CATALINA_HOME/content/opendap/catalog.xml</code> This will cause Hyrax to reload all of the THREDDS catalogs the next time a THREDDS catalog request is made. (You might want to make this request yourself if you have a large THREDDS catalog configuration, so that a user doesn't have to wait for a response while Hyrax is working.) <br /><br />OR you could:<br /><br />
 
#Restart Tomcat.
 
 
 
 
 
-----
 
 
 
=THREDDS Catalog Examples=
 
 
 
===Example 1===
 
Here is an example ''catalog.xml'' file for a Hyrax installation in which the top level of the BES shows only ONE collection called "''data''":
 
<pre>&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;
 
&lt;catalog name=&quot;Hyrax Test Catalog&quot;
 
        xmlns=&quot;http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0&quot;
 
        xmlns:xlink=&quot;http://www.w3.org/1999/xlink&quot;&gt;
 
 
 
    &lt;!-- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - --&gt;
 
 
 
    &lt;service name=&quot;OPeNDAP-Hyrax&quot; serviceType=&quot;OPeNDAP&quot; base=&quot;/opendap/&quot;/&gt;
 
 
 
    &lt;datasetScan location=&quot;/bes/data&quot; path=&quot;data&quot; name=&quot;data&quot; serviceName=&quot;OPeNDAP-Hyrax&quot;&gt;
 
 
 
      &lt;crawlableDatasetImpl className=&quot;opendap.bes.BESCrawlableDataset&quot; /&gt;
 
  
          &lt;filter&gt;
+
* The current default  ([[Hyrax_-_OLFS_Configuration#olfs.xml_Configuration_File |olfs.xml]]) file comes with THREDDS configured correctly.
              &lt;exclude wildcard=&quot;.*&quot; atomic=&quot;true&quot; collection=&quot;true&quot; /&gt;
+
* The THREDDS master catalog is stored in the file '''$CATALINA_HOME/content/opendap/catalog.xml'''. It can be edited to provide additional static catalog access.
              &lt;include wildcard=&quot;*&quot; /&gt;
 
          &lt;/filter&gt;
 
          &lt;addDatasetSize /&gt;
 
  
          &lt;metadata inherited=&quot;true&quot;&gt;
+
== datasetScan Support ==
              &lt;serviceName&gt;OPeNDAP-Hyrax&lt;/serviceName&gt;
+
The '''datasetScan''' element is a powerful tool that can be used to sculpt the catalog's presentation of the BES catalog content. The Hyrax implementation has a couple of key points that need to be considered when developing an instance of the '''datasetScan''' element in your THREDDS catalog.
              &lt;authority&gt;opendap.org&lt;/authority&gt;
 
              &lt;dataType&gt;Random&lt;/dataType&gt;
 
          &lt;/metadata&gt;
 
    &lt;/datasetScan&gt;
 
  
    &lt;!-- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - --&gt;
+
=== location attribute===
&lt;/catalog&gt;
+
The '''location''' attribute specifies the place in the BES catalog graph that the '''datasetScan''' will be rooted. This value ''must be'' expressed relative to the BES catalog root (BES.Catalog.catalog.RootDirectory) and not in terms of the underlying BES host file system.
</pre>
+
;Example
 +
:If ''BES.Catalog.catalog.RootDirectory=/usr/share/hyrax'' and the data directory to which you wish to apply the '''datasetScan''' is (in filesystem terms) located at ''/usr/share/hyrax/data/nc'' then the associated '''datasetScan''' element's '''location''' attribute would have a value of ''/data/nc''
  
===Example 2===
+
<div  style="padding-left: 20px;width: 95%;">
 +
<source lang="xml">
 +
<datasetScan name="DatasetScanExample" path="hyrax" location="/data/nc" />
 +
</source>
 +
</div>
  
Here is an example ''catalog.xml'' file for a Hyrax installation in which the top level of the BES contains three collections called "''nc''", "''hdf''", and "''ff''":
+
=== name attribute===
<pre>
+
The '''name''' attribute specifies the name that will be used in the presentation (HTML) view of the catalog containing the '''datasetScan'''.
&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;
 
&lt;catalog name=&quot;Hyrax Test Catalog&quot;
 
        xmlns=&quot;http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0&quot;
 
        xmlns:xlink=&quot;http://www.w3.org/1999/xlink&quot;&gt;
 
  
    &lt;service name=&quot;OPeNDAP-Hyrax&quot; serviceType=&quot;OPeNDAP&quot; base=&quot;/opendap/&quot;/&gt;
+
=== path attribute===
 +
The '''path''' attribute specifies the place in the THREDDS catalog graph that the '''datasetScan''' will be rooted. In effect it is a relative URL for the service. If '''path''' begins with a "/" then it is an absolute path - rooted at the server and port of the web server. The values of the '''path''' attribute should NEVER contain "catalog.xml" or "catalog.html". The service will create these endpoints dynamically.
  
    &lt;!-- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - --&gt;
+
; Relative path example
 +
: Consider a catalog accessed with the URL: http://localhost:8080/opendap/thredds/v27/Landsat/catalog.xml and that contains this '''datasetScan''' element:
  
    &lt;datasetScan location=&quot;/bes/nc&quot; path=&quot;nc&quot; name=&quot;nc&quot; serviceName=&quot;OPeNDAP-Hyrax&quot;&gt;
+
<div  style="padding-left: 20px;width: 95%;">
 +
<source lang="xml">
<datasetScan name="DatasetScanExample" path="hyrax" location="/data/nc" />
 +
</source>
 +
</div>
  
      &lt;crawlableDatasetImpl className=&quot;opendap.bes.BESCrawlableDataset&quot; /&gt;
+
:In the client catalog the '''datasetScan''' becomes this '''catalogRef''' element:
  
          &lt;filter&gt;
+
<div  style="padding-left: 20px;width: 95%;">
              &lt;exclude wildcard=&quot;.*&quot; atomic=&quot;true&quot; collection=&quot;true&quot; /&gt;
+
<source lang="xml">
              &lt;include wildcard=&quot;*.nc&quot; /&gt;
+
<thredds:catalogRef
          &lt;/filter&gt;
+
    name="DatasetScanExample"
          &lt;addDatasetSize /&gt;
+
    xlink:title="DatasetScanExample"
 +
    xlink:href="hyrax/catalog.xml"
 +
    xlink:type="simple"
 +
/>
 +
</source>
 +
</div>
  
          &lt;metadata inherited=&quot;true&quot;&gt;
 
              &lt;serviceName&gt;OPeNDAP-Hyrax&lt;/serviceName&gt;
 
              &lt;authority&gt;opendap.org&lt;/authority&gt;
 
              &lt;dataType&gt;Random&lt;/dataType&gt;
 
          &lt;/metadata&gt;
 
    &lt;/datasetScan&gt;
 
  
    &lt;!-- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - --&gt;
+
:And the top of '''datasetScan''' catalog graph will be found at the URL
 +
::  http://localhost:8080/opendap/thredds/v27/Landsat/hyrax/catalog.xml 
  
    &lt;datasetScan location=&quot;/bes/hdf&quot; path=&quot;hdf&quot; name=&quot;hdf&quot; serviceName=&quot;OPeNDAP-Hyrax&quot;&gt;
+
; Absolute path examples
 +
: Consider a catalog accessed with the URL: http://localhost:8080/opendap/thredds/v27/Landsat/catalog.xml and that contains this '''datasetScan''' element:
  
      &lt;crawlableDatasetImpl className=&quot;opendap.bes.BESCrawlableDataset&quot; /&gt;
+
<div  style="padding-left: 20px;width: 95%;">
 +
<source lang="xml">
 +
<datasetScan name="DatasetScanExample" path="/hyrax" location="/data/nc" />
 +
</source>
 +
</div>
  
          &lt;filter&gt;
+
:In the client catalog the '''datasetScan''' becomes this '''catalogRef''' element:
              &lt;exclude wildcard=&quot;.*&quot; atomic=&quot;true&quot; collection=&quot;true&quot; /&gt;
 
              &lt;include wildcard=&quot;*.hdf&quot; /&gt;
 
          &lt;/filter&gt;
 
          &lt;addDatasetSize /&gt;
 
  
          &lt;metadata inherited=&quot;true&quot;&gt;
+
<div  style="padding-left: 20px;width: 95%;">
              &lt;serviceName&gt;OPeNDAP-Hyrax&lt;/serviceName&gt;
+
<source lang="xml">
              &lt;authority&gt;opendap.org&lt;/authority&gt;
+
<thredds:catalogRef
              &lt;dataType&gt;Random&lt;/dataType&gt;
+
    name="DatasetScanExample"
          &lt;/metadata&gt;
+
    xlink:title="DatasetScanExample"
    &lt;/datasetScan&gt;
+
    xlink:href="/hyrax/catalog.xml"
 +
    xlink:type="simple"
 +
/>
 +
</source>
 +
</div>
  
    &lt;!-- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - --&gt;
+
:Then the top of '''datasetScan''' catalog graph will be found at the URL
 +
:: http://localhost:8080/hyrax/catalog.xml
 +
: '''''Which is probably not what you want!''''' This '''catalogRef''' directs the catalog crawler away from the Hyrax THREDDS service and to an undefined (as far as Hyrax is concerned) endpoint, one that most likely will generate a 404 (Not Found) response from the Web Server.
  
    &lt;datasetScan location=&quot;/bes/ff&quot; path=&quot;ff&quot; name=&quot;ff&quot; serviceName=&quot;OPeNDAP-Hyrax&quot;&gt;
+
: When using absolute paths you must be sure to prefix the path with the Hyrax THREDDS service path or you will direct the clients away from the service. In these examples the Hyrax THREDDS service path would be ''/opendap/thredds/'' (look at the URLs in the above examples). If we change the '''datasetScan''' path attribute value to ''/opendap/thredds/myDatasetScan'':
  
      &lt;crawlableDatasetImpl className=&quot;opendap.bes.BESCrawlableDataset&quot; /&gt;
+
<div  style="padding-left: 20px;width: 95%;">
 +
<source lang="xml">
 +
<datasetScan name="DatasetScanExample" path="/opendap/thredds/myDatasetScan" location="/data/nc" />
 +
</source>
 +
</div>
  
          &lt;filter&gt;
+
:In the client catalog the '''datasetScan''' becomes this '''catalogRef''' element:
              &lt;exclude wildcard=&quot;.*&quot; atomic=&quot;true&quot; collection=&quot;true&quot; /&gt;
 
              &lt;include wildcard=&quot;*.dat&quot; /&gt;
 
          &lt;/filter&gt;
 
          &lt;addDatasetSize /&gt;
 
  
          &lt;metadata inherited=&quot;true&quot;&gt;
+
<div  style="padding-left: 20px;width: 95%;">
              &lt;serviceName&gt;OPeNDAP-Hyrax&lt;/serviceName&gt;
+
<source lang="xml">
              &lt;authority&gt;opendap.org&lt;/authority&gt;
+
<thredds:catalogRef
              &lt;dataType&gt;Random&lt;/dataType&gt;
+
    name="DatasetScanExample"
          &lt;/metadata&gt;
+
    xlink:title="DatasetScanExample"
    &lt;/datasetScan&gt;
+
    xlink:href="/opendap/thredds/myDatasetScan/catalog.xml"
 +
    xlink:type="simple"
 +
/>
 +
</source>
 +
</div>
  
    &lt;!-- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - --&gt;
+
:Now the top of '''datasetScan''' catalog graph will be found at the URL
&lt;/catalog&gt;
+
:: http://localhost:8080/opendap/thredds/myDatasetScan/catalog.xml
</pre>
+
: which keeps the URL referencing the Hyrax THREDDS service and not some other part of the web service stack.
  
 +
=== useHyraxServices attribute ===
 +
The Hyrax version of the '''datasetScan''' element employs the extra attribute '''useHyraxServices'''.  This allows the '''datasetScan''' to  automatically generate Hyrax data services definitions and access links for datasets in the catalog. The '''datasetScan''' can be used to augment the list of services (when '''useHyraxServices''' is set to true) or it can be used to completely replace the Hyrax service stack (when '''useHyraxServices''' is set to false).
 +
* If no services are referenced in the '''datasetScan''' and '''useHyraxServices''' is set to true, then Hyrax will provide catalogs with service definitions and access elements for all the datasets that the BES identifies as data.
 +
* If no services are referenced in the '''datasetScan''' and '''useHyraxServices''' is set to false, then the catalogs generated by the '''datasetScan''' will have ''no service definitions or access elements''.
 +
By default '''useHyraxServices''' is set to true.
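
As a minimal sketch of the second mode, the fragment below sets '''useHyraxServices''' to false and references a single service through inherited metadata, so that only that service would be attached to the generated catalogs. The service name ''myHTTPService'', its ''base'' value, and the catalog name are hypothetical placeholders, and the exact way Hyrax resolves referenced services should be confirmed against your installation.

<div  style="padding-left: 20px;width: 95%;">
<source lang="xml">
<catalog name="useHyraxServices sketch"
        xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
        xmlns:xlink="http://www.w3.org/1999/xlink">

    <!-- Hypothetical service definition: the name and base are placeholders -->
    <service name="myHTTPService" serviceType="HTTPServer" base="/opendap/hyrax/" />

    <!-- useHyraxServices="false" turns off the automatic Hyrax service definitions -->
    <datasetScan name="DatasetScanExample" path="hyrax" location="/data/nc"
                 useHyraxServices="false">
        <!-- Reference the service defined above for every dataset in the scan -->
        <metadata inherited="true">
            <serviceName>myHTTPService</serviceName>
        </metadata>
    </datasetScan>

</catalog>
</source>
</div>

Leaving '''useHyraxServices''' out (or setting it to true) in the same fragment would, per the description above, add the Hyrax services alongside the referenced one rather than replace them.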
  
-----
+
=== Functions ===
 +
[http://www.unidata.ucar.edu/software/thredds/v4.6/tds/reference/DatasetScan.html DatasetScan allows you to apply the following functions to the names of the datasets in the datasetScan catalog graph.]
  
=In Particular Note The Following=
+
==== Filter ====
'''1.''' The line in which the ''CrawlableDataset'' implementation is defined:
+
: A datasetScan element can specify which files and directories it will include with a filter element (also [http://www.unidata.ucar.edu/software/thredds/v4.6/tds/catalog/InvCatalogServerSpec.html see THREDDS server catalog spec] for details). The filter element allows users to specify which datasets are to be included in the generated catalogs. A filter element can contain any number of include and exclude elements. Each include or exclude element may contain either a wildcard or a regExp attribute. If the given wildcard pattern or regular expression matches a dataset name, that dataset is included or excluded as specified. By default, includes and excludes apply only to atomic datasets (regular files). You can specify that they apply to atomic and/or collection datasets (directories) by using the atomic and collection attributes.
<pre>
 
    &lt;crawlableDatasetImpl className="opendap.bes.BESCrawlableDataset" /&gt;
 
</pre>
 
Identifies the correct ''CrawlableDataset'' class for Hyrax - the one that works with the BES to automatically generate catalogs.  
 
  
 +
<div  style="padding-left: 20px;width: 95%;">
 +
<source lang="xml">
 +
<filter>
 +
    <exclude wildcard="*not_currently_supported" />
 +
    <include regExp="/data/h5/dir2" collection="true" />
 +
</filter>
 +
</source>
 +
</div>
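
As a further sketch of the ''atomic'' and ''collection'' attributes, the filter below hides names that begin with a dot at both the file and directory level and then includes only files ending in ''.nc''. The patterns are illustrative assumptions and should be adjusted to the data types your BES actually serves.

<div  style="padding-left: 20px;width: 95%;">
<source lang="xml">
<filter>
    <!-- Applies to regular files and to directories whose names start with "." -->
    <exclude wildcard=".*" atomic="true" collection="true" />
    <!-- No atomic/collection attributes: applies to regular files only (the default) -->
    <include wildcard="*.nc" />
</filter>
</source>
</div>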
  
'''2.''' In the <''datasetScan''> element the location attribute's value '''MUST''' begin with ''/bes''. So, if the top level collection in the BES contains 4 sub collections they may each be identified using a separate <''datasetScan''> element like so:
+
==== Sort ====
<pre>
+
: Datasets at each collection level are listed in ascending order by name. With a sort element you can specify that they are to be sorted in reverse order:
&lt;datasetScan location=&quot;/bes/nc&quot; path=&quot;nc&quot; name=&quot;nc&quot;  serviceName=&quot;OPeNDAP-Hyrax&quot;&gt; . . . &lt;/datasetScan&gt;
+
<div style="padding-left: 20px;width: 95%;">
&lt;datasetScan location=&quot;/bes/hdf&quot; path=&quot;hdf&quot; name=&quot;hdf&quot;  serviceName=&quot;OPeNDAP-Hyrax&quot;&gt; . . . &lt;/datasetScan&gt;
+
<source lang="xml">
&lt;datasetScan location=&quot;/bes/jg&quot;  path=&quot;jg&quot;  name=&quot;jg&quot;  serviceName=&quot;OPeNDAP-Hyrax&quot;&gt; . . . &lt;/datasetScan&gt;
+
<sort>
&lt;datasetScan location=&quot;/bes/ff&quot; path=&quot;ff&quot;  name=&quot;ff&quot;  serviceName=&quot;OPeNDAP-Hyrax&quot;&gt; . . . &lt;/datasetScan&gt;
+
    <lexigraphicByName increasing="false" />
</pre> Where each <''datasetScan''> element may have its own filter and inheritance rules. You MUST NOT lump them all into one <''datasetScan''> element with one set of filter rules like so:
+
</sort>
<pre> &lt;datasetScan location=&quot;/bes&quot;     path=&quot;DATA&quot;  name=&quot;DATA&quot;  serviceName=&quot;OPeNDAP-Hyrax&quot;&gt; . . . &lt;/datasetScan&gt;</pre> Because it does not work. If you want them all to be in a single collection then configure the BES so that the BES.Catalog.catalog.RootDirectory and BES.Data.RootDirectory have a single top level collection (see Example 1).
+
</source>
 +
</div>
 +
==== Namer ====
 +
: If no namer element is specified, all datasets are named with the corresponding BES catalog dataset name. By adding a namer element, you can specify more human readable dataset names.
 +
<div style="padding-left: 20px;width: 95%;">
 +
<source lang="xml">
 +
<namer>
 +
    <regExpOnName regExp="/data/he/dir1" replaceString="AVHRR" />
 +
    <regExpOnName regExp="(.*)\.h5" replaceString="$1.hdf5" />
 +
     <regExpOnName regExp="(.*)\.he5" replaceString="$1.hdf5_eos" />
 +
    <regExpOnName regExp="(.*)\.nc" replaceString="$1.netcdf" />
 +
</namer>
 +
</source>
 +
</div>
  
 +
==== addTimeCoverage ====
 +
: A datasetScan element may contain an addTimeCoverage element. The addTimeCoverage element indicates that a timeCoverage metadata element should be added to each dataset in the collection and describes how to determine the time coverage for each dataset in the collection.
 +
<div  style="padding-left: 20px;width: 95%;">
 +
<source lang="xml">
 +
<addTimeCoverage
 +
    datasetNameMatchPattern="([0-9]{4})([0-9]{2})([0-9]{2})([0-9]{2})_gfs_211.nc$"
 +
    startTimeSubstitutionPattern="$1-$2-$3T$4:00:00"
 +
    duration="60 hours"
 +
/>
 +
</source>
 +
</div>
 +
: for the dataset named '''2005071812_gfs_211.nc''', results in the following timeCoverage element:
 +
<div  style="padding-left: 20px;width: 95%;">
 +
<source lang="xml">
 +
<timeCoverage>
 +
    <start>2005-07-18T12:00:00</start>
 +
    <duration>60 hours</duration>
 +
  </timeCoverage>
 +
</source>
 +
</div>
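
: The same mechanism can be sketched for a daily product. The file name ''20150429_sst.nc'' and the 24 hour duration below are hypothetical, chosen only to show how the capture groups feed the substitution pattern:

<div  style="padding-left: 20px;width: 95%;">
<source lang="xml">
<!-- $1 = year, $2 = month, $3 = day, captured from names like 20150429_sst.nc -->
<addTimeCoverage
    datasetNameMatchPattern="([0-9]{4})([0-9]{2})([0-9]{2})_sst.nc$"
    startTimeSubstitutionPattern="$1-$2-$3T00:00:00"
    duration="24 hours"
/>
</source>
</div>

: For the dataset named '''20150429_sst.nc''' the three groups capture ''2015'', ''04'', and ''29'', so the start time becomes ''2015-04-29T00:00:00'' with a duration of ''24 hours''.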
  
'''3.''' The path attribute in the <''datasetScan''> element appears in the URL after the servlet name, and '''MUST''' be the same as the value of the location attribute with the leading "''/bes/''" removed. In other words it '''MUST NOT''' start with a "/" character .
+
==== addProxies ====
 +
: For real-time data you may want to have a special link that points to the "latest" data in the collection. Here, "latest" simply means the last filename in a list sorted by name, so it is only the latest if the time stamp is in the filename and the name sorts correctly by time.
 +
<div  style="padding-left: 20px;width: 95%;">
 +
<source lang="xml">
 +
<addProxies>
 +
    <simpleLatest name="simpleLatest" />
 +
    <latestComplete name="latestComplete" lastModifiedLimit="60.0" />
 +
</addProxies>
 +
</source>
 +
</div>
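
The functions above can be combined in a single '''datasetScan'''. The following is a sketch only: it reuses the ''name'', ''path'', and ''location'' values and the child elements from the examples on this page, and the order of the child elements is an assumption that should be checked against the server catalog specification.

<div  style="padding-left: 20px;width: 95%;">
<source lang="xml">
<datasetScan name="DatasetScanExample" path="hyrax" location="/data/nc">

    <!-- Only list NetCDF files -->
    <filter>
        <include wildcard="*.nc" />
    </filter>

    <!-- Present .nc files with a more readable suffix -->
    <namer>
        <regExpOnName regExp="(.*)\.nc" replaceString="$1.netcdf" />
    </namer>

    <!-- List datasets in descending order by name -->
    <sort>
        <lexigraphicByName increasing="false" />
    </sort>

    <!-- Add a link to the last dataset in the name-sorted list -->
    <addProxies>
        <simpleLatest name="simpleLatest" />
    </addProxies>

    <!-- Derive a timeCoverage element from the time stamp in each file name -->
    <addTimeCoverage
        datasetNameMatchPattern="([0-9]{4})([0-9]{2})([0-9]{2})([0-9]{2})_gfs_211.nc$"
        startTimeSubstitutionPattern="$1-$2-$3T$4:00:00"
        duration="60 hours"
    />

</datasetScan>
</source>
</div>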

Latest revision as of 12:21, 29 April 2015

1 Overview

Hyrax now uses its own implementation of the THREDDS catalog services and supports most of the THREDDS catalog service stack. The implementation relies on two DispatchHandlers in the OLFS and utilizes XSLT to provide HTML versions (presentation views) for human consumption.

  1. Dynamic THREDDS catalogs for holdings provided by the BES are provided by the opendap.bes.BESThreddsDispatchHandler.
  2. Static THREDDS catalogs are provided by the opendap.threddsHandler.StaticCatalogDispatch. The static catalogs allow catalog "graphs" to be decoupled from the filesystem "graph" of the data holdings, thus allowing data providers the ability to present and organize data collections independently of how they are organized in the underlying filesystem.

Static THREDDS catalogs are "rooted" in a master catalog file, catalog.xml, located in the (persistent) content directory for the OLFS (Typically $CATALINA_HOME/content/opendap). The default catalog.xml that comes with Hyrax contains a simple catalogRef element that points to the dynamic THREDDS catalogs generated from the BES holdings. The default catalog example also contains a (commented out) datasetScan element that provides (if enabled) a simple demonstration of the datasetScan capabilities. Additional catalog components may be added to the catalog.xml file to build (potentially large) static catalogs.

  • THREDDS datasetScan elements are now fully supported and can be used as a tool for altering the catalog presentation of any part of the BES catalog. These alterations include (but are not limited to) renaming, auto proxy generation, filtering, and metadata injection.

More details about the handlers, their configuration options, and other information can be found here.


1.1 THREDDS Catalog Documentation

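Rather than provide an exhaustive explanation of the THREDDS catalog functionality and configuration I will appeal to the existing documents provided by our fine colleagues at UNIDATA:

  • Catalog Basics: http://www.unidata.ucar.edu/projects/THREDDS/tech/TDS.html#Catalogs
  • Client Catalog Specification: http://www.unidata.ucar.edu/projects/THREDDS/tech/catalog/InvCatalogSpec.html Describes what THREDDS catalog components should be produced by servers.
  • Catalog Primer: http://www.unidata.ucar.edu/software/thredds/current/tds/tutorial/CatalogPrimer.html
  • Server catalog specification: http://www.unidata.ucar.edu/software/thredds/v4.6/tds/catalog/InvCatalogServerSpec.html#datasetScan_Element Can help you understand the rules for constructing proper datasetScan elements in your catalog.
  • dataset Element: http://www.unidata.ucar.edu/projects/THREDDS/tech/catalog/InvCatalogSpec.html#dataset
  • datasetScan configuration: http://www.unidata.ucar.edu/software/thredds/v4.6/tds/reference/DatasetScan.html (applies to Hyrax as well).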



2 Configuration Instructions

  • The current default (olfs.xml) file comes with THREDDS configured correctly.
  • The THREDDS master catalog is stored in the file $CATALINA_HOME/content/opendap/catalog.xml. It can be edited to provide additional static catalog access.

3 datasetScan Support

The datasetScan element is a powerful tool that can be used to sculpt the catalog's presentation of the BES catalog content. The Hyrax implementation has a couple of key points that need to be considered when developing an instance of the datasetScan element in your THREDDS catalog.

3.1 location attribute

The location attribute specifies the place in the BES catalog graph that the datasetScan will be rooted. This value must be expressed relative to the BES catalog root (BES.Catalog.catalog.RootDirectory) and not in terms of the underlying BES host file system.

Example
If BES.Catalog.catalog.RootDirectory=/usr/share/hyrax and the data directory to which you wish to apply the datasetScan is (in filesystem terms) located at /usr/share/hyrax/data/nc then the associated datasetScan element's location attribute would have a value of /data/nc
<datasetScan name="DatasetScanExample" path="hyrax" location="/data/nc" />

3.2 name attribute

The name attribute specifies the name that will be used in the presentation (HTML) view of the catalog containing the datasetScan.

3.3 path attribute

The path attribute specifies the place in the THREDDS catalog graph that the datasetScan will be rooted. In effect it is a relative URL for the service. If path begins with a "/" then it is an absolute path - rooted at the server and port of the web server. The values of the path attribute should NEVER contain "catalog.xml" or "catalog.html". The service will create these endpoints dynamically.

Relative path example
Consider a catalog accessed with the URL: http://localhost:8080/opendap/thredds/v27/Landsat/catalog.xml and that contains this datasetScan element:

<datasetScan name="DatasetScanExample" path="hyrax" location="/data/nc" />

In the client catalog the datasetScan becomes this catalogRef element:
<thredds:catalogRef
    name="DatasetScanExample"
    xlink:title="DatasetScanExample"
    xlink:href="hyrax/catalog.xml"
    xlink:type="simple"
/>


And the top of the datasetScan catalog graph will be found at the URL
http://localhost:8080/opendap/thredds/v27/Landsat/hyrax/catalog.xml

Absolute path examples
Consider a catalog accessed with the URL: http://localhost:8080/opendap/thredds/v27/Landsat/catalog.xml and that contains this datasetScan element:
<datasetScan name="DatasetScanExample" path="/hyrax" location="/data/nc" />
In the client catalog the datasetScan becomes this catalogRef element:
<thredds:catalogRef
     name="DatasetScanExample"
     xlink:title="DatasetScanExample"
     xlink:href="/hyrax/catalog.xml"
     xlink:type="simple"
/>
Then the top of the datasetScan catalog graph will be found at the URL
http://localhost:8080/hyrax/catalog.xml
Which is probably not what you want! This catalogRef directs the catalog crawler away from the Hyrax THREDDS service and to an undefined (as far as Hyrax is concerned) endpoint, one that most likely will generate a 404 (Not Found) response from the web server.
When using absolute paths you must be sure to prefix the path with the Hyrax THREDDS service path or you will direct the clients away from the service. In these examples the Hyrax THREDDS service path would be /opendap/thredds/ (look at the URLs in the above examples). If we change the datasetScan path attribute value to /opendap/thredds/myDatasetScan:
<datasetScan name="DatasetScanExample" path="/opendap/thredds/myDatasetScan" location="/data/nc" />
In the client catalog the datasetScan becomes this catalogRef element:
<thredds:catalogRef
    name="DatasetScanExample"
    xlink:title="DatasetScanExample"
    xlink:href="/opendap/thredds/myDatasetScan/catalog.xml"
    xlink:type="simple"
/>
Now the top of the datasetScan catalog graph will be found at the URL
http://localhost:8080/opendap/thredds/myDatasetScan/catalog.xml
which keeps the URL referencing the Hyrax THREDDS service and not some other part of the web service stack.

3.4 useHyraxServices attribute

The Hyrax version of the datasetScan element employs the extra attribute useHyraxServices. This allows the datasetScan to automatically generate Hyrax data services definitions and access links for datasets in the catalog. The datasetScan can be used to augment the list of services (when useHyraxServices is set to true) or it can be used to completely replace the Hyrax service stack (when useHyraxServices is set to false).

  • If no services are referenced in the datasetScan and useHyraxServices is set to true, then Hyrax will provide catalogs with service definitions and access elements for all the datasets that the BES identifies as data.
  • If no services are referenced in the datasetScan and useHyraxServices is set to false, then the catalogs generated by the datasetScan will have no service definitions or access elements.

By default useHyraxServices is set to true.
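
The two cases can be sketched as follows. The name and path values are hypothetical and chosen only to tell the two elements apart; the location value is reused from the earlier examples.

```xml
<!-- Explicitly keep the automatically generated Hyrax services (same as the default). -->
<datasetScan name="WithHyraxServices" path="hyrax" location="/data/nc"
             useHyraxServices="true" />

<!-- Turn them off: with no services referenced in the datasetScan, the generated
     catalogs carry no service definitions or access elements. -->
<datasetScan name="WithoutHyraxServices" path="hyraxBare" location="/data/nc"
             useHyraxServices="false" />
```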

3.5 Functions

DatasetScan allows you to apply the following functions to the names of the datasets in the datasetScan catalog graph.
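
Each function is expressed as a child element of the datasetScan. The sketch below shows how several of them might be combined in one element; it reuses the attribute values from the earlier examples, the include pattern is illustrative only, and the order of the child elements follows the THREDDS catalog schema as best understood here (an assumption worth checking against the specification).

```xml
<datasetScan name="DatasetScanExample" path="hyrax" location="/data/nc">
    <!-- Filter: include only regular files whose names match the pattern. -->
    <filter>
        <include regExp=".*\.nc" />
    </filter>
    <!-- Namer: present "name.nc" as "name.netcdf". -->
    <namer>
        <regExpOnName regExp="(.*)\.nc" replaceString="$1.netcdf" />
    </namer>
    <!-- Sort: list datasets in descending order by name. -->
    <sort>
        <lexigraphicByName increasing="false" />
    </sort>
    <!-- Proxy: add a link to the last dataset in name order. -->
    <addProxies>
        <simpleLatest name="simpleLatest" />
    </addProxies>
</datasetScan>
```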

3.5.1 Filter

A datasetScan element can specify which files and directories it will include with a filter element (also see THREDDS server catalog spec for details). The filter element allows users to specify which datasets are to be included in the generated catalogs. A filter element can contain any number of include and exclude elements. Each include or exclude element may contain either a wildcard or a regExp attribute. If the given wildcard pattern or regular expression matches a dataset name, that dataset is included or excluded as specified. By default, includes and excludes apply only to atomic datasets (regular files). You can specify that they apply to atomic and/or collection datasets (directories) by using the atomic and collection attributes.
<filter>
    <exclude wildcard="*not_currently_supported" />
    <include regExp="/data/h5/dir2" collection="true" />
</filter>

3.5.2 Sort

Datasets at each collection level are listed in ascending order by name. With a sort element you can specify that they are to be sorted in reverse order:
<sort>
    <lexigraphicByName increasing="false" />
</sort>

3.5.3 Namer

If no namer element is specified, all datasets are named with the corresponding BES catalog dataset name. By adding a namer element, you can specify more human-readable dataset names.
<namer>
    <regExpOnName regExp="/data/he/dir1" replaceString="AVHRR" />
    <regExpOnName regExp="(.*)\.h5" replaceString="$1.hdf5" />
    <regExpOnName regExp="(.*)\.he5" replaceString="$1.hdf5_eos" />
    <regExpOnName regExp="(.*)\.nc" replaceString="$1.netcdf" />
</namer>

3.5.4 addTimeCoverage

A datasetScan element may contain an addTimeCoverage element. The addTimeCoverage element indicates that a timeCoverage metadata element should be added to each dataset in the collection and describes how to determine the time coverage for each dataset in the collection.
<addTimeCoverage 
    datasetNameMatchPattern="([0-9]{4})([0-9]{2})([0-9]{2})([0-9]{2})_gfs_211.nc$"
    startTimeSubstitutionPattern="$1-$2-$3T$4:00:00"
    duration="60 hours"
/>
Applied to the dataset named 2005071812_gfs_211.nc, this addTimeCoverage element results in the following timeCoverage element:
<timeCoverage>
    <start>2005-07-18T12:00:00</start>
    <duration>60 hours</duration>
</timeCoverage>

3.5.5 addProxies

For real-time data you may want to have a special link that points to the "latest" data in the collection. Here, "latest" simply means the last filename in a list sorted by name, so it is only the latest if the time stamp is in the filename and the names sort correctly by time.
<addProxies>
    <simpleLatest name="simpleLatest" />
    <latestComplete name="latestComplete" lastModifiedLimit="60.0" />
</addProxies>