DMR++


The dmr++ Experience

How to build & deploy dmr++ files for Hyrax

Table of Contents

  • What and Why
  • How It Works
  • Supported formats and data compatibility
  • Build dmr++ files using the get_dmrpp tool.
  • Default Configuration
  • The H5.EnableCF option
  • CF and Missing Data
  • Configuration Injection

What? Why?

dmr++ is a fast and flexible way to serve data stored in S3. A dmr++ file encodes the location of the data content within a binary data file/object (e.g., an hdf5 file) so that the data can be accessed directly, without the need for an intermediate library API, using only the file's location information. The binary data objects may be on a local filesystem, or they may reside across the web in something like an S3 bucket.


How Does It Work?

The dmr++ software reads a data file¹ and builds a document that holds all of the file's metadata (the names and types of all of the variables, along with any other information bound to those variables). This information is stored in a document called the Dataset Metadata Response (DMR). The dmr++ adds extra information to the DMR (that's the "++" part) describing where each variable's values can be found and how to decode them; for this reason the dmr++ is also called an annotated DMR. This effectively decouples the annotated DMR (dmr++) from the location of the granule file itself. Since dmr++ files are typically much smaller than the source data granules they represent, they can be stored and moved at less expense. They also enable reading all of the file's metadata in one operation instead of the iterative process that many APIs require. If the dmr++ contains references to the source granule's location on the web, the location of the dmr++ file itself does not matter.

¹ The software currently supports HDF5 and NetCDF-4. Support for other formats, such as zarr, can be added.
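To make this concrete, here is an illustrative sketch of the kind of annotation a dmr++ file carries. It is simplified and abbreviated from real dmr++ output, and the variable name, URL, offsets, and sizes are invented for the example:

   <Dataset name="granule.h5" dmrpp:href="https://example-bucket.s3.amazonaws.com/granule.h5">
       <Float32 name="sea_surface_temp">
           <Dim size="40"/>
           <dmrpp:chunks compressionType="deflate">
               <dmrpp:chunk offset="4016" nBytes="1639" chunkPositionInArray="[0]"/>
           </dmrpp:chunks>
       </Float32>
   </Dataset>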



Software that understands the dmr++ content can directly access the data values held in the source granule file, and it can do so without having to retrieve the entire file and work on it locally, even when the file is stored in a Web Object Store like S3.

If the granule file contains multiple variables and only a subset of them are needed, dmr++-enabled software can retrieve just the bytes associated with the desired variables' values.
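Under the hood, this subsetting amounts to ordinary HTTP range requests against the object store. As a conceptual sketch, reusing the invented chunk offset and size from the fragment above (a real dmr++ reader computes the byte range from the recorded chunk information):

   # Fetch only the ~1.6 KB holding one compressed chunk of one variable
   $ curl -s -r 4016-5654 https://example-bucket.s3.amazonaws.com/granule.h5 -o chunk.bin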


Supported Data Formats

The dmr++ software currently works with hdf5 and netcdf-4 files. (The netcdf-4 format is built on hdf5, so netcdf-4 files are valid hdf5 files and the hdf5 tools can be used for both.)


Other formats, such as zarr, hdf4, and netcdf-3, are not currently supported by the dmr++ software, but support could be added if needed.

hdf5, but not actually everything

The hdf5 data format is quite complex, and many of its options and edge cases are not currently supported by the dmr++ software. The next few sections explain these limitations and show how to quickly evaluate an hdf5 or netcdf-4 file for use with the dmr++ software.


hdf5 filters

The hdf5 format has several filter/compression options used for storing data values. The dmr++ software currently supports data that utilize the H5Z_FILTER_DEFLATE and H5Z_FILTER_SHUFFLE filters. You can find more on hdf5 filters here:

  https://support.hdfgroup.org/HDF5/doc/RM/RM_H5Z.html 

hdf5 storage layouts

The hdf5 format also defines a number of "storage layouts" that describe the structural organization of the data values associated with a variable in the granule file. The dmr++ software currently supports data that use the H5D_COMPACT, H5D_CHUNKED, and H5D_CONTIGUOUS storage layouts; support for other layouts can be added if needed. You can find more on hdf5 storage layouts here:

   https://support.hdfgroup.org/HDF5/doc1.6/Datasets.html

Is my hdf5 or netcdf-4 file suitable for dmr++?

To determine the filters, storage layouts, and chunking scheme used in an hdf5 or netcdf-4 file, use the command:

   h5dump -H -p <filename>

This produces a human-readable assessment of the file showing the storage layout, chunking structure, and filters needed for each variable (aka DATASET in the hdf5 vocabulary). h5dump documentation:

   https://support.hdfgroup.org/HDF5/doc/RM/Tools.html#Tools-Dump

h5dump example output

 $ h5dump -H -p chunked_gzipped_fourD.h5
 HDF5 "chunked_gzipped_fourD.h5" {
 GROUP "/" {
    DATASET "d_16_gzipped_chunks" {
       DATATYPE  H5T_IEEE_F32LE
       DATASPACE  SIMPLE { ( 40, 40, 40, 40 ) / ( 40, 40, 40, 40 ) }
       STORAGE_LAYOUT {
          CHUNKED ( 20, 20, 20, 20 )
          SIZE 2863311 (3.576:1 COMPRESSION)
       }
       FILTERS {
          COMPRESSION DEFLATE { LEVEL 6 }
       }
       FILLVALUE {
          FILL_TIME H5D_FILL_TIME_ALLOC
          VALUE  H5D_FILL_VALUE_DEFAULT
       }
       ALLOCATION_TIME {
          H5D_ALLOC_TIME_INCR
       }
    }
 }
 }


Is my netcdf file netcdf-3 or netcdf-4?

It is an unfortunate state of affairs that the file suffix ".nc" is commonly used for both netcdf-3 and netcdf-4 files. You can use the command "ncdump -k <filename>" to determine whether a netcdf file is classic netcdf-3 (the command prints "classic") or netcdf-4 (it prints "netCDF-4").*

netcdf guide:

   http://www.bic.mni.mcgill.ca/users/sean/Docs/netcdf/guide.txn_79.html

* For this to work, the netcdf library must be installed on the system where the command is issued.
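For example (the file names here are hypothetical; the outputs shown are the strings ncdump prints for each format):

   $ ncdump -k old_granule.nc
   classic
   $ ncdump -k new_granule.nc
   netCDF-4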

Building dmr++ files with get_dmrpp

The application that builds dmr++ files is called get_dmrpp. It in turn uses other executables (build_dmrpp, reduce_mdf, and merge_dmrpp) along with a number of UNIX shell commands. You can see the get_dmrpp usage statement with the command:

   get_dmrpp -h

Using get_dmrpp

The way that get_dmrpp is invoked controls the way that the data are ultimately represented in the resulting dmr++ file(s).


The get_dmrpp application utilizes software from the Hyrax data server to produce the base DMR document which is used to construct the dmr++ file.


The Hyrax server has a long list of configuration options, several of which can substantially alter the structural and semantic representation of the dataset as seen in the dmr++ files generated using those options.


Using get_dmrpp, Existing Hyrax DAACs

If your group is already serving data with Hyrax and the data representations that are generated by your Hyrax server are satisfactory, then a careful inspection of the localized configuration, typically held in /etc/bes/site.conf, will help you determine what configuration state you may need to inject into get_dmrpp.


Using get_dmrpp default configuration

Here are some of the most pertinent default configuration parameters:

   H5.EnableCF=true
   H5.EnableDMR64bitInt=true
   H5.DefaultHandleDimension=true
   H5.KeepVarLeadingUnderscore=false
   H5.EnableCheckNameClashing=true
   H5.EnableAddPathAttrs=true
   H5.EnableDropLongString=true
   H5.DisableStructMetaAttr=true
   H5.EnableFillValueCheck=true
   H5.CheckIgnoreObj=false

Using get_dmrpp, H5.EnableCF

The default get_dmrpp configuration option

   H5.EnableCF=true

instructs the tool to produce Climate and Forecast convention¹ (CF) compatible output based on metadata found in the granule file being processed. Changing the value of H5.EnableCF to false prevents get_dmrpp from attempting to make the results CF compliant. Doing so also makes visible any Group hierarchies in the underlying data granule; these would otherwise be suppressed, because CF does not yet support Groups.

¹ https://cfconventions.org/
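As a sketch of how such a setting can be injected when building dmr++ files (the addendum file name here is hypothetical, the exact argument order should be confirmed with get_dmrpp -h, and the -s switch is described under "Inputs" below):

   # Hypothetical addendum file that disables the CF-oriented output
   $ cat cf_off.conf
   H5.EnableCF=false

   # The addendum is loaded last, so its settings override the defaults
   $ get_dmrpp -s cf_off.conf -b /usr/share/data -o granule.h5.dmrpp granule.h5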

Missing data, the CF conventions and hdf5

Many of the hdf5 files produced by NASA and others do not contain the domain coordinate data (such as latitude, longitude, and time) as a collection of explicit values. Instead, information contained in the dataset metadata can be used to reproduce these values. For a dataset to be Climate and Forecast (CF) compatible, it must contain these domain coordinate values. The Hyrax hdf5_handler software, used by the get_dmrpp application, can create these data from the dataset metadata. The get_dmrpp application places the generated data in a "sidecar" file for deployment alongside the source hdf5/netcdf-4 file.

get_dmrpp command line switches

The command line switches provide a way to control the output of the tool. In addition to common options like verbose output and testing, the tool provides options to build extra (aka "sidecar") data files that hold information needed for CF compliance when the original hdf5 data files lack that information (see the sections about missing data). It is also often desirable to build dmr++ files before the source data files are uploaded to S3; in that case the URL to the data may not be known when the dmr++ is built. This is supported by using a placeholder/template in the dmr++ and substituting the URL at runtime, when the dmr++ file is evaluated. See the -u and -p options in the sections that follow.

Verbose Output Modes

-h: Show the help/usage page.
-v: Verbose: print the DMR too.
-V: Very verbose: print the DMR, the command, and the configuration file used to build the DMR, and do not remove temporary files.
-D: Just print the DMR that will be used to build the dmr++.


Inputs

-b: The fully qualified path to the BES_DATA_ROOT directory. May not be "/" or "/etc". The default value is /tmp if no value is provided. All data files to be processed must be in this directory or one of its subdirectories.
-u: The location of the binary data object. Its value must be an http://, https://, or file:// URL. This URL is injected into the dmr++ when it is constructed. If -u is not used, the template string OPeNDAP_DMRpp_DATA_ACCESS_URL is written into the dmr++ and a value is substituted at runtime.
-c: The path to an alternate BES configuration file to use.
-s: The path to an optional addendum configuration file that is appended to the default BES configuration. Much as the site.conf file works for a full server deployment, it is loaded last and its settings override the default configuration.

Output

-o: The name of the file to create.


Tests

-T: Run ALL hyrax tests on the resulting dmr++ file and compare the responses to the ones generated from the source hdf5 file.
-I: Run hyrax inventory tests on the resulting dmr++ file and compare the responses to the ones generated from the source hdf5 file.
-F: Run hyrax value probe tests on the resulting dmr++ file and compare the responses to the ones generated from the source hdf5 file.

Missing data

-M: Build a "sidecar" file that holds the missing information needed for CF compliance (e.g., latitude, longitude, and time coordinate data).
-p: Provide the URL for the missing data sidecar file. If this is not given (but -M is), a template value is used in the dmr++ file and a real URL is substituted at runtime.
-r: The path to the file that contains missing variable information for sets of input data files that share common missing variables. The file will be created if it does not exist, and the result may be used in subsequent invocations of get_dmrpp (via -r) to identify the missing variable file.
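Putting the pieces together, here is a hypothetical end-to-end invocation. The directory, file names, and URL are invented for the example, and the exact argument order should be confirmed with get_dmrpp -h:

   # Build a dmr++ for a granule under the BES data root, generating a
   # missing-data sidecar file and recording the granule's eventual S3 URL
   $ get_dmrpp -b /usr/share/data \
         -u https://example-bucket.s3.amazonaws.com/granule.h5 \
         -M \
         -o granule.h5.dmrpp \
         granule.h5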






