Use cases for swath and time series aggregation


Use cases for satellite Swath and Time Series aggregation. Our general approach is to use the Sequence data type to aggregate granules from Swath and Time Series data sets (each kind with itself; mixing the two would be possible in principle, but is not the goal here). Data will be read from arrays and loaded into a Sequence object, where it will be filtered and concatenated with other Sequence objects. The result is the aggregate. Of course, this will have to be optimized...
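
To make that concrete, here is a minimal sketch in plain C++ of the read-filter-concatenate idea. Row, Sequence, and aggregate are illustrative names, not the libdap API:

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

// One row of the hypothetical sequence: coordinates plus a single
// dependent value. Real granules carry many dependent variables.
struct Row {
    double lat, lon, time, value;
};

// A 'sequence' built from one granule is just a vector of rows here.
using Sequence = std::vector<Row>;

// Filter and concatenate: keep only the rows inside a lat/time window
// and append them to the growing aggregate.
void aggregate(Sequence &result, const Sequence &granule,
               double lat0, double lat1, double t0, double t1) {
    std::copy_if(granule.begin(), granule.end(), std::back_inserter(result),
                 [=](const Row &r) {
                     return r.lat >= lat0 && r.lat <= lat1 &&
                            r.time >= t0 && r.time <= t1;
                 });
}
```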

Sample data

  • Level 3 data are easy to find. For example:

You can use Earthdata Search to find Level 3 data: https://search.earthdata.nasa.gov/search?m=0.0703125!0.140625!2!1!0!&ff=Subsetting+Services Click the icon next to any dataset, then open the "API Endpoints" tab; that will give you the OPeNDAP endpoint.

My notes:

  • Level 3 data will be DAP2 Grids or DAP4 Coverages and look like they can easily be aggregated using NcML. We might think about a function that could aggregate them, but it's not in scope for this task.
  • The GLAS data are stored in one-dimensional arrays. These are time series data: HDF5_GLOBAL.featureType: timeSeries. The GLAH files are HDF5 files. The one I looked at has 1Hz, 40Hz and 0.25Hz (4s) data. For each of the sample rates, there are d_lat, d_lon and UTCTime arrays along with a large number of dependent variables in arrays. There are also some browse images. For some of the time series data there are two dims, where the second dim provides cloud layer info (that is, values were gathered for cloud top and bottom for each of 10 layers).
    • Suppose we want to aggregate a bunch of granules of these data? We can build a table of lat, lon, time[, cloud layer] and zero or more dependent variables for each granule, concatenate the tables, and filter them (a sketch of the table-building step follows this list). Optimizations include filtering before concatenating and, further, reading only data that would pass the filter in the first place.
    • By including a granule name and using nested Sequences, we can include useful metadata and make it easier to transform the resulting sequence back into an array. The nested Sequence could be flattened for the return (as DAP binary or CSV).
  • The MODIS data are typical MODIS L2 products, with a number of dependent vars in 2D arrays plus two more 2D arrays, one each for lat and lon.
    • We could read these data into a table with lat, lon and zero or more dependent values, then concatenate and filter. Optimizations are to read just the data needed and/or filter before concatenation. We could add granule and array index information to simplify the transformation back from the sequence to an array.
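
Here is a sketch of that table-building step. The variable names d_lat, d_lon and UTCTime come from the GLAS granule described above; GlasRow, build_table, and the single dependent variable are illustrative assumptions:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One flattened row: granule name (provenance, useful for transforming
// the sequence back into arrays) plus coordinate and dependent values.
struct GlasRow {
    std::string granule;
    double lat, lon, time;  // from d_lat, d_lon, UTCTime
    double dependent;       // one of the many dependent variables
};

// Zip the parallel 1-D arrays of one granule into rows. All four vectors
// are assumed to share the same length (one element per sample).
std::vector<GlasRow> build_table(const std::string &granule_name,
                                 const std::vector<double> &d_lat,
                                 const std::vector<double> &d_lon,
                                 const std::vector<double> &utc_time,
                                 const std::vector<double> &dependent) {
    std::vector<GlasRow> table;
    table.reserve(d_lat.size());
    for (std::size_t i = 0; i < d_lat.size(); ++i)
        table.push_back({granule_name, d_lat[i], d_lon[i],
                         utc_time[i], dependent[i]});
    return table;
}
```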

I'm going to close this spike. The larger task in this sprint for this aggregation topic is to design the function; I'm going to write up some use cases and ask Patrick if they describe his needs.

Here are some URLs I used to get data:

Use cases

Design

There are two parts to both the CSV- and tar-ball-response solutions. As it happens, the two primary use cases - get swath data as CSV and get swath data as netCDF files - will likely be implemented using somewhat different code in both the OLFS and BES. However, the narrative for the two use cases' designs is roughly the same. A new web service endpoint will be made to process the requests. The requests will be made using HTTP POST, with the list of granules enumerated in the request body and query parameters supplying the remaining information. Once the OLFS has parsed the information, it will use the BES to access the data and build the response, with two variations.
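
To illustrate, here is a hedged sketch of such a request using libcurl. The endpoint path, the query parameter names, and the one-granule-per-line body convention are all assumptions, since the interface has not been settled:

```cpp
#include <curl/curl.h>

int main() {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *curl = curl_easy_init();
    if (!curl) return 1;

    // Hypothetical endpoint: query parameters carry the variables, the
    // space-time ROI, and the return format.
    curl_easy_setopt(curl, CURLOPT_URL,
        "https://example.org/opendap/aggregation"
        "?vars=d_lat,d_lon,UTCTime&bbox=-10,30,-5,35&format=csv");

    // The granule list goes in the POST body, one granule per line
    // (an assumed convention).
    const char *granules =
        "/data/granule1.h5\n"
        "/data/granule2.h5\n";
    curl_easy_setopt(curl, CURLOPT_POSTFIELDS, granules);

    CURLcode res = curl_easy_perform(curl);
    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return res == CURLE_OK ? 0 : 1;
}
```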

For the CSV response, the BES will actually access each granule and combine the data from them into a single response. In this case the OLFS will call the BES once, passing in each granule using a BES request with multiple containers, one per granule, and requesting the response as CSV or ASCII. In addition, the request will need to pass the array variables as parameters of a server function that can form them into a table, along with a constraint expression that selects the requested space-time values.

For the netCDF response, the OLFS will iterate over the N granules, making N discrete requests for data from the named arrays within a space-time ROI. For each request, the OLFS will specify the return format as 'netCDF'. It will then collect the resulting netCDF response documents and bundle them using tar/gz or zip and return the result to the client.
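
Here is what the OLFS's per-granule loop amounts to, sketched as a standalone libcurl client. The server host, the '.nc' suffix convention for the netCDF return, and the constraint are placeholders:

```cpp
#include <cstdio>
#include <string>
#include <vector>
#include <curl/curl.h>

int main() {
    // Placeholder granule names; in practice these come from the request body.
    std::vector<std::string> granules = {"granule1.h5", "granule2.h5"};
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *curl = curl_easy_init();
    if (!curl) return 1;

    for (std::size_t i = 0; i < granules.size(); ++i) {
        // One discrete request per granule, asking for the netCDF response
        // and constraining to the arrays of interest.
        std::string url = "https://example.org/opendap/data/" + granules[i]
                          + ".nc?d_lat,d_lon,UTCTime";
        std::string out = "part_" + std::to_string(i) + ".nc";
        FILE *f = std::fopen(out.c_str(), "wb");
        if (!f) continue;
        curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, f);  // default callback writes to f
        curl_easy_perform(curl);
        std::fclose(f);
    }
    curl_easy_cleanup(curl);
    curl_global_cleanup();
    // The N part_*.nc files would then be bundled with tar/gz or zip.
    return 0;
}
```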

Alternate version for the netCDF return implementation: We may be able to use the BES stored result feature to eliminate the multiple trips to the BES. More investigation is needed.

More detail about the two responses

Here is a Sequence diagram for the BES code used to build the CSV response:

[Figure: BES Array to Sequence.png - sequence diagram of the Array-to-Sequence step in the BES]

In the figure, the BES iterates over each of the N containers and calls a function that takes M arrays and forms them into a single Sequence (DAP2 or DAP4; both are implemented now). One important feature that is not shown in the diagram is that all of the data from the arrays are read into the Sequence object, to be filtered later. A later optimization might drop that behavior, and some of the code used to build the netCDF response might help with it - see below. The initial version of the function will read all of the data from the arrays passed to it, assuming some later step will filter them.

Once the data values are all read, each of the sequences is wrapped in a Structure (this is how the BES represents containers). The AggregationServer will then take all of the containers, get their sequences, and merge them into a single sequence.

The single sequence will be the sole top-level variable in the response DDS/DMR. The ResponseBuilder object will serialize it, routing the data through a CSV/ASCII transmitter.
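
Reusing the Row sketch from the introduction, here is a minimal illustration of the merge and CSV-serialization steps. merge_containers and write_csv are hypothetical names, not the actual AggregationServer or transmitter code:

```cpp
#include <iostream>
#include <map>
#include <string>
#include <vector>

struct Row { double lat, lon, time, value; };

// Merge the per-container sequences (one per granule, keyed here by
// container name) into the single top-level sequence of the response.
std::vector<Row> merge_containers(
        const std::map<std::string, std::vector<Row>> &containers) {
    std::vector<Row> merged;
    for (const auto &c : containers)
        merged.insert(merged.end(), c.second.begin(), c.second.end());
    return merged;
}

// Serialize the merged sequence as CSV, one row per line.
void write_csv(std::ostream &out, const std::vector<Row> &seq) {
    out << "lat,lon,time,value\n";
    for (const Row &r : seq)
        out << r.lat << ',' << r.lon << ',' << r.time << ',' << r.value << '\n';
}
```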

Issues: DAP2

Sequence::serialize will need to be specialized
hard
This is required to handle the case where other code loads all of the data values (the current version assumes that each row is read one at a time and the selection criteria are applied then; rows that don't match the selection criteria never become part of the Sequence object). The specialized function, however, must assume that the Sequence will filter out the unwanted values after they have been loaded into memory (a sketch of that filter-after-load step follows this list). This could be quite complex.
The AggregationServer must be written
easy
Take the DDS with N Structures, extract the Sequences, and concatenate them.
I'm not sure where in the DAP2 code the CE selection will be performed
spike
This could be important
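
Here is the filter-after-load sketch referenced above, written as plain C++ over the earlier Row type; it is not the actual Sequence::serialize specialization:

```cpp
#include <algorithm>
#include <functional>
#include <vector>

struct Row { double lat, lon, time, value; };

// The stock serialize applies the selection row by row as rows are read;
// the specialized version instead receives fully loaded rows and must
// drop the ones that fail the selection before transmitting.
void filter_loaded_rows(std::vector<Row> &rows,
                        const std::function<bool(const Row &)> &selection) {
    rows.erase(std::remove_if(rows.begin(), rows.end(),
                              [&](const Row &r) { return !selection(r); }),
               rows.end());
}
```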

Issues: DAP4

D4Sequence::serialize must be specialized
easy to moderate
Along the lines of the DAP2 case, with the only real difference that the DAP4 sequence code is much simpler.
CE Filters have yet to be implemented in DAP4
moderate
Must implement the grammar and the evaluator.
I'm not sure where in the code the filter operation will be performed
spike
However, in DAP4, a server function is defined as building a new DMR to which the CE is then applied. Not sure how containers will affect this.
The AggregationServer must be written
easy
...and likely the same code as for DAP2
Containers are not yet supported by the DMR class
moderate
Could copy the DDS implementation...