BES File Out NetCDF

From OPeNDAP Documentation

Revision as of 21:55, 14 January 2009

General Questions and Assumptions

  • What version of netCDF will this support?

jimg 11:39, 21 December 2008 (PST) Initially we should support netCDF version 3

  • Should I traverse the data structure to see if there are any sequences?

jimg 17:53, 18 December 2008 (PST) Yes. An initial version should note their presence and add an attribute noting that they have been elided.

How to flatten hierarchical types

For a structure such as

Structure {
    Int x;
    Int y;
} Point;

represent that as

Point.x
Point.y

Explicitly including the dot may seem like an ugly kludge, but it means that the new variable name can be fed back into the server to get the data. That is, a client can look at the name of the variable and figure out how to ask for it again without knowing anything about the translation process.

Because this is hardly a lossless scheme (a variable might have a dot in its name...), we should also add attributes that record the original name of the variable, the fact that it is the result of a flattening operation, and the type (Structure, Sequence or Grid) and name of the parent variable. Given that, it should be easy to sort out how to make a future request for the data in the translated variable.

This in some way obviates the need for the dot, but I think we should use that regardless.
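The flattening scheme described above could be sketched as follows. This is only an illustration in Python, not the BES implementation: a Structure is modeled as a nested dict, and the attribute names original_name, flattened_from and parent_type are hypothetical placeholders for whatever names are eventually chosen.

```python
from typing import Any, Dict, Tuple

Flat = Dict[str, Tuple[Any, Dict[str, str]]]

def flatten(members: Dict[str, Any], prefix: str = "") -> Flat:
    """Flatten nested Structures (modeled as dicts) into dotted variable names."""
    flat: Flat = {}
    for name, value in members.items():
        full_name = f"{prefix}.{name}" if prefix else name
        if isinstance(value, dict):
            # Nested Structure: recurse, extending the dotted prefix.
            flat.update(flatten(value, full_name))
        else:
            attrs: Dict[str, str] = {}
            if prefix:
                # Hypothetical attribute names recording how the variable was produced.
                attrs = {
                    "original_name": name,
                    "flattened_from": prefix,
                    "parent_type": "Structure",
                }
            flat[full_name] = (value, attrs)
    return flat

dataset = {"Point": {"x": 3, "y": 4}, "time": 0}
flat = flatten(dataset)
for var_name, (value, attrs) in flat.items():
    print(var_name, value, attrs)
```

Top-level variables such as time pass through unchanged and get no extra attributes, since nothing was flattened to produce them.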

Attributes of flattened types/variables

If the structure Point has attributes, those should be copied to both the new variables (Point.x and Point.y). It's redundant but this way the recipient of the file gets all of the information held in the original data source. jimg 10:14, 6 January 2009 (PST) Added based on email from Patrick.
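As a rough sketch of that copying rule, assuming attributes are simple name/value maps: each flattened variable receives a copy of the parent Structure's attributes plus its own. The text does not say what happens when a member and its parent define the same attribute name; letting the member's own value win is an assumption made here.

```python
from typing import Dict

Attrs = Dict[str, str]

def flattened_attributes(struct_name: str,
                         struct_attrs: Attrs,
                         member_attrs: Dict[str, Attrs]) -> Dict[str, Attrs]:
    """Return the attributes of each flattened variable, keyed by dotted name."""
    result: Dict[str, Attrs] = {}
    for member, own in member_attrs.items():
        merged = dict(struct_attrs)   # redundant copy of the parent's attributes
        merged.update(own)            # assumption: the member's own value wins on a clash
        result[f"{struct_name}.{member}"] = merged
    return result

point_attrs = {"source": "gps"}
members = {"x": {"units": "m"}, "y": {"units": "m"}}
flat_attrs = flattened_attributes("Point", point_attrs, members)
print(flat_attrs)
```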

Extra data to be included

For a file format like netCDF it is possible to include data about the source data using its original data model as expressed using DAP. We could then describe where each variable in the file came from. This would be a good thing if we can do it in a light-weight way. I think it would also be a good thing to add an attribute to each variable that names where in the original data it came from so that client apps & users don't have to work too hard to sort out what has been changed to make the file.

Information About Specific Types

Strings

  • Add a dimension representing the maximum length of the string, named varname_len.
  • For a scalar, there will be one dimension for the length, and the value is written using nc_put_vara_text with type NC_CHAR.
  • For arrays, add an additional dimension for the maximum length and write the values using nc_put_vara_text with type NC_CHAR.

pwest 14:31, 7 January 2008 (MST) Received message from Russ Rew

Yes, that's fine and follows a loose convention for names of string-length dimensions for netCDF-3 character arrays.

For netCDF-4 of course, no string-length dimension is needed, as strings are supported as a netCDF data type.
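The string layout for netCDF-3 can be illustrated without the netCDF library. The sketch below only computes the extra dimension and the fixed-width rows that would then be handed to nc_put_vara_text; padding short strings with NUL bytes and measuring length in encoded bytes are assumptions, and the helper names are hypothetical.

```python
from typing import List, Tuple

def string_length_dimension(varname: str, values: List[str]) -> Tuple[str, int]:
    """Name and size of the extra dimension holding the maximum string length."""
    max_len = max((len(v.encode("utf-8")) for v in values), default=0)
    return f"{varname}_len", max_len

def to_char_rows(values: List[str], max_len: int) -> List[bytes]:
    """Pad every string to max_len bytes so it fits a fixed-width NC_CHAR row."""
    # Assumption: unused trailing bytes are filled with NUL.
    return [v.encode("utf-8").ljust(max_len, b"\x00") for v in values]

names = ["abc", "de"]
dim_name, dim_size = string_length_dimension("station", names)
rows = to_char_rows(names, dim_size)
print(dim_name, dim_size, rows)
```

A scalar string is the same case with a single value, so it ends up with just the length dimension.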

Structures

  • Flatten
  • Prefix each variable name with the name of the structure followed by a dot. Keep track of the nesting, as there might be embedded structures, grids, etc.

jimg 17:53, 18 December 2008 (PST) Use the procedure described above in How to flatten hierarchical types.

jimg 17:53, 18 December 2008 (PST) I would use a dot even though I know that dots in variable names are, in general, a bad idea. If we use underscores then it may be hard for clients to form a name that can be used to access values from a server based on the information in the file.

Grid

  • Flatten.
  • Use the name of the grid for the array of values
  • prepend the name of the grid plus an underscore to the names of each of the map vectors. jimg 11:31, 21 December 2008 (PST) A more sophisticated version might look at the values of two or more grids that use the same names and have the same type (e.g., Float64 lon[360]) and if they are the same, make them shared dimensions.
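A small sketch of the Grid naming rule, plus the comparison suggested for shared dimensions. Here a map vector is modeled as a (name, type, values) tuple; that representation and both function names are hypothetical, chosen only for illustration.

```python
from typing import List, Tuple

MapVector = Tuple[str, str, List[float]]  # (name, DAP type, values)

def flatten_grid(grid_name: str, map_names: List[str]) -> Tuple[str, List[str]]:
    """The array keeps the grid's name; each map gets the grid name plus an underscore."""
    return grid_name, [f"{grid_name}_{m}" for m in map_names]

def can_share_dimension(a: MapVector, b: MapVector) -> bool:
    """Maps from different grids could be shared if name, type and values all match."""
    return a[0] == b[0] and a[1] == b[1] and a[2] == b[2]

array_name, map_vars = flatten_grid("sst", ["lat", "lon"])
print(array_name, map_vars)

lon_a = ("lon", "Float64", [0.0, 1.0, 2.0])
lon_b = ("lon", "Float64", [0.0, 1.0, 2.0])
print(can_share_dimension(lon_a, lon_b))
```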

Array

  • write_array appears to be working just fine.
  • What about arrays of complex types?

pwest 16:43, 8 January 2008 (MST) - DAP allows for the array dimensions to not have names, but NetCDF does not allow this. If the dimension name is empty then create the dimension name using the name of the variable + "_dim" + dim_num. So, for example, if array a has three dimensions, and none have names, then the names will be a_dim1, a_dim2, a_dim3.
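The dimension-naming rule is easy to express directly. In this sketch an unnamed dimension is represented by an empty string, and the number appended is the 1-based position of the dimension within the array; the position-based numbering is inferred from the a_dim1, a_dim2, a_dim3 example, and the function name is hypothetical.

```python
from typing import List

def dimension_names(varname: str, dim_names: List[str]) -> List[str]:
    """Give every unnamed dimension a name of the form varname_dimN."""
    return [
        name if name else f"{varname}_dim{position}"
        for position, name in enumerate(dim_names, start=1)
    ]

print(dimension_names("a", ["", "", ""]))
print(dimension_names("a", ["time", "", ""]))
```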

Sequences

  • For now throw an exception. jimg 11:31, 21 December 2008 (PST) The initial version should elide these, I think, because there are important cases where they appear as part of a dataset but not the main part. We can represent these as arrays easily in the future.

jimg 11:39, 21 December 2008 (PST) To translate a Sequence, there are several cases to consider:

  1. A Sequence of simple types only (which means a one-level sequence): translate to a set of arrays using a name-prefix flattening scheme.
  2. A nested sequence (otherwise with only simple types) should first be flattened to a one level sequence and then that should be flattened.
  3. A Sequence with a Structure or Grid should be flattened by recursively applying the flattening logic to the components.
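The first two cases can be sketched as below, with a Sequence modeled as a list of row dicts. The dot used as the name prefix separator follows the Structure convention and is an assumption, as are the function names. Note that in this sketch an outer row whose inner sequence is empty produces no output rows, which may or may not be the desired behavior.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]

def flatten_nested(rows: List[Row], inner_field: str) -> List[Row]:
    """Case 2: turn a nested sequence into a one-level sequence.

    The outer fields are repeated once for every row of the inner sequence.
    """
    flat: List[Row] = []
    for row in rows:
        outer = {k: v for k, v in row.items() if k != inner_field}
        for inner in row[inner_field]:
            merged = dict(outer)
            merged.update({f"{inner_field}.{k}": v for k, v in inner.items()})
            flat.append(merged)
    return flat

def sequence_to_arrays(seq_name: str, rows: List[Row]) -> Dict[str, List[Any]]:
    """Case 1: turn a one-level sequence into one array per field."""
    columns: Dict[str, List[Any]] = {}
    for row in rows:
        for field_name, value in row.items():
            columns.setdefault(f"{seq_name}.{field_name}", []).append(value)
    return columns

casts = [
    {"station": 1, "cast": [{"depth": 0, "temp": 20.0}, {"depth": 10, "temp": 18.5}]},
    {"station": 2, "cast": [{"depth": 0, "temp": 21.0}]},
]
one_level = flatten_nested(casts, "cast")
arrays = sequence_to_arrays("obs", one_level)
print(arrays)
```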

Attributes

  • Global Attributes?
    • For single container DDS (no embedded structure) just write out the global attributes to the netcdf file
    • For multi-container DDS (multiple files each in an embedded Structure), take the global attributes from each of the containers and add them as global attributes to the target netcdf file. If the value already exists for the attribute then discard the value. If not then add the value to the attribute as attributes can have multiple values.
  • Variable Attributes
    • This is the way attributes should be stored in the DAS. In the entry class/structure there is a vector of strings. Each of these strings should contain one value for the attribute. If the attribute is a list of 10 int values then there will be 10 strings in the vector, each string representing one of the int values for the attribute.
    • What about attributes for structures? Should these attributes be created for each of the variables in the structure? So, if there is a structure a with variables v1 and v2 then the attributes for a will be attributes for a_v1 and a_v2? Or are there attributes for each of the variables in the structure? Or both. jimg 10:13, 6 January 2009 (PST) See above under the information about hierarchical types.
    • For multi-dimensional datasets there will be a structure for each container, and each of these containers will have global attributes. jimg 10:13, 6 January 2009 (PST) I don't understand this statement.
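The merge rule for global attributes in the multi-container case might look like the following sketch. Each container's attributes are modeled as a map from attribute name to a list of string values, mirroring the vector-of-strings storage described above; the function name is hypothetical.

```python
from typing import Dict, List

AttrTable = Dict[str, List[str]]

def merge_global_attributes(containers: List[AttrTable]) -> AttrTable:
    """Combine the global attributes of several containers into one table."""
    merged: AttrTable = {}
    for attrs in containers:
        for name, values in attrs.items():
            target = merged.setdefault(name, [])
            for value in values:
                # Discard a value the attribute already holds; otherwise append it,
                # since an attribute can have multiple values.
                if value not in target:
                    target.append(value)
    return merged

file_a = {"title": ["SST"], "history": ["created 2008"]}
file_b = {"title": ["SST"], "history": ["regridded"]}
merged = merge_global_attributes([file_a, file_b])
print(merged)
```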

Added attributes

pwest 14 January, 2009 - This feature will not be added as part of 1.5, but a future release.

After doing some kind of translation, whether with constraints, aggregation, file out, whatever, we need to add information to the resulting data product telling how we came about this result. Version of the software, version of the translation (file out), version of the aggregation engine, whatever. How do we do that?

The idea might be not to put all of this information in, say, the GLOBAL attributes section of the data product, or in the attributes of the OPeNDAP data product (DDX, DataDDX, whatever), but instead to include a URI pointing to this information. Perhaps provenance information for the different software components is stored at OPeNDAP. Perhaps the provenance information for this data product is stored locally and referenced in the data product, and this local provenance information in turn references the software component provenance.

http://www.opendap.org/provenance?id=xxxxxx

might be something referenced in the local provenance. The local provenance would keep track of:

  • containers used to generate the data product
  • constraints (server side functions, projections, etc...)
  • aggregation handler and command
  • data product requested
  • software component versions
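Purely as an illustration of what such a local provenance record could hold, here is a sketch using a Python dataclass serialized to JSON. Nothing here is a decided format: the class, the field names, the JSON encoding and all example values (file name, constraint, version string) are made up for the example.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Dict, List

@dataclass
class ProvenanceRecord:
    containers: List[str]      # containers used to generate the data product
    constraints: List[str]     # server side functions, projections, etc.
    aggregation: str           # aggregation handler and command, if any
    data_product: str          # data product requested
    component_versions: Dict[str, str] = field(default_factory=dict)

    def to_json(self) -> str:
        """Serialize the record with stable key order."""
        return json.dumps(asdict(self), sort_keys=True)

# All values below are invented placeholders.
record = ProvenanceRecord(
    containers=["example.nc"],
    constraints=["Point.x"],
    aggregation="",
    data_product="netcdf",
    component_versions={"fileout_netcdf": "0.0.0"},
)
print(record.to_json())
```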

Peter Fox mentions that we need to be careful of this sort of thing (storing provenance information locally) as this was tried with log information. Referencing this kind of information is dangerous.