BES - Modules - CSV Handler

== Kinds of files the handler will serve ==

This handler will serve Comma-Separated Values (CSV) data. For many kinds of files, only very minor modifications to the data files are needed. If you have very complex ASCII data (e.g., data with headers), take a look at the FreeForm handler, too.

=== Data file configuration ===

Given a simple CSV data file, such as would be written out by Excel, add a single line at the start that provides a name and OpenDAP datatype for each column. Just as the data values in a given row are separated by a comma, so are the column names and types. Here is a small example data file with the added name<type> configuration row.

"Station<String>","latitude<Float32>","longitude<Float32>","temperature_K<Float32>","Notes<String>"
"CMWM",-34.7,23.7,264.3,
"BWWJ",-34.2,21.5,262.1,"Foo"
"CWQK",-32.7,22.3,268.4,
"CRLM",-33.8,22.1,270.2,"Blah"
"FOOB",-32.9,23.4,269.69,"FOOBAR"
===== Supported OpenDAP datatypes =====

The CSV handler supports the following DAP2 simple types: Int16, Int32, Float32, Float64, String.
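For example, a configuration row for a file that uses the integer and double-precision types might look like the following (an illustrative sketch; the column names are made up for this example):
<pre>
"count<Int16>","elapsed_seconds<Int32>","mean_temperature<Float64>"
</pre>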

===== Dataset representation =====

The CSV handler represents the columns in the dataset as arrays with the named dimension ''record''. For example, the sample data shown above will be represented in DAP2 by this handler as:

<source lang="c">
Dataset {
    String Station[record = 5];
    Float32 latitude[record = 5];
    Float32 longitude[record = 5];
    Float32 temperature_K[record = 5];
    String Notes[record = 5];
} temperature.csv;
</source>

This is in contrast to the FreeForm handler that would represent these data as a Sequence with five columns.

For each column, the corresponding Array in the OpenDAP dataset has one attribute named ''type'' with a string value of ''Int16'', ..., ''String''. However, see below for information on how to add custom attributes to a dataset.
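For the sample file above, those automatically generated attributes would look something like this (a sketch based on the description just given, not captured handler output):
<source lang="c">
Attributes {
    Station {
        String type "String";
    }
    latitude {
        String type "Float32";
    }
    longitude {
        String type "Float32";
    }
    temperature_K {
        String type "Float32";
    }
    Notes {
        String type "String";
    }
}
</source>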

=== Known problems ===

There are no known problems.

== Configuration parameters ==

=== Configuring the handler ===

This handler has no specific configuration parameters.

=== Configuring Datasets ===

There are two ways to add custom attributes to a dataset. First, you can use an ''ancillary attribute'' file in the same directory as the dataset. Alternatively, you can use NcML to add new attributes, variables, etc. See [[BES_-_Modules_-_NcML_Module | the NcML Handler documentation]] for more information on that option. Here we will describe how to set up an ancillary attribute file.
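As a rough sketch of the NcML alternative, a small <tt>.ncml</tt> file that adds a global attribute to the sample dataset might look like the following (the attribute shown is illustrative, and the syntax details are best checked against the NcML Handler documentation):
<source lang="xml">
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2"
        location="temperature.csv">
    <!-- Add a global attribute to the wrapped CSV dataset -->
    <attribute name="Conventions" value="CF-1.6"/>
</netcdf>
</source>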

===== Simple attribute definitions =====

For any OpenDAP dataset, it is possible to write an ancillary attributes file like the following. If that file has the name ''dataset''.<tt>das</tt> then whenever Hyrax reads ''dataset'', it will also read those attributes, and return them when asked.

<source lang="c">
Attributes {
    Station {
        String bouy_type "flashing";
        Byte Age 53;
    }
    Global {
        String DateCompiled "11/17/98";
        String Conventions "CF-1.0", "CF-1.6";
    }
}
</source>

The format of this file is very simple: Each variable in the dataset may have a collection of attributes, each of which consists of a type, a name and one or more values. In the above example, the variable ''Station'' will have the additional attributes ''bouy_type'' and ''Age'' with the respective types and values. Note that datasets may also define global attributes - information about the dataset as a whole - by adding a section with a name that doesn't match the name of any variable in the dataset. In this example, I used ''Global'' for this (because it's obvious) but I could have used ''foo''. Also note the attribute ''Conventions'' has two values, ''CF-1.0'' and ''CF-1.6''.
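Assuming the sample data file above is served as <tt>temperature.csv</tt>, saving attributes like these in a file named <tt>temperature.csv.das</tt> in the same directory should be all that is needed for Hyrax to return them along with the dataset's other attributes.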