Revision as of 13:14, 26 January 2008

Using the Toolkit

This chapter describes how to use the toolkit software to build new client libraries and data servers. Before beginning to build either part of a new OPeNDAP application, it is very important to be thoroughly familiar with the details of the API to be replaced.

To create a client library that can replace the original API implementation at link time, the client library must present exactly the same interface as the original library. This includes, to the extent that they are widely used, any undocumented features of the original implementation that manifest themselves as symbols requiring link-time resolution. Building a client library therefore requires a thorough understanding of the existing implementation as well as of how the target API is currently used.

To build a good data server for files or data sets encoded using an API, it is important to understand the data model(s) the API supports and how they relate to the OPeNDAP data models. Each of the data types that the API supports must be translated into an OPeNDAP data type (i.e., one of the OPeNDAP classes that descend from BaseType). However, there is often no one-to-one match between the API's types and the OPeNDAP types. Thus, the data server author must decide how best to translate the API's types into OPeNDAP types so as to preserve as much of the data set author's intent as possible. The problem is exacerbated by conventions that (implicitly) bind several variables together within a data set. When this pattern shows up (as it does with NetCDF), you must decide whether to group all variables that appear to use the convention (and thus falsely group some variables) or to group only those that are explicitly grouped using whatever the API provides. If you choose the latter, any data sets that follow the convention will lose information. When building the data server, it is important to keep such tradeoffs in mind.

The following sections discuss the specifics of building a data server and a client library. The existing NetCDF server and client library are used as examples. Many APIs are very similar in their overall organization, so much of the NetCDF example will be relevant to your task even if your target API is significantly different. The source code used for these examples can be found in $(OPeNDAP_ROOT)/src/nc-dods/. The $(OPeNDAP_ROOT)/src/jg-dods/ directory contains both a data server and client library for the Joint Global Ocean Flux Study (JGOFS) relational data system.

Data Servers

The OPeNDAP data server consists of a dispatch program and a set of filter programs. The dispatch program reads the incoming URL and decides which of the filter programs to run based on the URL suffix.

A typical OPeNDAP data request uses three filters: one to return the DAS (.das), one for the DDS (.dds), and the third for the data (.dods). A client can also request ASCII data (.asc or .ascii), usage information about the server (.info), or version information about the server and the data (.ver).
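
The suffix-to-filter mapping described above can be pictured as a small lookup table. The following is only a sketch, not code from the toolkit: the function name filter_for_suffix is hypothetical, nc_dds is an assumed program name (nc_das, nc_dods, usage, and asciival are the names that appear later in this chapter), and .ver is left out because the text does not say which program answers it.

```cpp
#include <cassert>
#include <map>
#include <string>

// Return the filter program for a URL suffix (given without the dot).
// An empty result means the suffix is not in this illustrative table.
std::string filter_for_suffix(const std::string &suffix)
{
    static const std::map<std::string, std::string> table = {
        {"das", "nc_das"},      // attribute text
        {"dds", "nc_dds"},      // structure text (program name assumed)
        {"dods", "nc_dods"},    // binary data
        {"asc", "nc_dods"},     // data filter; output is piped to asciival
        {"ascii", "nc_dods"},
        {"info", "usage"}       // server and dataset documentation
    };
    std::map<std::string, std::string>::const_iterator it = table.find(suffix);
    return it == table.end() ? std::string() : it->second;
}

// Usage example.
void check_filter_for_suffix()
{
    assert(filter_for_suffix("das") == "nc_das");
    assert(filter_for_suffix("ascii") == "nc_dods");
    assert(filter_for_suffix("html").empty());
}
```

A real dispatch program would also have to decide what to do with an unrecognized suffix, for example by returning the help message.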

The task of building an OPeNDAP server can then be separated into the following steps:

Create concrete classes of the entire BaseType hierarchy, with read functions for each data type. Certain APIs cannot handle certain OPeNDAP types. For these types, there must still be a concrete class, but it can have a read method with a null body.
Write functions that use the native API to extract from the dataset the information needed to build the OPeNDAP DAS and DDS objects, and then build them with the methods those classes provide.
NOTE: This step has nothing at all to do with OPeNDAP. This is between you and your data. OPeNDAP makes no demands on how these structures are created. That is, for example, if all the data to be served has the same DDS, feel free to cheat. The only thing that is important is that the structures accurately reflect the relationships of the data.
Create filter programs to return the DAS, DDS, data, and server usage and version information.
Create a dispatch program to parse an incoming URL and invoke the correct filter program.
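
The first step, a concrete class for every type even when the native API cannot supply it, might look roughly like the sketch below. The classes are stand-ins invented for illustration (SketchType, SketchInt32, SketchUnsupported); they are not the toolkit's BaseType declarations, and the read signature shown here is an assumption.

```cpp
#include <cassert>
#include <string>

// Stand-in for the toolkit's abstract variable class. This is NOT the
// real BaseType declaration; only the shape of the idea is shown.
class SketchType {
public:
    SketchType() : loaded(false) {}
    virtual ~SketchType() {}
    // Fill this variable from the named dataset; true on success.
    virtual bool read(const std::string &dataset) = 0;
    bool loaded;
};

// A type the native API supports: read() would call the API here.
class SketchInt32 : public SketchType {
public:
    virtual bool read(const std::string &dataset)
    {
        if (dataset.empty())
            return false;
        loaded = true;   // placeholder for the native-API call
        return true;
    }
};

// A type the native API cannot represent. The class must still exist so
// that the hierarchy is complete, but read() has nothing to do.
class SketchUnsupported : public SketchType {
public:
    virtual bool read(const std::string &)
    {
        return false;    // null body: no data is ever read
    }
};

// Usage example.
void check_sketch_types()
{
    SketchInt32 i;
    SketchUnsupported u;
    assert(i.read("fnoc1.nc") && i.loaded);
    assert(!u.read("fnoc1.nc") && !u.loaded);
}
```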

To install the finished server, put the filter programs into a web server's CGI directory, and put the datasets to be served somewhere they can be seen by those filter programs. Refer to The OPeNDAP User Guide for more details about installing a server.

The Dispatch CGI

The OPeNDAP dispatch CGI program receives a data request from the OPeNDAP client, and dispatches the request to one of several filter programs. The dispatch CGI is stored in a CGI directory on the host machine. Its name is an important detail of its operation. The name should begin with nph-, and end with the letters that distinguish data files containing data formatted with that API from other files. For example, NetCDF data files are named foo.nc, so the NetCDF dispatch CGI is called nph-nc.

The dispatch CGI's job is to parse the incoming URL and execute the appropriate filter programs with the arguments enclosed in the URL. The dispatch CGI is also responsible for the first level of error information that must be returned to the user. These tasks are easily accomplished in any scripting language. If you wish to use Perl, OPeNDAP provides a Perl class designed to make writing the CGI a simple task.

The file OPeNDAP_Dispatch.pm contains the definition of the OPeNDAP_Dispatch class. This class provides several methods used to parse the incoming URL, and one method for delivering error messages to the client. The OPeNDAP_Dispatch class provides the following methods:

command(): Returns the command string implied by the input URL. The command string looks like:

: command filename -e query-string.

: where command is the OPeNDAP filter program to be run, filename is the absolute filename of the dataset on which to run it, and query-string is the constraint expression that was enclosed in the URL. Many dispatch CGI scripts need only this method and print_error_msg from the OPeNDAP_Dispatch class. See Figure 4.1.1.

query(): Returns the query string from the URL. This is the OPeNDAP constraint expression.

filename(): Returns the absolute filename corresponding to the requested dataset.

extension(): Returns the extension on the end of the URL. For OPeNDAP, this will be das, dds, dods, info, or ver.

cgi-dir(): Returns the absolute pathname of the directory in which the dispatch CGI is stored. This is generally the same as the directory in which the OPeNDAP filter programs are stored.

script(): Returns the name of the dispatch CGI, minus the nph-, and any suffixes used for a secure server.

print_error_message(ver): This returns an error message to the client, explaining how to use the server. The ver argument should be a string containing the version of the server software. The error message returned is encoded in the OPeNDAP_Dispatch.pm file.

print_help_message(): This returns a help message to the client. This can be issued in response to a confusing or inadequate URL. The help message returned is encoded in the OPeNDAP_Dispatch.pm file.
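
To make the parsing behind these methods concrete, here is a minimal C++ sketch of splitting a request path into dataset and extension and of assembling the command string in the form "command filename -e query-string". It is not the Perl class's implementation. The names Request, split_request, and command_line are hypothetical, the URL layout (suffix appended after the dataset name) and the /htdocs prefix are assumptions, and so is the choice to drop -e when there is no query.

```cpp
#include <cassert>
#include <string>

struct Request {
    std::string dataset;    // path with the final suffix removed
    std::string extension;  // das, dds, dods, info, ver, ...
};

// Split "/data/fnoc1.nc.dods" into "/data/fnoc1.nc" and "dods".
// A dot that belongs to a directory name is not treated as a suffix.
Request split_request(const std::string &path)
{
    Request r;
    std::string::size_type slash = path.rfind('/');
    std::string::size_type dot = path.rfind('.');
    bool has_suffix = dot != std::string::npos
        && (slash == std::string::npos || dot > slash);
    if (has_suffix) {
        r.dataset = path.substr(0, dot);
        r.extension = path.substr(dot + 1);
    } else {
        r.dataset = path;
    }
    return r;
}

// Assemble "command filename -e query-string".
std::string command_line(const std::string &filter,
                         const std::string &filename,
                         const std::string &query)
{
    std::string cmd = filter + " " + filename;
    if (!query.empty())
        cmd += " -e " + query;   // assumption: -e is dropped when no query
    return cmd;
}

// Usage example.
void check_dispatch_helpers()
{
    Request r = split_request("/data/fnoc1.nc.dods");
    assert(r.dataset == "/data/fnoc1.nc");
    assert(r.extension == "dods");
    assert(command_line("nc_dods", "/htdocs" + r.dataset, "u,v")
           == "nc_dods /htdocs/data/fnoc1.nc -e u,v");
}
```

Because the resulting string is handed to exec, a production script would also need to reject or quote shell metacharacters in the query before building the command.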

A sample (simple) OPeNDAP dispatch CGI is shown in Figure 4.1.1. This is a Perl script using the OPeNDAP_Dispatch methods. This script assumes that all data is rooted in the http document directory subtree.

#!/usr/local/bin/perl

use Env;
use OPeNDAP_Dispatch;

$dispatch = new OPeNDAP_Dispatch;

$command = $dispatch->command();

if ($command ne "") {           # if no error...
    exec($command);
} else {
    my $script_rev = '$Revision: 11906 $ ';
    $script_rev =~ s@\$([A-z]*): (.*) \$@$2@;

    $dispatch->print_error_msg($script_rev);
}

Figure 4.1.1 A simple OPeNDAP data server dispatch CGI.

The DAS and DDS filter programs

The simplest way to learn about creating a new filter program to return a dataset's DAS or DDS is to examine the existing filter programs. In this section, we will examine the NetCDF servers.

The source code for the DAS filter program distributed with the NetCDF server software is shown in Figure 4.1.2. The DAS and DDS filters are very similar, so only the DAS filter will be discussed here. The important differences between the two will be pointed out.

The CGI dispatch program makes heavy use of commonly used functions collected in the OPeNDAP_Dispatch class. In the same way, the OPeNDAPFilter class collects several commonly used functions for the construction of filter programs. The example program uses several methods of that class. Other useful utility functions are in the cgi-util collection.

The filter program in Figure 4.1.2 can be separated into the following steps:

line 16: Step 1: The OPeNDAPFilter class provides a constructor that parses the argument list to create the data. You can use the OK method to check that the list was parsed properly. Any errors here indicate a mistake in the dispatch CGI itself. This is why the print_usage function prints its message to the WWW server log file when it returns an error object to the client.

line 21: Step 2: If the user has only requested version information from the server, it is provided here.

line 26: Step 3: The read_variables function performs the real work of this program. This involves scanning the dataset itself for data variable attributes and using the DAS method functions to assemble the corresponding DAS. This operation is specific to the data access API in use, so does not make a good example.

line 29: Step 4: Each of the filter programs must create a MIME document to hold its return value. The DAS and DDS filters return a text MIME document; they set up the MIME headers using the utility function set_mime_text.

line 34: Step 5: Once the data set has been read and the attribute table built, the DAS ancillary file is loaded. The example filter looks for a file with the same root name as the data set and an extension of .das. If such a file exists, it is read in using the DAS member function DAS::parse and the information it contains is merged with the DAS built from the dataset.

line 37: Step 6: Finally the DAS member function print is used to send the textual representation of the DAS to the client. When it is invoked by the httpd daemon, the dispatch CGI's standard input and output are a socket connected to the remote client process. This means that since the filter is invoked by the dispatch script, its output goes directly to the client. The OPeNDAPFilter send_das method looks something like this:


bool
DODSFilter::send_das(DAS &das)
{
    set_mime_text(dods_das);
    das.print();

    return true;
}

#include <iostream.h>

#include "DAS.h"
#include "cgi_util.h"
#include "DODSFilter.h"

extern bool read_variables(DAS &das, 
        const char *filename, String *error);

int 
main(int argc, char *argv[])
{
    DAS das;
    DODSFilter df(argc, argv);

    if (!df.OK()) {
        df.print_usage();
        return 1;
    }

    if (df.version()) {
        df.send_version_info();
        return 0;
    }

    String errMsg;
    if(!read_variables(das, df.get_dataset_name(), &errMsg)){
      Error e(no_such_file, errMsg);
      set_mime_text(dods_error);
      e.print();
      return 1;
    }

    if (!df.read_ancillary_das(das))
        return 1;

    if (!df.send_das(das))
        return 1;

    return 0;
}

Figure 4.1.2 The DAS filter program.

Note that the example filter in Figure 4.1.2 does not use any caching. It is possible to build a more sophisticated filter program that saves the generated DAS to a text file and then uses that file without first interrogating the data set, thus saving on access. It is also possible to write a DAS by hand and always use it when the data set itself contains none of the kinds of information a DAS holds.

Caching DAS and DDS Objects

Because the construction of the DAS and DDS objects requires that an entire data set be scanned, it can become very inefficient to continually rebuild these objects. Because the DAS and DDS filter programs use a text representation for transmission from the server to the client, it is simple to store both the DAS and DDS objects once they have been created. Subsequent accesses to these objects can be accomplished by reading and transmitting the textual representation without actually building the binary data object.

When taking advantage of this optimization, it is important that the server check the date stamp of the DAS / DDS text objects and compare it to the latest modification date of the data set. For any dataset to which new data is periodically added, the DAS / DDS text object must clearly be updated so that the cached text object matches exactly the object that would be created if the object were built by querying the data set.
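
The date-stamp check can be sketched with the POSIX stat call. This is an illustration under stated assumptions rather than toolkit code: the function names are hypothetical, and the comparison is deliberately strict, so a cache written in the same second as the last dataset change is treated as stale and rebuilt.

```cpp
#include <cassert>
#include <ctime>
#include <string>
#include <sys/stat.h>

// True only when the cached text is strictly newer than the dataset.
// Equal timestamps count as stale because file times may have only
// one-second resolution.
bool cache_time_is_current(std::time_t cache_mtime, std::time_t dataset_mtime)
{
    return cache_mtime > dataset_mtime;
}

// Compare the modification times of the cached DAS/DDS text file and
// the dataset. A missing file on either side means "do not use the cache".
bool cached_text_is_current(const std::string &cache_file,
                            const std::string &dataset_file)
{
    struct stat cache_info;
    struct stat data_info;
    if (stat(cache_file.c_str(), &cache_info) != 0)
        return false;
    if (stat(dataset_file.c_str(), &data_info) != 0)
        return false;
    return cache_time_is_current(cache_info.st_mtime, data_info.st_mtime);
}

// Usage example.
void check_cache_times()
{
    assert(cache_time_is_current(200, 100));
    assert(!cache_time_is_current(100, 100));
    assert(!cached_text_is_current("no_such_cache.das", "no_such_data.nc"));
}
```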

The update of the DAS / DDS text object can itself be optimized significantly. It is not actually necessary to re-read the entire data set. Because the software used to build both the DAS and the DDS binary objects works incrementally, it is possible to read the text version of the DAS / DDS object and then read only the new parts of the data set. The binary object will be extended as needed.

NOTE: The DAS / DDS software may not properly update changed data (data that was present in a previous version of the data set, but is now different), nor is it straightforward to remove data which is no longer present in the data set. In these cases it is usually better to regenerate the DAS / DDS from scratch.

The Data filter

The data filter program is structured similarly to both the DAS and DDS filters, except that it returns a binary MIME document rather than text and takes two arguments instead of just one. In addition to the data set or file name (argument 1), it also takes the OPeNDAP constraint expression (argument 2, which was enclosed in the URL's query).

The NetCDF data filter is all but identical to the DDS filter. The only difference is that it calls the send_data method of OPeNDAPFilter to send the binary data over the network. This function calls the DDS send method.

If for some reason you cannot use the send member function of DDS, then you must ensure that the read, CE evaluation, and serialize operations are all carried out in the correct order. Furthermore, you must ensure that the return value of the data filter is a binary MIME document with a text prefix (currently, OPeNDAP does not use the multi-part MIME standard); that is, a regular binary MIME document with a section at the start that is text. This text is the DDS generated after evaluating the projection clauses of the constraint expression. The text part is separated from the data by the keyword "Data:" at the start of a line.
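
The layout of the data response (DDS text, a "Data:" line, then binary data) can be illustrated from the receiving side. The sketch below assumes that the MIME headers have already been removed and that the separator line is exactly "Data:" followed by a newline; both are assumptions, and DataResponse and split_data_response are hypothetical names rather than toolkit functions.

```cpp
#include <cassert>
#include <string>

struct DataResponse {
    bool ok;               // false when no "Data:" line was found
    std::string dds_text;  // DDS text that precedes the separator
    std::string payload;   // raw bytes that follow the separator line
};

// Split a response body at the first line consisting of "Data:".
DataResponse split_data_response(const std::string &body)
{
    DataResponse r;
    r.ok = false;
    const std::string key = "Data:";
    std::string::size_type pos = 0;
    while ((pos = body.find(key, pos)) != std::string::npos) {
        bool at_line_start = (pos == 0) || (body[pos - 1] == '\n');
        std::string::size_type eol = pos + key.size();
        if (at_line_start && eol < body.size() && body[eol] == '\n') {
            r.ok = true;
            r.dds_text = body.substr(0, pos);
            r.payload = body.substr(eol + 1);
            return r;
        }
        pos += key.size();   // not a separator line; keep looking
    }
    return r;
}

// Usage example: a one-variable DDS followed by four binary bytes.
void check_split_data_response()
{
    std::string body = "Dataset {\n    Int32 x;\n} d;\nData:\n";
    body += std::string("\x00\x00\x00\x2a", 4);
    DataResponse r = split_data_response(body);
    assert(r.ok);
    assert(r.dds_text == "Dataset {\n    Int32 x;\n} d;\n");
    assert(r.payload.size() == 4 && r.payload[3] == 0x2a);
}
```

A server that bypasses DDS send has to produce the mirror image of this: the projected DDS as text, the separator line, and only then the serialized values.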

The ASCII Data Filter

OPeNDAP is packaged with a filter to translate an OPeNDAP data stream into an ASCII data file. Clients can request ASCII data by appending .asc or .ascii to their URL instead of .dods. The asciival program is useful as a standalone client (see The OPeNDAP User Guide), but may also be used by a server to provide ASCII data.

A request for ASCII data is processed as any other request for data, but the final output of the data filter is piped into the asciival program and the result returned to the client:

nc_dods Data.nc | asciival -m -- -

The OPeNDAP_Dispatch class takes care of this step automatically, when it encounters a request using .asc or .ascii.

The Usage Filter

Client requests containing a .info suffix should return to the client HTML text containing documentation of both the server usage and the dataset named in the query. OPeNDAP provides a usage filter that can be used for this purpose. The OPeNDAP_Dispatch class invokes this filter.

The OPeNDAP-provided usage filter accepts two arguments, the data file name requested and the name of the CGI script (the dispatch CGI) in use:

usage filename CGI-name

The usage filter looks in the dataset directory for a file called filename.html, and in the directory specified in the CGI-name argument for a file called CGI-name.html. These two files must contain HTML, but without the html, head or body tags.

For example, suppose a dispatch CGI using the OPeNDAP_Dispatch class receives a URL like this:

http://dods/cgi-bin/nph-nc/data.info

In this case, the usage filter looks for two files: cgi-bin/nph-nc.html and data.html (the htdocs directory is assumed in the second case). The contents of these two files are concatenated with an HTML representation of the DAS and DDS for the data.nc file, and the whole thing is returned to the client. If the HTML files are not found, the returned document contains only the DAS and DDS.

Documenting Your Work

If you do write a server, and intend to circulate it beyond your own site, here are some guidelines for documenting that server that will help others use it.

Since there are two sets of "users" for a data server program, there are two sets of instructions that need to be prepared for a given server. One set will be read by the person who installs and maintains the server on the host platform. The other set is designed to be read by people who intend to request data from that server. These users will get this documentation by submitting queries to the Info Service, in rather the same way that many UNIX commands have a -usage option.

In addition to these two documents, all servers should include a set of text files in their distribution directory.

The README File

The README file should contain the following information:

A brief overview that describes the purpose and method of operation of the server.

The revision level of the server.

Any features the local httpd daemon must support to use this server.

Any data translations that this data server can do. If any are done, they should be described in detail, so that users can know what data they get.

The ERRORS File

The ERRORS file should contain a complete list of the error messages and explanations that might ever be issued by the server.

Installation Notes

These instructions should be included in a file called INSTALL which is to be included with the server distribution. At a minimum, they should cover the following topics:

Configuring and compiling the server code. Ideally, there should be a configure script included, but detailed instructions on editing the Makefile will often suffice. Remember to install the usage data file somewhere the server can find it.
Are there any environment variables that must be defined in order to run the server? Are there other programs (e.g., gzip) that must be installed on the host machine?
What configuration options are there for the installed server? This covers issues like enabling data compression, ancillary data caching, and choosing the GUI manager program with which the server will communicate. If there are performance trade-offs associated with each option, note them here.
Ancillary data files:
- Must the installer prepare ancillary data files by hand, or are these created automatically and cached?
- If they must be created, where ought they be put?
- If they are cached, where are they kept?
- Also, if the ancillary data files are cached, what implications are there for updating the data sets served by this server? (i.e. must the ancillary data files be updated also? Deleted and recreated?)
What temporary files will be created by the server? Where will they be stored? Under what conditions may (or must) they be erased?

Information Files

The information files contain the information that remote users of this server will use to figure out how to use this server and its datasets once it is installed somewhere. The files are used in constructing the HTML page for the info server. The .info results can include information about both the server and the current dataset. (In fact, the results will usually include the DAS and DDS of the dataset named in the URL.)

When a user appends .info to a URL, the info service is activated. This service collates information about the server and the dataset (from the DAS, DDS, lists of global attributes, and variable summaries), and assembles that information in an HTML document. The server then looks for additional HTML files created by the server's administrator, appends them to the original file, and returns the whole document to the client.

Although it is possible merely to rely on the collated data to describe a server, we hope that server writers will provide rich, human-friendly descriptions of the server's usage and the accompanying datasets. These files can be thought of as "usage" or "README" files. At a minimum, they should cover:

Any special data functions defined by the server that can be used in a constraint expression, and
Any data model translations the server supports, and how they are to be controlled by the user. (Remember that the "how" is to be answered very specifically, and on the user's level (i.e., "Do such-and-such, spelled like this, to make the array returned be nx5 instead of 5xn."), and not on the programmer's level (i.e., "You use the invert method to return an array of 5xn instead of nx5.").)
A list of the programs a user should have to use certain features of the server. For example, note here that the server expects that the GUI manager is running a Tcl interpreter.
A list of the error messages that the user is apt to see. Include explanations of the conditions that may have caused them, and any steps the user may take to recover from them.
The answers to any questions you are frequently asked about this server or its usage.

The usage data file need not be any more elaborate than any man page.

To create information for a server, write an HTML fragment using the format given below, and put the HTML file in the same directory as the server. Name it using the base name of the server; for example, the HTML file that describes the netCDF server (made up of nph-nc, nc_das, nc_dods, and so on) is called nc.html.

This example shows the correct HTML tagging for server information:

<h3>
Server Function:
</h3>
<dl>
<dt>geolocate(variable, lat1, lat2, lon1, lon2)</dt>
<dd>Returns the elements of <em>variable</em> that fall
within the box created by (<em>lat1</em>,<em>lon1</em>)
and (<em>lat2</em>,<em>lon2</em>).</dd>
<p>
<dt>time(variable, start_time, stop_time)</dt>
<dd>Returns the elements of <em>variable</em> that fall
within the time interval <em>start_time</em> and
<em>stop_time</em>.</dd>
</dl>
<p>

For datasets, put the HTML file, tagged using the format given below, in the same directory as the datasets. Name it using the base name of the datasets; for example, the HTML file for fnoc1.nc, fnoc2.nc, and fnoc3.nc might be called fnoc.html. This example shows the correct HTML for a dataset information file:

<h3>
About the dataset
</h3>
This is where the server administrator would supply
information about the dataset.  And so on...
<p>

You may prefer to override this method of creating documentation and simply provide a single, complete HTML document that contains general information for the server or for a group of datasets. For example, to force the info server to return a particular HTML document for all its datasets, you would create a complete HTML document and give it the name dataset.ovr, where dataset is the dataset name. The HTML file in this case would look like this:


<html>
<head>
<title>Override document</title>
</head>
<body>
Test dataset

This is where the server administrator would supply
information about the dataset(s) and what-have-you.
</body>
</html>

Remember to ensure that the installation instructions cover installing the usage data file in a place where the server can find it.
