RDH: Mapping the ODBC data model to the DAP2 data model

From OPeNDAP Documentation
⧼opendap2-jumptonavigation⧽

This article is intended to provide the details of the type mapping activities required to build a functional Relational Database Handler (RDH) for the BES.


Background

The RDH is a BES handler plug-in that will allow users (data providers) to expose database tables and views through the BES interface (and then through Hyrax). This process is reliant on having a functional mapping between the data model of the RDBMS world and the DAP2 data model.

In the RDH, DAP2 DDX/DDS/DAS documents will be generated by using the introspection methods of the ODBC API to query the Data Source.

(We may later wish to use a configuration mechanism to identify which table and/or views that are meant to be served, or to instruct the server to serve all of the tables and/or views that it can discover using introspection.)


Definitions

Row Sets
In the world of the RDBMS everything is structurally a row set: A collection (tuple) of columns (variables) whose length my not be known until the content is delivered in total. Tables are row sets. A database view is a row set. SQL queries return row sets. SQL JOIN operations take row sets' as input and produce row sets.
ODBC Data Sources
An ODBC Data Source is a predefined connection to an RDBMS database instance. They defined at some system level on the (virtual) machine in which the BES is running.

Relational Database Data Types

Accessing data in an RDBMS is often a many layered affair. Many RDBMS implementations provide a native API library that can be used with a common programming language such as C, C++, C#, Visual Basic, or even Java. These API's interact directly with the RDBMS and provide the maximum level of flexibility, control and access. If one was using a native API to work with a RDBMS to get data then there would be a mapping from the native RDBMS data types to the language data types:


Native RDBMS Data Types ---> C++ Data Types


However, most software that works with RDMSs does so through a database connectivity API. The two most common are the Open Database Connectivity (ODBC) API and the Java Database Connectivity (JDBC) API. Using one of these database connectivity APIs adds a layer of data type mapping to the process:


Native RDBMS Data Types ---> Database Connectivity API Data Types ---> C++ Data Types


To complete the picture, generating DAP data objects requires the finally mapping from the programming language data types in to the DAP data model:


Native RDBMS Data Types ---> Database Connectivity API Data Types ---> C++ Data Types ---> DAP Data Types


For simple/atomic types such as integers an booleans this doesn't require much thought. For more complex types it may require both thought ad careful development to preserve the original semantics of the RDBMS data type across the various mappings and into the appropriate DAP data type. Even then the DAP data model may not have a data type that directly captures the semantics of the original type. For example PostgreSQL has a number of geometric types (such as point, box, and circle) whose content can be held in DAP types, but the semantic meaning (these two floating point values represent a point on a cartesian plain) is not explicit in the DAP object holding the values. That information can only be included in the metadata associated with the instance of the DAP variable.

SQL/ODBC to DAP2 Data Model Mapping

Row sets map naturally to the DAP Sequence data type, thus a reasonable representation of an SQL query is a Sequence: A repeating tuple of unknown length.

Each ODBC Data Source will become a DAP2 Dataset. Each table/view will be mapped to a DAP2 Sequence type. Each column variable in each table/view will be mapped to the appropriate simple DAP2 data type or array type.

Dataset {
    Sequence {
        String FirstName;
        String LastName;
        String PhoneNumber;
    } Authors;
 } AuthorsDB

Multiple tables/views become multiple Sequences

When a ODBC Data Source if found to contain multiple tables/views, each one will be represented as a DAP Sequence. Thus a DDS might contain multiple Sequences each representing a single table or view in the RDBMS:

Dataset {
    Sequence {
        String FirstName;
        String LastName;
        String PhoneNumber;
    } Authors;

    Sequence {
        String Title;
        String Publisher;
        Int16  Pages;
        String   CopyrightDate;
        String   ISBN;
    } Books;

} InventoryDB;

Atomic (Simple) Type Mappings

SQL/ODBC Type DAP2 Type
SQL_TINYINT DByte
SQL_SMALLINT DInt16
SQL_INTEGER DInt32
SQL_BIGINT DInt32 **NO SENSIBLE MAPPING (Need DInt64)
SQL_REAL DFloat32
SQL_FLOAT DFloat64
SQL_DOUBLE DFloat64
SQL_DECIMAL DFloat64 **NO SENSIBLE MAPPING (Need Some Kind Monsterous Floating point value)
SQL_NUMERIC DFloat64 **NO SENSIBLE MAPPING (ibid)
SQL_BIT DBoolean
SQL_CHAR DString
SQL_VARCHAR DString
SQL_LONGVARCHAR Should this be implemented to be read into a DString? Or is this really a "BLOB" type and therefore better represented as a DArray(of bytes).
SQL_BINARY DArray(of bytes)
SQL_VARBINARY DArray(of bytes)
SQL_LONGVARBINARY DArray(of bytes)
SQL_DATE DString
SQL_TIME DString
SQL_TIMESTAMP DString














Appendix

Database Connectivity API Mappings




ODBC Data Types

SQL/ODBC Type DAP2 Type
SQL_TINYINT DByte
SQL_SMALLINT DInt16
SQL_INTEGER DInt32
SQL_BIGINT DInt32 **NO SENSIBLE MAPPING (Need DInt64)
SQL_REAL DFloat32
SQL_FLOAT DFloat64
SQL_DOUBLE DFloat64
SQL_DECIMAL DFloat64 **NO SENSIBLE MAPPING (Need Some Kind Monsterous Floating point value)
SQL_NUMERIC DFloat64 **NO SENSIBLE MAPPING (ibid)
SQL_BIT DBoolean
SQL_CHAR DString
SQL_VARCHAR DString
SQL_LONGVARCHAR Implemented to be read into a DString, although it is a "BLOB" type and might be better represented as a DArray(of bytes).
SQL_BINARY DArray(of bytes)
SQL_VARBINARY DArray(of bytes)
SQL_LONGVARBINARY DArray(of bytes)
SQL_DATE DString
SQL_TIME DString
SQL_TIMESTAMP DString





JDBC Data Types

The DRDS used a JDBC connection to retrieve data from a relational database. It is important to note that their are several layers of type translation happening in this:

Database -> JDBC -> Java -> DODS

The Database types are the native types for the particular database that is being read from. The translation from Database->JDBC was handled before the data arrived at the DRDS (most likely by the JDBC Drivers). The mapping of JDBC type to DODS types (the intermediate Java types happen in the process) looked like this:

JDBC Type DAP Type
TINYINT DByte
SMALLINT DInt16
INTEGER DInt32
BIGINT DInt32 **NO SENSIBLE MAPPING (Need DInt64)
REAL DFloat32
FLOAT DFloat64
DOUBLE DFloat64
DECIMAL DFloat64 **NO SENSIBLE MAPPING (Need Some Kind Monsterous Floating point value)
NUMERIC DFloat64 **NO SENSIBLE MAPPING (ibid)
BIT DBoolean
CHAR DString
VARCHAR DString
LONGVARCHAR Implemented to be read into a DString, although it is a "BLOB" type and might be better represented as a DArray(of bytes).
BINARY DArray(of bytes)
VARBINARY DArray(of bytes)
LONGVARBINARY DArray(of bytes)
DATE DString
TIME DString
TIMESTAMP DString




Native SQL Data Types by Implementation




PostgreSQL Data Types

Todo:
 !! This section needs to be updated and otherwise corrected. !!
Boolean and Binary Types
Data Type Description Standardization Logical DAP data type association.
boolean, bool A single true or false value. SQL99 Boolean
bit(n) An n -length bit string (exactly n binary bits). SQL92 None
bit varying(n), varbit(n) A variable n -length bit string (up to n binary bits) SQL92 None
Character Types
Data Type Description Storage Standardization Logical DAP data type association.
character (n ), char(n ) A fixed n -length character string. (4+n) bytes SQL89 String
character varying(n), varchar(n) A variable length character string of up to n characters. Up to (4+n) bytes SQL92 String
text A variable length character string, of unlimited length. Variable PostgreSQL-specific String
Numeric Types
Data Type Description Storage Standardization Logical DAP data type association.
smallint, int2 A signed 2-byte integer 2 bytes SQL89 Int16
integer, int, int4 A signed, fixed-precision 4-byte number. 4 bytes SQL92 Int32
bigint, int8 A signed 8-byte integer, up to 18 digits in length. 8 bytes PostgreSQL-specific None (Need Int64)
real, float4 A 4-byte floating point number. 4 bytes SQL89 Float32
double precision, float8, float An 8-byte floating point number 8 bytes SQL89 Float64
numeric(p,s), decimal(p,s) An exact numeric type with arbitrary precision p, and scale s. Variable SQL99 None
money A fixed precision, U.S.-style currency. 4 bytes PostgreSQL-specific, deprecated. None
serial An auto-incrementing 4-byte integer. 4 bytes PostgreSQL-specific. None
Date and Time Types
Data Type Description Storage Standardization Logical DAP data type association.
date A calendar date (day, month, year). 4 bytes SQL92 String
time The time of day. 4 bytes SQL92 String
time with time zone The time of day, including time zone information. 4 bytes SQL92 String
timestamp (includes time zone) Both date and time. 8 bytes SQL92 String
interval An arbitrary specified length of time 12 bytes SQL92 String
Geometric Types
Data Type Description Storage Standardization Logical DAP data type association.
box A rectangular box in a 2D plane. 32 bytes PostgreSQL-specific None ([2][2] Array of Float32)
line An infinite line in a 2D plane. ?? bytes PostgreSQL-specific None ([2][2] Array of Float32)
lineseg A finite line segment in a 2D plane. 32 bytes PostgreSQL-specific None ([2][2] Array of Float32)
circle A circle with center and radius. 24 bytes PostgreSQL-specific None ([3]Array of Float32)
path Open and closed geometric paths in a two-dimensional plane . 4+32*n bytes PostgreSQL-specific None (Array of Float32)
point geometric point in a 2D plane 16 bytes PostgreSQL-specific None ([2] Array of Float32)
polygon A closed geometric path in a 2D plane 4+32*n bytes PostgreSQL-specific None (Array of Float32)
Network Types
Data Type Description Standardization Logical DAP data type association.
cdir An IP network specification PostgreSQL-specific None
inet A network IP address, with optional subnet bits. PostgreSQL-specific None
macaddr A MAC address (e.g., an Ethernet card's hardware address). PostgreSQL-specific None
System Types
Data Type Description Standardization Logical DAP data type association.
oid An object (row) identifier. PostgreSQL-specific None
xid A transaction identifier PostgreSQL-specific None





Transact-SQL (Microsoft)

Todo:
 !! This section needs to be updated and otherwise corrected. !!
Type Description Storage Bytes DAP equiv.
bigint Integer (whole number) data from -2^63 (-9,223,372,036,854,775,808) through 2^63-1 (9,223,372,036,854,775,807). 8 bytes none
int Integer (whole number) data from -2^31 (-2,147,483,648) through 2^31 - 1 (2,147,483,647). 4 bytes
numeric Functionally equivalent to decimal.
decimal Fixed precision and scale numeric data from -10^38 +1 through 10^38 –1.
bit Integer data with either a 1 or 0 value
smallint Integer data from -2^15 (-32,768) through 2^15 - 1 (32,767). 2 bytes
tinyint Integer data from 0 through 255 1 bytes
smallmoney Monetary data values from -214,748.3648 through +214,748.3647, with accuracy to a ten-thousandth of a monetary unit.
money Monetary data values from -2^63 (-922,337,203,685,477.5808) through 2^63 - 1 (+922,337,203,685,477.5807), with accuracy to a ten-thousandth of a monetary unit.
float Floating precision number data with the following valid values: -1.79E + 308 through -2.23E - 308, 0 and 2.23E + 308 through 1.79E + 308. 4 or 8 bytes
real Floating precision number data with the following valid values: -3.40E + 38 through -1.18E - 38, 0 and 1.18E - 38 through 3.40E + 38. 4 bytes
date
datetimeoffset
datetime2
smalldatetime Date and time data from January 1, 1900, through June 6, 2079, with an accuracy of one minute.
datetime Date and time data from January 1, 1753, through December 31, 9999, with an accuracy of three-hundredths of a second, or 3.33 milliseconds.
time
char Fixed-length non-Unicode character data with a maximum length of 8,000 characters.
varchar Variable-length non-Unicode data with a maximum of 8,000 characters.
next Variable-length non-Unicode data with a maximum length of 2^31 - 1 (2,147,483,647) characters
nchar Fixed-length Unicode data with a maximum length of 4,000 characters
nvarchar Variable-length Unicode data with a maximum length of 4,000 characters. sysname is a system-supplied user-defined data type that is functionally equivalent to nvarchar(128) and is used to reference database object names.
ntext Variable-length Unicode data with a maximum length of 2^30 - 1 (1,073,741,823) characters.
binary Fixed-length binary data with a maximum length of 8,000 bytes.
varbinary Variable-length binary data with a maximum length of 8,000 bytes.
image Variable-length binary data with a maximum length of 2^31 - 1 (2,147,483,647) bytes.
cursor A reference to a cursor.
timestamp A database-wide unique number that gets updated every time a row gets updated.
hierarchyid
uniquieidentifier A globally unique identifier (GUID).
sql_variant
xml
table