RDH: Mapping the ODBC data model to the DAP2 data model: Difference between revisions
(21 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
This article is intended to provide the details of the type mapping activities required to build a functional Relational Database Handler (RDH) for the BES. | |||
The DDX/DDS/DAS documents will be generated by introspection of the | == Background == | ||
The RDH is a BES handler plug-in that will allow users (data providers) to expose database tables and views through the BES interface (and then through Hyrax). This process is reliant on having a functional mapping between the data model of the RDBMS world and the DAP2 data model. | |||
In the RDH, DAP2 DDX/DDS/DAS documents will be generated by using the introspection methods of the ODBC API to query the Data Source. | |||
(We may later wish to use a configuration mechanism to identify which table and/or views that are meant to be served, or to instruct the server to serve all of the tables and/or views that it can discover using introspection.) | (We may later wish to use a configuration mechanism to identify which table and/or views that are meant to be served, or to instruct the server to serve all of the tables and/or views that it can discover using introspection.) | ||
== Definitions == | == Definitions == | ||
Line 18: | Line 20: | ||
: An ODBC Data Source is a predefined connection to an RDBMS database instance. They defined at some system level on the (virtual) machine in which the BES is running. | : An ODBC Data Source is a predefined connection to an RDBMS database instance. They defined at some system level on the (virtual) machine in which the BES is running. | ||
== Model | == Relational Database Data Types == | ||
Accessing data in an RDBMS is often a many layered affair. Many RDBMS implementations provide a native API library that can be used with a common programming language such as C, C++, C#, Visual Basic, or even Java. These API's interact directly with the RDBMS and provide the maximum level of flexibility, control and access. If one was using a native API to work with a RDBMS to get data then there would be a mapping from the native RDBMS data types to the language data types: | |||
:''Native RDBMS Data Types'' '''--->''' ''C++ Data Types'' | |||
However, most software that works with RDMSs does so through a database connectivity API. The two most common are the Open Database Connectivity (ODBC) API and the Java Database Connectivity (JDBC) API. Using one of these database connectivity APIs adds a layer of data type mapping to the process: | |||
:''Native RDBMS Data Types'' '''--->''' ''Database Connectivity API Data Types'' '''--->''' ''C++ Data Types'' | |||
To complete the picture, generating DAP data objects requires the finally mapping from the programming language data types in to the DAP data model: | |||
:''Native RDBMS Data Types'' '''--->''' ''Database Connectivity API Data Types'' '''--->''' ''C++ Data Types'' '''--->''' ''DAP Data Types'' | |||
For simple/atomic types such as integers an booleans this doesn't require much thought. For more complex types it may require both thought ad careful development to preserve the original semantics of the RDBMS data type across the various mappings and into the appropriate DAP data type. Even then the DAP data model may not have a data type that directly captures the semantics of the original type. For example PostgreSQL has a number of geometric types (such as ''point'', ''box'', and ''circle'') whose content can be held in DAP types, but the semantic meaning (''these two floating point values represent a point on a cartesian plain'') is not explicit in the DAP object holding the values. That information can only be included in the metadata associated with the instance of the DAP variable. | |||
== SQL/ODBC to DAP2 Data Model Mapping == | |||
''Row sets'' map naturally to the DAP Sequence data type, thus a reasonable representation of an SQL query is a Sequence: A repeating tuple of unknown length. | ''Row sets'' map naturally to the DAP Sequence data type, thus a reasonable representation of an SQL query is a Sequence: A repeating tuple of unknown length. | ||
Line 82: | Line 106: | ||
| SQL_VARCHAR || DString | | SQL_VARCHAR || DString | ||
|- | |- | ||
| SQL_LONGVARCHAR || | | SQL_LONGVARCHAR || <font color="red"> Should this be implemented to be read into a DString? Or is this really a "BLOB" type and therefore better represented as a DArray(of bytes). </font> | ||
|- | |- | ||
| SQL_BINARY || DArray(of bytes) | | SQL_BINARY || DArray(of bytes) | ||
Line 99: | Line 123: | ||
---- | |||
---- | |||
---- | |||
---- | |||
---- | |||
---- | |||
==Appendix== | |||
=== Database Connectivity API Mappings === | |||
<br/> | |||
=== Database Connectivity | |||
---- | ---- | ||
---- | ---- | ||
Line 226: | Line 243: | ||
---- | ---- | ||
---- | ---- | ||
<br/> | |||
<br/> | |||
=== Native SQL Data Types by Implementation === | === Native SQL Data Types by Implementation === | ||
<br/> | |||
---- | ---- | ||
---- | ---- | ||
Line 409: | Line 427: | ||
| table || || || | | table || || || | ||
|} | |} | ||
Latest revision as of 20:46, 18 May 2009
This article is intended to provide the details of the type mapping activities required to build a functional Relational Database Handler (RDH) for the BES.
Background
The RDH is a BES handler plug-in that will allow users (data providers) to expose database tables and views through the BES interface (and then through Hyrax). This process is reliant on having a functional mapping between the data model of the RDBMS world and the DAP2 data model.
In the RDH, DAP2 DDX/DDS/DAS documents will be generated by using the introspection methods of the ODBC API to query the Data Source.
(We may later wish to use a configuration mechanism to identify which table and/or views that are meant to be served, or to instruct the server to serve all of the tables and/or views that it can discover using introspection.)
Definitions
- Row Sets
- In the world of the RDBMS everything is structurally a row set: A collection (tuple) of columns (variables) whose length my not be known until the content is delivered in total. Tables are row sets. A database view is a row set. SQL queries return row sets. SQL JOIN operations take row sets' as input and produce row sets.
- ODBC Data Sources
- An ODBC Data Source is a predefined connection to an RDBMS database instance. They defined at some system level on the (virtual) machine in which the BES is running.
Relational Database Data Types
Accessing data in an RDBMS is often a many layered affair. Many RDBMS implementations provide a native API library that can be used with a common programming language such as C, C++, C#, Visual Basic, or even Java. These API's interact directly with the RDBMS and provide the maximum level of flexibility, control and access. If one was using a native API to work with a RDBMS to get data then there would be a mapping from the native RDBMS data types to the language data types:
- Native RDBMS Data Types ---> C++ Data Types
However, most software that works with RDMSs does so through a database connectivity API. The two most common are the Open Database Connectivity (ODBC) API and the Java Database Connectivity (JDBC) API. Using one of these database connectivity APIs adds a layer of data type mapping to the process:
- Native RDBMS Data Types ---> Database Connectivity API Data Types ---> C++ Data Types
To complete the picture, generating DAP data objects requires the finally mapping from the programming language data types in to the DAP data model:
- Native RDBMS Data Types ---> Database Connectivity API Data Types ---> C++ Data Types ---> DAP Data Types
For simple/atomic types such as integers an booleans this doesn't require much thought. For more complex types it may require both thought ad careful development to preserve the original semantics of the RDBMS data type across the various mappings and into the appropriate DAP data type. Even then the DAP data model may not have a data type that directly captures the semantics of the original type. For example PostgreSQL has a number of geometric types (such as point, box, and circle) whose content can be held in DAP types, but the semantic meaning (these two floating point values represent a point on a cartesian plain) is not explicit in the DAP object holding the values. That information can only be included in the metadata associated with the instance of the DAP variable.
SQL/ODBC to DAP2 Data Model Mapping
Row sets map naturally to the DAP Sequence data type, thus a reasonable representation of an SQL query is a Sequence: A repeating tuple of unknown length.
Each ODBC Data Source will become a DAP2 Dataset. Each table/view will be mapped to a DAP2 Sequence type. Each column variable in each table/view will be mapped to the appropriate simple DAP2 data type or array type.
Dataset { Sequence { String FirstName; String LastName; String PhoneNumber; } Authors; } AuthorsDB
Multiple tables/views become multiple Sequences
When a ODBC Data Source if found to contain multiple tables/views, each one will be represented as a DAP Sequence. Thus a DDS might contain multiple Sequences each representing a single table or view in the RDBMS:
Dataset { Sequence { String FirstName; String LastName; String PhoneNumber; } Authors; Sequence { String Title; String Publisher; Int16 Pages; String CopyrightDate; String ISBN; } Books; } InventoryDB;
Atomic (Simple) Type Mappings
SQL/ODBC Type | DAP2 Type |
---|---|
SQL_TINYINT | DByte |
SQL_SMALLINT | DInt16 |
SQL_INTEGER | DInt32 |
SQL_BIGINT | DInt32 **NO SENSIBLE MAPPING (Need DInt64) |
SQL_REAL | DFloat32 |
SQL_FLOAT | DFloat64 |
SQL_DOUBLE | DFloat64 |
SQL_DECIMAL | DFloat64 **NO SENSIBLE MAPPING (Need Some Kind Monsterous Floating point value) |
SQL_NUMERIC | DFloat64 **NO SENSIBLE MAPPING (ibid) |
SQL_BIT | DBoolean |
SQL_CHAR | DString |
SQL_VARCHAR | DString |
SQL_LONGVARCHAR | Should this be implemented to be read into a DString? Or is this really a "BLOB" type and therefore better represented as a DArray(of bytes). |
SQL_BINARY | DArray(of bytes) |
SQL_VARBINARY | DArray(of bytes) |
SQL_LONGVARBINARY | DArray(of bytes) |
SQL_DATE | DString |
SQL_TIME | DString |
SQL_TIMESTAMP | DString |
Appendix
Database Connectivity API Mappings
ODBC Data Types
SQL/ODBC Type | DAP2 Type |
---|---|
SQL_TINYINT | DByte |
SQL_SMALLINT | DInt16 |
SQL_INTEGER | DInt32 |
SQL_BIGINT | DInt32 **NO SENSIBLE MAPPING (Need DInt64) |
SQL_REAL | DFloat32 |
SQL_FLOAT | DFloat64 |
SQL_DOUBLE | DFloat64 |
SQL_DECIMAL | DFloat64 **NO SENSIBLE MAPPING (Need Some Kind Monsterous Floating point value) |
SQL_NUMERIC | DFloat64 **NO SENSIBLE MAPPING (ibid) |
SQL_BIT | DBoolean |
SQL_CHAR | DString |
SQL_VARCHAR | DString |
SQL_LONGVARCHAR | Implemented to be read into a DString, although it is a "BLOB" type and might be better represented as a DArray(of bytes). |
SQL_BINARY | DArray(of bytes) |
SQL_VARBINARY | DArray(of bytes) |
SQL_LONGVARBINARY | DArray(of bytes) |
SQL_DATE | DString |
SQL_TIME | DString |
SQL_TIMESTAMP | DString |
JDBC Data Types
The DRDS used a JDBC connection to retrieve data from a relational database. It is important to note that their are several layers of type translation happening in this:
- Database -> JDBC -> Java -> DODS
The Database types are the native types for the particular database that is being read from. The translation from Database->JDBC was handled before the data arrived at the DRDS (most likely by the JDBC Drivers). The mapping of JDBC type to DODS types (the intermediate Java types happen in the process) looked like this:
JDBC Type | DAP Type |
---|---|
TINYINT | DByte |
SMALLINT | DInt16 |
INTEGER | DInt32 |
BIGINT | DInt32 **NO SENSIBLE MAPPING (Need DInt64) |
REAL | DFloat32 |
FLOAT | DFloat64 |
DOUBLE | DFloat64 |
DECIMAL | DFloat64 **NO SENSIBLE MAPPING (Need Some Kind Monsterous Floating point value) |
NUMERIC | DFloat64 **NO SENSIBLE MAPPING (ibid) |
BIT | DBoolean |
CHAR | DString |
VARCHAR | DString |
LONGVARCHAR | Implemented to be read into a DString, although it is a "BLOB" type and might be better represented as a DArray(of bytes). |
BINARY | DArray(of bytes) |
VARBINARY | DArray(of bytes) |
LONGVARBINARY | DArray(of bytes) |
DATE | DString |
TIME | DString |
TIMESTAMP | DString |
Native SQL Data Types by Implementation
PostgreSQL Data Types
- Todo:
- !! This section needs to be updated and otherwise corrected. !!
Boolean and Binary Types
Data Type | Description | Standardization | Logical DAP data type association. |
---|---|---|---|
boolean, bool | A single true or false value. | SQL99 | Boolean |
bit(n) | An n -length bit string (exactly n binary bits). | SQL92 | None |
bit varying(n), varbit(n) | A variable n -length bit string (up to n binary bits) | SQL92 | None |
Character Types
Data Type | Description | Storage | Standardization | Logical DAP data type association. |
---|---|---|---|---|
character (n ), char(n ) | A fixed n -length character string. | (4+n) bytes | SQL89 | String |
character varying(n), varchar(n) | A variable length character string of up to n characters. | Up to (4+n) bytes | SQL92 | String |
text | A variable length character string, of unlimited length. | Variable | PostgreSQL-specific | String |
Numeric Types
Data Type | Description | Storage | Standardization | Logical DAP data type association. |
---|---|---|---|---|
smallint, int2 | A signed 2-byte integer | 2 bytes | SQL89 | Int16 |
integer, int, int4 | A signed, fixed-precision 4-byte number. | 4 bytes | SQL92 | Int32 |
bigint, int8 | A signed 8-byte integer, up to 18 digits in length. | 8 bytes | PostgreSQL-specific | None (Need Int64) |
real, float4 | A 4-byte floating point number. | 4 bytes | SQL89 | Float32 |
double precision, float8, float | An 8-byte floating point number | 8 bytes | SQL89 | Float64 |
numeric(p,s), decimal(p,s) | An exact numeric type with arbitrary precision p, and scale s. | Variable | SQL99 | None |
money | A fixed precision, U.S.-style currency. | 4 bytes | PostgreSQL-specific, deprecated. | None |
serial | An auto-incrementing 4-byte integer. | 4 bytes | PostgreSQL-specific. | None |
Date and Time Types
Data Type | Description | Storage | Standardization | Logical DAP data type association. |
---|---|---|---|---|
date | A calendar date (day, month, year). | 4 bytes | SQL92 | String |
time | The time of day. | 4 bytes | SQL92 | String |
time with time zone | The time of day, including time zone information. | 4 bytes | SQL92 | String |
timestamp (includes time zone) | Both date and time. | 8 bytes | SQL92 | String |
interval | An arbitrary specified length of time | 12 bytes | SQL92 | String |
Geometric Types
Data Type | Description | Storage | Standardization | Logical DAP data type association. |
---|---|---|---|---|
box | A rectangular box in a 2D plane. | 32 bytes | PostgreSQL-specific | None ([2][2] Array of Float32) |
line | An infinite line in a 2D plane. | ?? bytes | PostgreSQL-specific | None ([2][2] Array of Float32) |
lineseg | A finite line segment in a 2D plane. | 32 bytes | PostgreSQL-specific | None ([2][2] Array of Float32) |
circle | A circle with center and radius. | 24 bytes | PostgreSQL-specific | None ([3]Array of Float32) |
path | Open and closed geometric paths in a two-dimensional plane . | 4+32*n bytes | PostgreSQL-specific | None (Array of Float32) |
point | geometric point in a 2D plane | 16 bytes | PostgreSQL-specific | None ([2] Array of Float32) |
polygon | A closed geometric path in a 2D plane | 4+32*n bytes | PostgreSQL-specific | None (Array of Float32) |
Network Types
Data Type | Description | Standardization | Logical DAP data type association. |
---|---|---|---|
cdir | An IP network specification | PostgreSQL-specific | None |
inet | A network IP address, with optional subnet bits. | PostgreSQL-specific | None |
macaddr | A MAC address (e.g., an Ethernet card's hardware address). | PostgreSQL-specific | None |
System Types
Data Type | Description | Standardization | Logical DAP data type association. |
---|---|---|---|
oid | An object (row) identifier. | PostgreSQL-specific | None |
xid | A transaction identifier | PostgreSQL-specific | None |
Transact-SQL (Microsoft)
- Todo:
- !! This section needs to be updated and otherwise corrected. !!
Type | Description | Storage Bytes | DAP equiv. |
---|---|---|---|
bigint | Integer (whole number) data from -2^63 (-9,223,372,036,854,775,808) through 2^63-1 (9,223,372,036,854,775,807). | 8 bytes | none |
int | Integer (whole number) data from -2^31 (-2,147,483,648) through 2^31 - 1 (2,147,483,647). | 4 bytes | |
numeric | Functionally equivalent to decimal. | ||
decimal | Fixed precision and scale numeric data from -10^38 +1 through 10^38 –1. | ||
bit | Integer data with either a 1 or 0 value | ||
smallint | Integer data from -2^15 (-32,768) through 2^15 - 1 (32,767). | 2 bytes | |
tinyint | Integer data from 0 through 255 | 1 bytes | |
smallmoney | Monetary data values from -214,748.3648 through +214,748.3647, with accuracy to a ten-thousandth of a monetary unit. | ||
money | Monetary data values from -2^63 (-922,337,203,685,477.5808) through 2^63 - 1 (+922,337,203,685,477.5807), with accuracy to a ten-thousandth of a monetary unit. | ||
float | Floating precision number data with the following valid values: -1.79E + 308 through -2.23E - 308, 0 and 2.23E + 308 through 1.79E + 308. | 4 or 8 bytes | |
real | Floating precision number data with the following valid values: -3.40E + 38 through -1.18E - 38, 0 and 1.18E - 38 through 3.40E + 38. | 4 bytes | |
date | |||
datetimeoffset | |||
datetime2 | |||
smalldatetime | Date and time data from January 1, 1900, through June 6, 2079, with an accuracy of one minute. | ||
datetime | Date and time data from January 1, 1753, through December 31, 9999, with an accuracy of three-hundredths of a second, or 3.33 milliseconds. | ||
time | |||
char | Fixed-length non-Unicode character data with a maximum length of 8,000 characters. | ||
varchar | Variable-length non-Unicode data with a maximum of 8,000 characters. | ||
next | Variable-length non-Unicode data with a maximum length of 2^31 - 1 (2,147,483,647) characters | ||
nchar | Fixed-length Unicode data with a maximum length of 4,000 characters | ||
nvarchar | Variable-length Unicode data with a maximum length of 4,000 characters. sysname is a system-supplied user-defined data type that is functionally equivalent to nvarchar(128) and is used to reference database object names. | ||
ntext | Variable-length Unicode data with a maximum length of 2^30 - 1 (1,073,741,823) characters. | ||
binary | Fixed-length binary data with a maximum length of 8,000 bytes. | ||
varbinary | Variable-length binary data with a maximum length of 8,000 bytes. | ||
image | Variable-length binary data with a maximum length of 2^31 - 1 (2,147,483,647) bytes. | ||
cursor | A reference to a cursor. | ||
timestamp | A database-wide unique number that gets updated every time a row gets updated. | ||
hierarchyid | |||
uniquieidentifier | A globally unique identifier (GUID). | ||
sql_variant | |||
xml | |||
table |