DAP4: DAP4 Escapes

From OPeNDAP Documentation

Background

The character escaping mechanisms in DAP2 have, in retrospect, caused significant confusion and led to conflicting implementations.

Character escaping (aka escapes) occurs in several places.

  1. Identifiers: some characters in identifiers (blanks, for example) require escaping in certain syntactic situations in order to properly be interpreted.
  2. String and character constants: at least the surrounding quote character (typically " or ' ) requires some form of escape so that it can occur inside string or character constants.
  3. Queries: a number of characters ('&', '.', etc.) have special meaning when they occur as part of a DAP2 query, and so they require escaping if they are, for example, part of an identifier or constant.

The DAP2 escape mechanism was chosen to be the same as the standard URL escape mechanism, in which a character is converted to two hex digits and represented as %HH, where H is a hex digit. In retrospect, this led to confusion, especially when escaping in queries, about when one was doing DAP2 escaping and when one was doing URL escaping.
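
The ambiguity can be illustrated with a small sketch using Python's standard urllib.parse. The identifier "a.b" and its DAP2-escaped form "a%2Eb" are hypothetical examples, not taken from any DAP2 implementation. The point is only that a %HH escape cannot be told apart from a URL escape once it is on the wire.

```python
from urllib.parse import quote, unquote

# Hypothetical identifier "a.b" whose dot has been DAP2-escaped as %2E.
dap2_escaped = "a%2Eb"

# A framework that URL-decodes the query removes the DAP2 escape as well,
# so the parser can no longer tell the dot was meant to be literal.
decoded_once = unquote(dap2_escaped)

# A client that URL-encodes the already escaped text encodes the '%' itself,
# so the server must now decode twice to recover the identifier.
url_encoded = quote(dap2_escaped, safe="")

print(decoded_once)          # a.b
print(url_encoded)           # a%252Eb
print(unquote(url_encoded))  # a%2Eb
```

Whether a given string has been decoded zero, one, or two times cannot be determined from the string alone, which is the confusion described above.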

Proposal

The primary proposal is to ensure that we use escaping mechanisms that are clearly not the same as the standard URL escaping mechanism.

For DAP4, the problem simplifies significantly because the DDX uses XML, so we can directly use the standard XML entity escape mechanism, which in its most general form is &#DDD;, that is, an ampersand followed by a number sign (#), some number of decimal digits, and a semicolon.

In practice, only four escapes are needed.

  1. &amp; (&)
  2. &gt; (>)
  3. &lt; (<)
  4. &quot; (")

Note that XML entities may also occur in attribute values, so they can be used as the general escape mechanism in XML, and they are quite distinct from the URL encoding format.
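
As a minimal sketch of the four escapes listed above, the following uses Python's standard xml.sax.saxutils module. The helper names xml_escape and xml_unescape are made up for illustration and are not part of any DAP4 library. The library functions handle &, < and > by default, and the double quote is supplied through their optional entities argument.

```python
from xml.sax.saxutils import escape, unescape

def xml_escape(text):
    """Escape &, <, > and the double quote using XML entities."""
    return escape(text, {'"': "&quot;"})

def xml_unescape(text):
    """Reverse xml_escape for the same four entities."""
    return unescape(text, {"&quot;": '"'})

original = 'a<b & "c">d'
escaped = xml_escape(original)
print(escaped)  # a&lt;b &amp; &quot;c&quot;&gt;d
print(xml_unescape(escaped) == original)  # True
```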

The issue that needs to be addressed is how to do escapes in queries (i.e., anything after the left-most '?' in the URL). I propose we choose one of the following two options.

  1. Standard C/Java-style '\' escaping, where encountering \c changes the interpretation of character c.
  2. Use some other escape mechanism that supports general representation as a sequence of hex, octal, or decimal digits preceded by some marker. Examples include the following.
    1. XML notation
    2. \xHH notation
    3. %HH notation (seriously deprecated because of the problems we saw with DAP2).
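
For comparison, the three numeric notations named under option 2 can be produced for a single character as follows. This is only a sketch: the function name numeric_escapes is hypothetical, and it assumes a character whose code fits in two hex digits (below 256), since the proposal does not say how larger code points would be written.

```python
def numeric_escapes(ch):
    """Return one character in the three numeric notations of option 2.

    Assumes ord(ch) < 256 so that two hex digits are enough.
    """
    code = ord(ch)
    return {
        "xml": f"&#{code};",         # decimal digits, XML style
        "backslash": f"\\x{code:02X}",  # \xHH
        "percent": f"%{code:02X}",      # %HH, the DAP2/URL form
    }

print(numeric_escapes("."))
# {'xml': '&#46;', 'backslash': '\\x2E', 'percent': '%2E'}
```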

Discussion

Using the backslash escaping mechanism has the advantage that it is well known and is easy to parse. It has the disadvantage that it still leaves the offending character in the string (i.e., "\/" still leaves the '/' as part of the string); this is not a big point, but it may be worth thinking about to see if it causes other problems.
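
A backslash scheme could be sketched as below. The set of special characters is an assumption chosen for illustration (the proposal does not fix one), and the function names are hypothetical. Note that the escaped output still contains the special character itself, which is the disadvantage mentioned above.

```python
# Assumed set of characters with special meaning in a query; illustrative only.
SPECIAL = set(".&?=,\\")

def backslash_escape(text, special=SPECIAL):
    """Prefix every special character with a backslash."""
    return "".join("\\" + ch if ch in special else ch for ch in text)

def backslash_unescape(text):
    """Drop each escaping backslash and keep the character after it."""
    out = []
    chars = iter(text)
    for ch in chars:
        if ch == "\\":
            # A trailing lone backslash is kept as a literal backslash.
            out.append(next(chars, "\\"))
        else:
            out.append(ch)
    return "".join(out)

name = "station.id"
escaped = backslash_escape(name)
print(escaped)                      # station\.id
print(backslash_unescape(escaped))  # station.id
```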

Using, for example, the XML escape mechanism has the advantage of consistency. It is, however, slightly more complicated to parse. It is also somewhat more difficult for a user to build a URL containing a query by hand.

Dennis Heimbigner

Jimg 15:56, 10 April 2012 (PDT) I favor the backslash (\) escaping for DAP4 with the assumption that some clients will apply the HTTP/URL escaping as they see fit. The servers will first 'unescape' for the HTTP encoding. During processing they will use the DAP4-escaped data as appropriate. For example, a parser for the constraint expression would need to have dots (.) in variable names escaped so that it will correctly parse (while 'structural' dots would not be escaped).
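
The distinction between escaped and 'structural' dots can be sketched as a small splitter. This is not the constraint expression parser of any server; split_path is a hypothetical helper that assumes backslash escaping and handles only the dot separator.

```python
def split_path(path):
    """Split a dotted variable path on unescaped dots.

    A backslash makes the following character literal, so "\\." stays
    inside a name while a bare "." separates names.
    """
    parts, current = [], []
    chars = iter(path)
    for ch in chars:
        if ch == "\\":
            # Take the next character literally (a lone trailing
            # backslash is kept as-is).
            current.append(next(chars, "\\"))
        elif ch == ".":
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts

print(split_path("station\\.id.temp"))  # ['station.id', 'temp']
```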

Why we need two escaping mechanisms

The answer is that we don't. We need only one, but we must realize that, in addition to the DAP4 escaping, some clients will employ HTTP/URL escaping, and that is out of our control. DAP4 will provide the application-layer escaping, while HTTP may provide transport-layer escaping. The clients need know nothing about this. The servers must know whether the HTTP frameworks with which they work will undo the HTTP/URL escaping that may have been applied to the query string part of the URL (Apache does not). If the framework does not, the server becomes responsible for that.
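
The two layers can be sketched as follows, using Python's standard urllib.parse for the HTTP/URL layer. The function decode_query and its framework_already_decoded flag are hypothetical; how a real server learns what its framework has done depends on the framework. The sketch assumes backslash escaping at the DAP4 layer.

```python
from urllib.parse import quote, unquote

def decode_query(raw_query, framework_already_decoded):
    """Undo HTTP/URL escaping exactly once, leaving DAP4 escapes intact."""
    if framework_already_decoded:
        return raw_query
    return unquote(raw_query)

# DAP4-escaped query text (backslash scheme assumed).
dap4_query = "station\\.id.temp"

# One client sends it as-is; another URL-encodes it first.
sent_plain = dap4_query
sent_encoded = quote(dap4_query, safe="")

print(sent_encoded)                       # station%5C.id.temp
print(decode_query(sent_encoded, False))  # station\.id.temp
print(decode_query(sent_plain, True))     # station\.id.temp
```

Either way the DAP4 layer receives the same backslash-escaped text, because the backslash form cannot be mistaken for a %HH escape. The flag matters when the data contains a literal '%': decoding a second time would corrupt it.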