DAP4: DAP4 Escapes: Difference between revisions

Revision as of 16:52, 10 April 2012

Background

The character escaping mechanisms in DAP2 have, in retrospect, caused significant confusion and led to conflicting implementations.

Character escaping (escapes, for short) occurs in several places.

Identifiers: some characters in identifiers (blanks, for example) require escaping in certain syntactic situations so that they are interpreted properly.
String and character constants: at least the surrounding quote character (typically " or ' ) requires some form of escape so that it can occur inside string or character constants.
Queries: a number of characters ('&', '.', etc.) have special meaning when they occur as part of a DAP2 query, and so they require escaping if they are, for example, part of an identifier or constant.

The DAP2 escape mechanism was chosen to be the same as the standard URL escape mechanism, in which a character is converted to two hex digits and represented as %HH, where H is a hex digit. Especially when doing escapes in queries, this led to confusion about when one was doing DAP2 escaping and when one was doing URL escaping.
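
The ambiguity can be illustrated with a small sketch. It is not taken from any DAP2 implementation; the identifier is hypothetical and Python's standard urllib.parse functions stand in for both the DAP2 escaping step and the URL escaping step, since both use the same %HH notation.

```python
from urllib.parse import quote, unquote

# Hypothetical identifier containing a blank.
identifier = "sea surface"

# Step 1: DAP2-style escaping of the identifier, which uses %HH.
dap2_escaped = quote(identifier, safe="")   # 'sea%20surface'

# Step 2: a client URL-escapes the query text, so the '%' is escaped again.
url_escaped = quote(dap2_escaped, safe="")  # 'sea%2520surface'

# The receiver cannot tell from the text alone how many layers to undo.
once = unquote(url_escaped)                 # 'sea%20surface'
twice = unquote(once)                       # 'sea surface'
```

Whether the server should stop at "once" or continue to "twice" depends on knowledge that is not carried in the string itself, which is the kind of confusion described above.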

Proposal

The primary proposal is to ensure that we use escaping mechanisms that are clearly not the same as the standard URL escaping mechanism.

For DAP4, the problem simplifies significantly because the DDX uses XML, so we can directly use the standard XML entity escape mechanism, which in its most general form is &#DDD;, that is, an ampersand followed by a sharp sign (#), followed by some number of decimal digits, followed by a semicolon.

In practice, only four characters need to be escaped, each of which has a predefined named entity.

&amp; (&)
&gt; (>)
&lt; (<)
&quot; (")

Note that XML entities may also occur in attribute values, so they can be used as the general escape mechanism in XML, and they are quite distinct from the URL encoding format.
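
As a minimal sketch of this mechanism, the following uses Python's standard library to escape the four characters and to build the general &#DDD; form. The helper names xml_escape and numeric_ref, and the Var element in the example, are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def xml_escape(text):
    # escape() handles &, > and < by default; the double quote is
    # added through the optional entities mapping.
    return escape(text, {'"': "&quot;"})

def numeric_ref(ch):
    # General form: ampersand, sharp, decimal digits, semicolon.
    return "&#%d;" % ord(ch)

escaped = xml_escape('a<b & "c"')   # 'a&lt;b &amp; &quot;c&quot;'
blank = numeric_ref(" ")            # '&#32;'

# Entities are also decoded inside attribute values by a standard parser.
element = ET.fromstring('<Var name="sea&#32;surface &amp; air"/>')
name = element.get("name")          # 'sea surface & air'
```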

The issue that needs to be addressed is how to do escapes in queries (i.e., anything after the left-most '?' in the URL). I propose that we choose one of the following two options.

Standard C/Java-style '\' escaping, where encountering \c changes the interpretation of character c.
Use some other escape mechanism that supports general representation as a sequence of hex, octal, or decimal digits preceded by some marker. Examples include the following.
1. XML notation
2. \xHH notation
3. %HH notation (seriously deprecated because of the problems we saw with DAP2).
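
To make the candidate notations concrete, here is a rough sketch of how each would encode the same identifier. The SPECIAL set is an assumption made for the example, not a list defined by this proposal, and the function names are hypothetical.

```python
# Assumed set of characters that are special in a query (illustrative only).
SPECIAL = set("&.,?=<>\"\\ ")

def backslash_escape(text):
    # Option 1: prefix each special character with a backslash.
    return "".join("\\" + c if c in SPECIAL else c for c in text)

def hex_escape(text):
    # \xHH notation: replace each special character with its hex code.
    return "".join("\\x%02X" % ord(c) if c in SPECIAL else c for c in text)

def xml_numeric_escape(text):
    # XML notation: replace each special character with &#DDD;.
    return "".join("&#%d;" % ord(c) if c in SPECIAL else c for c in text)

name = "temp.max"
print(backslash_escape(name))    # temp\.max
print(hex_escape(name))          # temp\x2Emax
print(xml_numeric_escape(name))  # temp&#46;max
```

Only the first form keeps the '.' in the encoded text; the other two replace it with digits after a marker.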

Discussion

Using the backslash escaping mechanism has the advantage that it is well known and is easy to parse. It has the disadvantage that it still leaves the offending character in the string (e.g., "\/" still leaves the '/' as part of the string); not a big point, but it may be worth thinking about to see if it causes other problems.

Using, for example, the XML escape mechanism has the advantage of consistency. It is, however, slightly more complicated to parse. It is also somewhat more difficult for a user to build a URL containing a query by hand.
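
One way the leftover character could cause trouble is sketched below, assuming '.' separates the components of a path in a query. The path strings and the split_unescaped helper are hypothetical.

```python
import re

# Intended components: "grid.v2" and "temp".
path_backslash = r"grid\.v2.temp"

# A naive split still sees the escaped '.', so it yields three pieces.
naive = path_backslash.split(".")   # ['grid\\', 'v2', 'temp']

def split_unescaped(text, sep="."):
    # Escape-aware split: a backslash makes the next character literal.
    parts, current, i = [], [], 0
    while i < len(text):
        c = text[i]
        if c == "\\" and i + 1 < len(text):
            current.append(text[i + 1])
            i += 2
        elif c == sep:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(c)
            i += 1
    parts.append("".join(current))
    return parts

aware = split_unescaped(path_backslash)   # ['grid.v2', 'temp']

# With numeric XML escapes the '.' is absent, so a plain split is enough,
# followed by decoding each piece.
path_numeric = "grid&#46;v2.temp"
decoded = [
    re.sub(r"&#(\d+);", lambda m: chr(int(m.group(1))), part)
    for part in path_numeric.split(".")
]                                         # ['grid.v2', 'temp']
```

The numeric form is not free of special characters either: it contains '&', which the Background section lists as special in queries, so a real parser would still have to recognize the "&#" prefix.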

Dennis Heimbigner

@@ Line 9: / Line 9: @@
 In retrospect, the DAP2 escape mechanism was chosen to be the same as the standard URL escape mechanism, when a character was converted to two hex digits and represented as %HH, where H is a hex digit. Especially when do escapes in queries, this led to confusion about when one was doing DAP2 escaping and when one was doing URL escaping.
-<ins>I want to promulgate a (IMO) important principle about
-escaping. Namely that the escaped form should not itself
-contain the original characters being escaped. This is somewhat
-difficult to define, but it seems to me that an escape mechanism
-should "inject" the unescaped character set into a smaller character set.</ins>
 ==Proposal ==
@@ Line 37: / Line 31: @@
 == Discussion ==
-<ins>Using the backslash escaping mechanism has the advantage that it is well known and is easy to parse. It has the disadvantage that it clearly violates the principal I stated above that the escaped character set should be "smaller" than the unescaped character set.
+<ins>Using the backslash escaping mechanism has the advantage that it is well known and is easy to parse. It has the disadvantage that it still leaves the offending character in the string (i.e. "\/", for example still leaves the '/' as part of the string; not a big point, but may be worth thinking about to see if it causes other problems.</ins>
-Using, for example, the XML escape mechanism has the advantage of consistency. It is, however, slightly more complicated to parse. It almost meets my principal of reducing the character set with the exception of the &amp; character.
+Using, for example, the XML escape mechanism has the advantage of consistency. It is, however, slightly more complicated to parse. It is also somewhat more difficult for a user to build by hand a url containing a query.
-My personal preference is to use the XML or equivalent notation. The XML escape mechanism will need to be used for characters like '.', and '/' and these do not have predefined mnemonic definitions, but we are not prevented from defining such entities ourselves.
+''Dennis Heimbigner''
