JSON parser grammar
david-sarah at jacaranda.org
Tue Jun 9 15:30:14 PDT 2009
John Cowan wrote:
> Tyler Close scripsit:
>> Does it make sense to specify the character escaping rules as a
>> whitelist of Unicode general categories that don't need to be escaped;
>> leaving all else to be escaped? Seems safer than a blacklist of
>> specific Unicode characters to escape.
> Not really. The point of the blacklist is that some ES3 implementations
> simply drop certain characters when they appear literally in "eval" input,
> so they need to be escaped.
Either drop or reject; they need to be escaped either way.
> The exact set of characters is the union of all such known bugs.
To be more precise,
- dropping format-control characters in ECMAScript source is correct
according to ES3;
- rejecting <LS> and <PS> in an ECMAScript string literal is correct
according to ES3 and draft ES5;
- rejecting format-control characters in ECMAScript source is incorrect
according to ES3 and draft ES5, but is done anyway by several JS
implementations;
- dropping or rejecting any other code units, including UTF-16 surrogates,
is incorrect according to ES3 and draft ES5 (although accepting
noncharacters is incorrect according to the Unicode standard);
- it is therefore incorrect for a JSON parser to use 'eval' without
first escaping at least format-control characters, <LS>, and <PS>.
Many JSON parsers have this bug, including the one in section 6 of
the JSON RFC;
- furthermore, an eval-based JSON parser that fails to escape any
other code units that are dropped or rejected by the JS implementation
it is running on will fail to conform to JSON, even though the bug
is, strictly speaking, in the JS implementation rather than the parser.
All of this can be worked around by doing escaping in the emitter,
which compensates for a nonconformant eval-based JSON parser failing
to do so, and is harmless to conformant parsers.
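
As a rough illustration of emitter-side escaping, here is a minimal
sketch. The names RISKY, escapeRisky and emitJson are hypothetical and
are not taken from any existing library. It assumes an engine that has a
native JSON.stringify and supports Unicode property escapes in regular
expressions (\p{Cf} with the "u" flag), so the set of format-control
characters it recognises depends on the Unicode version that engine
implements.

```javascript
// Hypothetical helper names; an illustrative sketch, not a reference implementation.
// Matches format-control characters (general category Cf), <LS> (U+2028)
// and <PS> (U+2029). Requires support for Unicode property escapes.
const RISKY = /[\p{Cf}\u2028\u2029]/gu;

// Replace each matched character with \uXXXX escapes, one per UTF-16 code unit.
// With the "u" flag a match may be a surrogate pair (e.g. U+E0001), so we
// walk the code units of the match rather than assuming a single unit.
function escapeRisky(jsonText) {
  return jsonText.replace(RISKY, (ch) => {
    let out = "";
    for (let i = 0; i < ch.length; i++) {
      out += "\\u" + ch.charCodeAt(i).toString(16).padStart(4, "0");
    }
    return out;
  });
}

// Emit JSON with the risky characters escaped. JSON.stringify called without
// an indent argument only ever places these characters inside string
// literals, where a \uXXXX escape is valid JSON, so post-processing the whole
// output is safe under that assumption.
function emitJson(value) {
  return escapeRisky(JSON.stringify(value));
}
```

For example, emitJson("a\u2028b") yields the text "a\u2028b" with the
separator written as a six-character escape, and a conformant parser
reading that text recovers the original string unchanged.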
Since there is no known problem with code units that do not correspond
to format-control characters, noncharacters, or other characters on
the list, there's no rationale for escaping those.
There is a possibility of additional format-control characters being
ES5 before they upgrade the lexer to recognise such additional
characters, that would not cause a problem.
David-Sarah Hopwood ⚥ http://davidsarah.livejournal.com