JSON parser grammar

David-Sarah Hopwood david-sarah at jacaranda.org
Tue Jun 9 15:30:14 PDT 2009


John Cowan wrote:
> Tyler Close scripsit:
> 
>> Does it make sense to specify the character escaping rules as a
>> whitelist of Unicode general categories that don't need to be escaped;
>> leaving all else to be escaped? Seems safer than a blacklist of
>> specific Unicode characters to escape.
> 
> Not really.  The point of the blacklist is that some ES3 implementations
> simply drop certain characters when they appear literally in "eval" input,
> so they need to be escaped.

Either drop or reject; they need to be escaped either way.

> The exact set of characters is the union of all such known bugs.

To be more precise,
 - dropping format-control characters in ECMAScript source is correct
   according to ES3;
 - rejecting <LS> and <PS> in an ECMAScript string literal is correct
   according to ES3 and draft ES5;
 - rejecting format-control characters in ECMAScript source is incorrect
   according to ES3 and draft ES5, but is done anyway by several JS
   implementations;
 - dropping or rejecting any other code units, including UTF-16 surrogates,
   is incorrect according to ES3 and draft ES5 (although accepting
   noncharacters is incorrect according to the Unicode standard);
 - it is therefore incorrect for a JSON parser to use 'eval' without
   first escaping at least format-control characters, <LS>, and <PS>.
   Many JSON parsers have this bug, including the one in section 6 of
   the JSON RFC;
 - furthermore, an eval-based JSON parser that fails to escape any
   other code units that the JS implementation it is running on drops
   or rejects will fail to conform to JSON, even though the bug is,
   strictly speaking, in the JS implementation rather than the parser.
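
As a rough illustration of the last two points, an eval-based parser could
rewrite the problematic characters as \uXXXX escapes before calling eval.
This is only a sketch under stated assumptions, not code from any parser
mentioned here: the function names are hypothetical, it relies on the
modern \p{Cf} regular-expression property escape (which ES3 engines do not
support), and it assumes the text has already been validated as JSON, so
that these characters can only occur inside string literals.

```javascript
// Hypothetical names; a sketch, not a reference implementation.
// Matches format-control characters (general category Cf), <LS> and <PS>.
const FORMAT_LS_PS = /[\p{Cf}\u2028\u2029]/gu;

// Turn every UTF-16 code unit of `chars` into a \uXXXX escape, so that a
// supplementary-plane character becomes an escaped surrogate pair.
function asUnicodeEscapes(chars) {
  let out = "";
  for (let i = 0; i < chars.length; i++) {
    out += "\\u" + chars.charCodeAt(i).toString(16).padStart(4, "0");
  }
  return out;
}

// ASSUMPTION: `text` has already passed a JSON validity check.
// eval must never be applied to unvalidated input.
function evalValidatedJson(text) {
  const escaped = text.replace(FORMAT_LS_PS, (m) => asUnicodeEscapes(m));
  return eval("(" + escaped + ")");
}
```

Because the escapes denote the same code units as the literal characters,
the parsed value is unchanged on an implementation that handles the
literals correctly.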

All of this can be worked around by doing escaping in the emitter,
which compensates for a nonconformant eval-based JSON parser failing
to do so, and is harmless to conformant parsers.
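
The emitter-side workaround might look something like the following
sketch. The function name is hypothetical, JSON.stringify and the \p{Cf}
property escape are assumed to be available (neither exists in ES3), and
the value is assumed to be serializable. Since JSON.stringify without an
indent argument emits no characters of this kind outside string literals,
a plain textual replacement is enough.

```javascript
// Hypothetical helper; a sketch under the assumptions stated above.
function emitEscapedJson(value) {
  const json = JSON.stringify(value);
  // Escape format-control characters (category Cf), <LS> and <PS>.
  return json.replace(/[\p{Cf}\u2028\u2029]/gu, (m) => {
    let out = "";
    // Escape each UTF-16 code unit separately so that supplementary-plane
    // characters become a pair of \uXXXX escapes.
    for (let i = 0; i < m.length; i++) {
      out += "\\u" + m.charCodeAt(i).toString(16).padStart(4, "0");
    }
    return out;
  });
}
```

A conformant parser reads the escaped output back to the same value, which
is why the extra escaping is harmless.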

Since there is no known problem with code units that do not correspond
to format-control characters, noncharacters, or other characters on
the list, there's no rationale for escaping those.

There is a possibility of additional format-control characters being
added to Unicode, but provided that JavaScript implementations support
ES5 before they upgrade the lexer to recognise such additional
characters, that would not cause a problem.

-- 
David-Sarah Hopwood  ⚥  http://davidsarah.livejournal.com


