JSON parser grammar

John Cowan cowan at ccil.org
Fri Jun 12 13:49:28 PDT 2009

Waldemar Horwat scripsit:

> I don't like the idea of having valid native ES strings that cannot
> be serialized.  The sensible thing to do is to just escape surrogates,
> whether they are paired or not.

Unfortunately, RFC 4627 says plainly in section 3:

	JSON text SHALL be encoded in Unicode.

The cited version is Unicode 4.1.  As of Unicode 4.0, UTF-* documents
are ill-formed if they contain unpaired surrogates; only the codepoints
U+0000 to U+D7FF and U+E000 to U+10FFFF are encodable.  The fact that
the ABNF seems to allow U+D800 to U+DFFF is irrelevant.
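
To make the well-formedness rule concrete, here is a minimal sketch of a check for unpaired surrogates in an ES string. The function name findUnpairedSurrogate is hypothetical; nothing like it appears in RFC 4627 or the ES5 draft.

```javascript
// Illustrative helper (an assumption, not part of RFC 4627 or ES5):
// returns the index of the first unpaired surrogate code unit in str,
// or -1 if every surrogate is part of a well-formed high/low pair.
function findUnpairedSurrogate(str) {
  for (var i = 0; i < str.length; i++) {
    var unit = str.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // High surrogate: the next code unit must be a low surrogate.
      var next = i + 1 < str.length ? str.charCodeAt(i + 1) : 0;
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // skip the low half of a valid pair
      } else {
        return i;
      }
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      // Low surrogate that was not consumed as the second half of a pair.
      return i;
    }
  }
  return -1;
}
```

A string for which this returns -1 contains only code points in the ranges U+0000 to U+D7FF and U+E000 to U+10FFFF, and so can be encoded in any UTF-* form.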

> This is an issue not just for surrogates.  There are 66 other code
> units that are not Unicode characters.  For example:
> \uFFFE
> \uFFFF
> These are covered by the same Unicode conformance clause as unpaired
> surrogates, so we must treat them the same way.

Section 2.5 of the RFC says that all Unicode characters may appear
within quotes except those that must be escaped: by clear implication,
non-characters may not appear within quotes.  We are also told
"Any character may be escaped", but there is no permission to escape
non-characters.  This is appropriate, because JSON is an interchange
format (per the Abstract), and non-characters should never be used
in interchange.

In short, ES5 JSON encoders should check for non-characters and unpaired
surrogates and refuse to encode them.
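
What such a refusing encoder might look like can be sketched as follows. This is an illustration only: the names isNoncharacter, assertInterchangeable and strictStringify are hypothetical, throwing a TypeError is an assumption about how the refusal would be reported, and the wrapper relies on a JSON.stringify that accepts a replacer function.

```javascript
// The 66 noncharacters: U+FDD0..U+FDEF, plus the last two code points
// of each of the 17 planes (U+FFFE, U+FFFF, U+1FFFE, ... U+10FFFF).
function isNoncharacter(cp) {
  return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) === 0xFFFE;
}

// Hypothetical check: throws if str contains an unpaired surrogate or a
// noncharacter; otherwise returns str unchanged.
function assertInterchangeable(str) {
  for (var i = 0; i < str.length; i++) {
    var cp = str.charCodeAt(i);
    if (cp >= 0xD800 && cp <= 0xDBFF) {
      var low = i + 1 < str.length ? str.charCodeAt(i + 1) : 0;
      if (low < 0xDC00 || low > 0xDFFF) {
        throw new TypeError("unpaired high surrogate at index " + i);
      }
      // Combine the pair into a supplementary code point.
      cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
      i++;
    } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
      throw new TypeError("unpaired low surrogate at index " + i);
    }
    if (isNoncharacter(cp)) {
      throw new TypeError("noncharacter U+" + cp.toString(16).toUpperCase());
    }
  }
  return str;
}

// Hypothetical wrapper: checks every property name and every primitive
// string value before JSON.stringify serializes it.
function strictStringify(value) {
  return JSON.stringify(value, function (key, val) {
    assertInterchangeable(key);
    if (typeof val === "string") assertInterchangeable(val);
    return val;
  });
}
```

With this sketch, strictStringify({a: "ok"}) serializes normally, while strictStringify(["\uFFFE"]) and strictStringify("\uD800") throw instead of producing JSON text.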

The Unicode Standard does not encode            John Cowan
idiosyncratic, personal, novel, or private      http://www.ccil.org/~cowan
use characters, nor does it encode logos
or graphics.                                    cowan at ccil.org
