JSON parser grammar

Douglas Crockford douglas at crockford.com
Fri Jun 12 16:10:05 PDT 2009


John Cowan wrote:
> Waldemar Horwat scripsit:
> 
>> I don't like the idea of having valid native ES strings that cannot
>> be serialized.  The sensible thing to do is to just escape surrogates,
>> whether they are paired or not.
> 
> Unfortunately, RFC 4627 says plainly in section 3:
> 
> 	JSON text SHALL be encoded in Unicode.
> 
> The cited version is Unicode 4.1.  As of Unicode 4.0, UTF-* documents
> are ill-formed if they contain unpaired surrogates; only the codepoints
> U+0000 to U+D7FF and U+E000 to U+10FFFF are encodable.  The fact that
> the ABNF seems to allow U+D800 to U+DFFF is irrelevant.
> 
>> This is an issue not just for surrogates.  There are 66 other code
>> units that are not Unicode characters.  For example:
>>
>> \uFFFE
>> \uFFFF
>>
>> These are covered by the same Unicode conformance clause as unpaired
>> surrogates, so we must treat them the same way.
> 
> Section 2.5 of the RFC says that all Unicode characters may appear
> within quotes except those that must be escaped: by clear implication,
> non-characters may not appear within quotes.  We are also told
> "Any character may be escaped", but there is no permission to escape
> non-characters.  This is appropriate, because JSON is an interchange
> format (per the Abstract), and non-characters should never be used
> in interchange.
> 
> In short, ES5 JSON encoders should check for non-characters and unpaired
> surrogates and refuse to encode them.

I think that is a serious misreading of the intent of the RFC.
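
For readers following the thread, the two positions can be sketched roughly as below. This is only an illustration under assumptions: the names isNoncharacter, findProblems, stringifyStrict and stringifyEscaping are hypothetical and appear neither in the RFC nor in any ES5 draft, and the code relies only on the built-in JSON.stringify and JSON.parse.

```javascript
// Hypothetical helpers for illustration; not part of RFC 4627 or ES5.

// True for the 66 Unicode noncharacters: U+FDD0..U+FDEF, plus the last
// two code points of each of the 17 planes (U+nFFFE and U+nFFFF).
function isNoncharacter(codePoint) {
  return (codePoint >= 0xFDD0 && codePoint <= 0xFDEF) ||
         (codePoint & 0xFFFE) === 0xFFFE;
}

// Walk the UTF-16 code units of str and report unpaired surrogates
// and noncharacters, with the index of the offending code unit.
function findProblems(str) {
  var problems = [];
  for (var i = 0; i < str.length; i++) {
    var unit = str.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // High surrogate: it is only valid if a low surrogate follows.
      var next = i + 1 < str.length ? str.charCodeAt(i + 1) : 0;
      if (next >= 0xDC00 && next <= 0xDFFF) {
        var cp = 0x10000 + ((unit - 0xD800) << 10) + (next - 0xDC00);
        if (isNoncharacter(cp)) {
          problems.push({ index: i, kind: 'noncharacter' });
        }
        i++; // skip the low surrogate that completed the pair
      } else {
        problems.push({ index: i, kind: 'unpaired surrogate' });
      }
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      // Low surrogate with no preceding high surrogate.
      problems.push({ index: i, kind: 'unpaired surrogate' });
    } else if (isNoncharacter(unit)) {
      problems.push({ index: i, kind: 'noncharacter' });
    }
  }
  return problems;
}

// The "refuse to encode" reading: throw instead of serializing.
function stringifyStrict(str) {
  var problems = findProblems(str);
  if (problems.length > 0) {
    throw new TypeError(problems[0].kind + ' at index ' + problems[0].index);
  }
  return JSON.stringify(str);
}

// The "just escape" reading: every surrogate code unit (paired or not)
// and every BMP noncharacter is written as a \uXXXX escape, so any
// string can still be serialized and read back.
function stringifyEscaping(str) {
  return JSON.stringify(str).replace(
    /[\uD800-\uDFFF\uFDD0-\uFDEF\uFFFE\uFFFF]/g,
    function (u) {
      return '\\u' + ('0000' + u.charCodeAt(0).toString(16)).slice(-4);
    }
  );
}
```

Under the second reading, JSON.parse(stringifyEscaping(s)) gives back s for any string s, which is the property the "valid native ES strings that cannot be serialized" objection is about. Under the first, stringifyStrict('a\uD800b') throws.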

