Full Unicode strings strawman

Allen Wirfs-Brock allen at wirfs-brock.com
Mon May 16 19:46:06 PDT 2011

On May 16, 2011, at 6:51 PM, Mike Samuel wrote:

> 2011/5/16 Allen Wirfs-Brock <allen at wirfs-brock.com>:
>> If the string is written as "\ud800\udc00\u0061" the 'a' will be at offset
>> 2, even in the new proposal.  It would only be at offset 1 if it was written
>> as "\u+010000\u+000061"  (using the literal notation from the proposal).
> Under this scheme,
>     eval('  "\\uD834\\uDD1E"  ')  !== JSON.parse('  "\\uD834\\uDD1E"  ')

It is already the case that eval(x)===JSON.parse(x) is not necessarily true for values of x that are valid JSON source strings.  See http://timelessrepo.com/json-isnt-a-javascript-subset

> From RFC 4627
> """
>   To escape an extended character that is not in the Basic Multilingual
>   Plane, the character is represented as a twelve-character sequence,
>   encoding the UTF-16 surrogate pair.  So, for example, a string
>   containing only the G clef character (U+1D11E) may be represented as
>   "\uD834\uDD1E".
> """

That's what it says, but it's not what JSON parsers do.  They don't generate a one-character ES string containing only the single character U+1D11E.  They generate a two-character ES string containing the character pair \uD834 and \uDD1E.  Essentially, JSON.parse currently generates UCS-2 strings that may be interpreted as UTF-16 by the application layer.  Nothing would change in this regard under my proposal.
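To make that concrete, here is a small sketch that can be run in a current engine. The variable names are only illustrative, and String.prototype.codePointAt is assumed to be available (it was added in a later edition of the language than the one under discussion here); it stands in for the application-layer UTF-16 interpretation.

```javascript
// RFC 4627's G clef example as JSON source text: a twelve-character escape sequence.
const jsonText = '"\\uD834\\uDD1E"';
const parsed = JSON.parse(jsonText);

// The result holds two 16-bit code units, not one character.
const codeUnits = [];
for (let i = 0; i < parsed.length; i++) {
  codeUnits.push(parsed.charCodeAt(i).toString(16).toUpperCase());
}
console.log(parsed.length);        // 2
console.log(codeUnits.join(" "));  // D834 DD1E

// Application-layer UTF-16 interpretation of the pair.
// Assumption: codePointAt exists in the engine running this sketch.
const codePoint = parsed.codePointAt(0);
console.log(codePoint === 0x1D11E); // true

// With today's string semantics, eval of the same source text
// yields an identical two-code-unit string.
const evaluated = eval(jsonText);
console.log(evaluated === parsed);  // true

// A character following the pair sits at code-unit offset 2.
const withA = "\ud800\udc00\u0061";
console.log(withA.indexOf("a"));    // 2
```

The last line is the same offset arithmetic as in the quoted "\ud800\udc00\u0061" example: two code units precede the 'a'.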

Interestingly, RFC 4627 says that JSON text is "Unicode" and may be encoded in UTF-8, 16, or 32, and one of the alternatives for the 'unescaped' production is %x5D-10FFFF.  That seems to suggest that supplementary characters are allowed to occur in JSON text without escaping.  JSON.parse doesn't support this.  In fact, it looks like JSON.parse doesn't conform to the RFC 4627 parsing requirements.  To do so, it would either have to be explicitly told whether the encoding used within the JS String argument was UTF-8 or UTF-16, or it would have to determine the encoding from the first four octets as described in section 3 of 4627...
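For reference, the section 3 heuristic relies on the first two characters of a JSON text always being ASCII, so the pattern of zero octets among the first four reveals the encoding. A rough sketch follows; the function name detectJsonEncoding and the fallback to UTF-8 for inputs shorter than four octets are my own assumptions, not something the RFC or any engine defines.

```javascript
// Hypothetical helper sketching the RFC 4627 section 3 detection table.
// `octets` is an array or Uint8Array holding the raw bytes of a JSON text.
function detectJsonEncoding(octets) {
  // Assumption: with fewer than four octets, fall back to the default, UTF-8.
  if (octets.length < 4) return "UTF-8";

  // Record which of the first four octets are zero.
  const zero = [0, 1, 2, 3].map((i) => octets[i] === 0);
  const pattern = zero.map((z) => (z ? "0" : "x")).join("");

  switch (pattern) {
    case "000x": return "UTF-32BE"; // 00 00 00 xx
    case "0x0x": return "UTF-16BE"; // 00 xx 00 xx
    case "x000": return "UTF-32LE"; // xx 00 00 00
    case "x0x0": return "UTF-16LE"; // xx 00 xx 00
    default:     return "UTF-8";    // xx xx xx xx (and anything else)
  }
}

// The text {"a":1} in two encodings.
const utf8Octets = [0x7b, 0x22, 0x61, 0x22, 0x3a, 0x31, 0x7d];
const utf16leOctets = [0x7b, 0x00, 0x22, 0x00, 0x61, 0x00, 0x22, 0x00];
console.log(detectJsonEncoding(utf8Octets));
console.log(detectJsonEncoding(utf16leOctets));
```

Note that this operates on octets, which is exactly the information JSON.parse does not have when it is handed an ES String.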
