Full Unicode strings strawman
allen at wirfs-brock.com
Mon May 16 19:46:06 PDT 2011
On May 16, 2011, at 6:51 PM, Mike Samuel wrote:
> 2011/5/16 Allen Wirfs-Brock <allen at wirfs-brock.com>:
>> If the string is written as "\ud800\udc00\u0061" the 'a' will be at offset
>> 2, even in the new proposal. It would only be at offset 1 if it was written
>> as "\u+010000\u+000061" (using the literal notation from the proposal).
> Under this scheme,
> eval(' "\\uD834\\uDD1E" ') !== JSON.parse(' "\\uD834\\uDD1E" ')
> From RFC 4627
> To escape an extended character that is not in the Basic Multilingual
> Plane, the character is represented as a twelve-character sequence,
> encoding the UTF-16 surrogate pair. So, for example, a string
> containing only the G clef character (U+1D11E) may be represented as
That's what it says, but it's not what JSON parsers do. They don't generate a single-character ES string containing only the character U+1D11E. They generate a two-character ES string containing the character pair \uD834 and \uDD1E. Essentially, JSON.parse currently generates UCS-2 strings that may be interpreted as UTF-16 by the application layer. Nothing would change in this regard under my proposal.
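To make the current behavior concrete, here is a minimal sketch that can be run in any engine with JSON.parse. The variable names and the helper decodeSurrogatePair are made up for illustration; the helper stands in for whatever UTF-16 interpretation the application layer chooses to do.

```javascript
// The RFC 4627 example: G clef (U+1D11E) escaped as a surrogate pair.
const jsonText = '"\\uD834\\uDD1E"';
const parsed = JSON.parse(jsonText);

// JSON.parse yields two 16-bit elements, not one character.
const parsedLength = parsed.length;
const firstUnit = parsed.charCodeAt(0);   // 0xD834
const secondUnit = parsed.charCodeAt(1);  // 0xDD1E

// A string literal with the same escapes produces the same two elements today.
const sameAsLiteral = parsed === "\uD834\uDD1E";

// Hypothetical application-layer helper: combine a high/low surrogate
// pair into the code point it encodes under UTF-16.
function decodeSurrogatePair(high, low) {
  return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
}

const codePoint = decodeSurrogatePair(firstUnit, secondUnit);
console.log(parsedLength, sameAsLiteral, codePoint.toString(16));
```

The string itself never holds U+1D11E as a single element; only the decoding step at the application layer recovers it.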
Interestingly, RFC 4627 says that JSON text is "Unicode" and may be encoded in UTF-8, 16, or 32, and one of the alternatives for the 'unescaped' production is %x5D-10FFFF. That seems to suggest that supplementary characters are allowed to occur in JSON text without escaping. JSON.parse doesn't support this. In fact, it looks like JSON.parse doesn't conform to the RFC 4627 parsing requirements. To do so, it would either have to be explicitly told whether the encoding used within the JS String argument was UTF-8 or UTF-16, or it should determine the encoding from the first four octets as described in section 3 of RFC 4627...
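For reference, the section 3 detection can be sketched roughly as follows. It relies on the RFC's observation that the first two characters of a JSON text are always ASCII, so the pattern of zero octets among the first four reveals the encoding. The function name detectJsonEncoding and the fallback for inputs shorter than four octets are my own assumptions, not anything JSON.parse or the RFC defines.

```javascript
// Hypothetical helper: guess the encoding of a JSON octet stream from the
// pattern of zero octets in its first four octets (RFC 4627, section 3).
function detectJsonEncoding(octets) {
  // Assumption: fewer than four octets can only be a short UTF-8 text
  // such as "[]" or "{}", so fall back to UTF-8.
  if (octets.length < 4) return "UTF-8";

  const [a, b, c, d] = octets;
  if (a === 0 && b === 0 && c === 0 && d !== 0) return "UTF-32BE"; // 00 00 00 xx
  if (a === 0 && b !== 0 && c === 0 && d !== 0) return "UTF-16BE"; // 00 xx 00 xx
  if (a !== 0 && b === 0 && c === 0 && d === 0) return "UTF-32LE"; // xx 00 00 00
  if (a !== 0 && b === 0 && c !== 0 && d === 0) return "UTF-16LE"; // xx 00 xx 00
  return "UTF-8";                                                  // xx xx xx xx
}

// "[]" encoded as UTF-16LE, and the start of {"a"... encoded as UTF-8.
console.log(detectJsonEncoding([0x5B, 0x00, 0x5D, 0x00]));
console.log(detectJsonEncoding([0x7B, 0x22, 0x61, 0x22]));
```

Note that this works on octets, whereas JSON.parse receives a JS String whose elements are already 16 bits wide, which is why it would otherwise have to be told the encoding explicitly.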