Unicode support in new ES6 spec draft

Norbert Lindenberg ecmascript at norbertlindenberg.com
Tue Jul 17 13:23:31 PDT 2012

On Jul 16, 2012, at 16:41, Allen Wirfs-Brock wrote:

> On Jul 16, 2012, at 2:54 PM, Gillam, Richard wrote:
>> Commenting on Norbert's comments…
>> ...
>>> Careful here. I think we have to treat \uDxxx\uDyyy, where 0x800 ≤ xxx < 0xC00 ≤ yyy ≤ 0xFFF, as a single code point in all situations. There are tools around that convert any non-ASCII characters into (old-style) Unicode escapes.
>> What I was getting, perhaps erroneously, from Allen's comments is that \ud800\udc00, \u{d800}\u{dc00}, and \u{10000} are all equivalent, all the time, and in all situations where the implementation assigns meaning to the characters, they all refer to the character U+10000.  This seems like the right way to go.
> To further clarify my position: I don't currently agree with Norbert's assertion WRT "situations". For more discussion see
> https://bugs.ecmascript.org/show_bug.cgi?id=469 
> https://bugs.ecmascript.org/show_bug.cgi?id=525 
> The current spec draft (except where RegExp still needs updating) treats \ud800\udc00, \u{d800}\u{dc00}, and \u{10000} as equivalent in all literal situations. As currently spec'ed, explicit UTF-16 escape sequences such as \ud800\udc00 are not decoded as a single code point in non-literal contexts such as identifiers. Such sequences currently generate errors in existing implementations, so there aren't any backwards compatibility issues.
> I'm taking this position because I want to discourage programmers from hand-encoding UTF-16 and instead have them use \u{} to express actual code points for supplementary characters. For backwards compat, \uDnnn\uDnnn needs to be recognized in literals, but there is no need to allow it where it has not been allowed in existing implementations.

It's not a backwards compatibility issue.

It's an issue with having to explain to developers that sometimes \u{10000} and \ud800\udc00 are equivalent, and sometimes they're not. That's the kind of inconsistency that makes the language more difficult to understand.
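
A minimal sketch of that inconsistency, assuming an engine that implements the \u{} syntax from the draft. The helper name `parses` and the variable names are made up for illustration. The string comparison follows from the equivalence described above for literals. What the two identifier checks report depends on the engine; under the draft as Allen describes it, the surrogate-pair spelling would be rejected in an identifier.

```javascript
// Three spellings of U+10000 inside a string literal.
const surrogateEscapes = "\ud800\udc00";
const bracedSurrogates = "\u{d800}\u{dc00}";
const bracedCodePoint = "\u{10000}";

// In literals, all three produce the same two UTF-16 code units.
const allEqual =
  surrogateEscapes === bracedSurrogates &&
  bracedSurrogates === bracedCodePoint;

// Hypothetical helper: reports whether a piece of source text parses.
function parses(sourceText) {
  try {
    new Function(sourceText); // parse only, the function is never called
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e;
  }
}

// The same two spellings, this time used as an identifier.
const bracedIdentifierParses = parses("var \\u{10000} = 1;");
const surrogateIdentifierParses = parses("var \\ud800\\udc00 = 1;");

console.log({ allEqual, bracedIdentifierParses, surrogateIdentifierParses });
```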

And it may be an issue with tools that convert non-ASCII characters into (old-style) Unicode escapes.
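
To make the tooling concern concrete, here is a rough sketch of such a converter. The function `escapeNonAscii` is hypothetical and not taken from any particular tool. It works on UTF-16 code units, so a supplementary character comes out as a pair of old-style escapes.

```javascript
// Hypothetical converter: replaces every non-ASCII UTF-16 code unit
// with an old-style \uXXXX escape and leaves ASCII untouched.
function escapeNonAscii(source) {
  let out = "";
  for (let i = 0; i < source.length; i++) {
    const unit = source.charCodeAt(i); // one UTF-16 code unit
    if (unit < 0x80) {
      out += source[i];
    } else {
      out += "\\u" + unit.toString(16).padStart(4, "0");
    }
  }
  return out;
}

// U+10000 is stored as two code units, so two escapes are emitted.
console.log(escapeNonAscii("x = '\u{10000}'"));
```

A tool like this does not know whether the character it rewrites sits in a string literal or in an identifier. If \ud800\udc00 is decoded as one code point only in literals, its output for an identifier containing a supplementary character would no longer parse.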
