Unicode support in new ES6 spec draft

Mathias Bynens mathias at qiwi.be
Wed Jul 18 00:58:31 PDT 2012


On Tue, Jul 17, 2012 at 10:23 PM, Norbert Lindenberg
<ecmascript at norbertlindenberg.com> wrote:
>> To further clarify my position: I don't currently agree with Norbert's assertion WRT "situations". For more discussion see
>> https://bugs.ecmascript.org/show_bug.cgi?id=469
>> https://bugs.ecmascript.org/show_bug.cgi?id=525
>>
>> The current spec draft (except where RegExp still needs updating) treats \ud800\udc00, \u{d800}\u{dc00}, and \u{10000} as equivalent in all literal situations. As currently spec'ed, explicit UTF-16 escape sequences such as \ud800\udc00 are not decoded as a single code point in non-literal contexts such as identifiers. Such sequences currently generate errors in existing implementations, so there aren't any backwards-compatibility issues.
>>
>> I'm taking this position because I want to discourage programmers from hand-encoding UTF-16 and have them use \u{} instead to express actual code points for supplementary characters. For backwards compat, \uDnnn\uDnnn sequences need to be recognized in literals, but there is no need to allow them where they have not been allowed in existing implementations.
>
> It's not a backwards compatibility issue.
>
> It's an issue with having to explain to developers that sometimes \u{10000} and \ud800\udc00 are equivalent, and sometimes they're not; the kind of inconsistency that makes it more difficult to understand the language.
>
> And it may be an issue with tools that convert non-ASCII characters into (old-style) Unicode escapes.

As stated in https://bugs.ecmascript.org/show_bug.cgi?id=469, I agree
with Norbert’s sentiments here. With ES 5.1 it’s perfectly possible to
make a list of all Unicode characters that are allowed/disallowed in
IdentifierStart or IdentifierPart. So, you could say:

    The non-BMP symbol U+2F800 CJK Compatibility Ideograph (`丽`) is
    disallowed in identifier names, even though it’s in the [Lo] category.
    var \uD87E\uDC00; // SyntaxError, as the surrogate halves don’t
                      // match any of the allowed Unicode categories
    var 丽; // SyntaxError, as this is equivalent to the above
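
For reference, the surrogate halves \uD87E and \uDC00 used above follow from the standard UTF-16 encoding arithmetic. A minimal sketch; the helper names `toSurrogatePair` and `toHex` are hypothetical and only serve this illustration:

```javascript
// Hypothetical helper (not part of any spec or library): split a
// supplementary code point into its UTF-16 high and low surrogate halves.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("not a supplementary code point");
  }
  var offset = codePoint - 0x10000;      // 20-bit value
  var high = 0xD800 + (offset >> 10);    // top 10 bits
  var low = 0xDC00 + (offset & 0x3FF);   // bottom 10 bits
  return [high, low];
}

// Hypothetical helper: format a code unit as uppercase hex.
function toHex(unit) {
  return unit.toString(16).toUpperCase();
}

// U+2F800 splits into the halves D87E and DC00.
var pair = toSurrogatePair(0x2F800).map(toHex);
```

Here `pair` holds the two halves that appear in the escaped identifier above.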

The latest ES 6 draft makes this far more complicated:

    The non-BMP symbol U+2F800 CJK Compatibility Ideograph (`丽`) is
    disallowed in identifier names only if it’s written out using simple
    Unicode escape sequences for the separate surrogate halves:
    var \uD87E\uDC00; // SyntaxError, as the surrogate halves don’t
                      // match any of the allowed Unicode categories
    var 丽; // allowed
    var \u{2F800}; // allowed

It’s no longer possible to say “this symbol is allowed/disallowed” as
it depends on the way the symbol is represented.
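
By contrast, in string literals the different spellings are interchangeable. A small sketch of that literal-context equivalence, assuming an engine that supports \u{...} escapes in string literals as well as String.fromCodePoint and String.prototype.codePointAt; the variable names are illustrative only:

```javascript
// Three ways to produce the same string value for U+2F800.
var escapedHalves = "\uD87E\uDC00";              // old-style surrogate escapes
var codePointEscape = "\u{2F800}";               // code point escape
var fromNumber = String.fromCodePoint(0x2F800);  // built from the number

// All three compare equal as strings.
var allEqual = escapedHalves === codePointEscape &&
               codePointEscape === fromNumber;

// The string is two UTF-16 code units long but holds one code point.
var utf16Length = escapedHalves.length;
var codePoint = escapedHalves.codePointAt(0);
```

The inconsistency described above is that this equivalence holds for literals but not for identifier names.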

Please consider fixing this inconsistency.

