Unicode support in new ES6 spec draft

Allen Wirfs-Brock allen at wirfs-brock.com
Mon Jul 16 16:41:34 PDT 2012


On Jul 16, 2012, at 2:54 PM, Gillam, Richard wrote:

> Commenting on Norbert's comments…
> ...
>> Careful here. I think we have to treat \uDxxx\uDyyy, where 0x800 ≤ xxx < 0xC00 ≤ yyy ≤ 0xFFF, as a single code point in all situations. There are tools around that convert any non-ASCII characters into (old-style) Unicode escapes.
> 
> What I was getting, perhaps erroneously, from Allen's comments is that \ud800\udc00, \u{d800}\u{dc00}, and \u{10000} are all equivalent, all the time, and in all situations where the implementation assigns meaning to the characters, they all refer to the character U+10000.  This seems like the right way to go.

To further clarify my position: I don't currently agree with Norbert's assertion with respect to "situations". For more discussion see
https://bugs.ecmascript.org/show_bug.cgi?id=469 
https://bugs.ecmascript.org/show_bug.cgi?id=525 

The current spec draft (except where RegExp still needs updating) treats \ud800\udc00, \u{d800}\u{dc00}, and \u{10000} as equivalent in all literal situations. As currently spec'ed, explicit UTF-16 escape sequences such as \ud800\udc00 are not decoded as a single code point in non-literal contexts such as identifiers. Such sequences currently generate errors in existing implementations, so there aren't any backwards compatibility issues.
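To illustrate the literal equivalence being described, here is a small sketch (as it behaves in engines that have since shipped ES6 semantics):

```javascript
// In string literals, all three spellings denote the same
// two-code-unit string for U+10000:
var a = "\u{10000}";        // code point escape
var b = "\ud800\udc00";     // explicit UTF-16 surrogate pair
var c = "\u{d800}\u{dc00}"; // code point escapes for each surrogate

a === b;   // true
a === c;   // true
a.length;  // 2 -- length counts UTF-16 code units, not code points
```

In non-literal contexts such as identifiers, the draft being discussed would instead reject the explicit surrogate-pair spelling.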

I'm taking this position because I want to discourage programmers from hand-encoding UTF-16 and instead encourage them to use \u{} to express actual code points for supplementary characters. For backwards compat, \uDnnn\uDnnn sequences need to be recognized in literals, but there is no need to allow them where they have not been allowed in existing implementations.

> 
>> fromCodeUnit seems rather redundant. Note that any code unit sequence it accepts would be equally accepted, with the same result, by fromCodePoints, as that function accepts surrogate code points and then, in the conversion to UTF-16, erases the distinction between surrogate code points and surrogate code units.
> 
> But fromCodeUnit() wouldn't let you pass in values above \xffff, would it?

Depends upon how it is spec'ed. The legacy String.fromCharCode clamps values using ToUint16 but does not reject them. The current spec for fromCodePoint throws for values > 0x10FFFF.
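A quick sketch of that difference in behavior (as both methods were eventually shipped):

```javascript
// ToUint16 keeps only the low 16 bits, so out-of-range arguments
// silently wrap rather than erroring:
String.fromCharCode(0x10061); // "a" -- 0x10061 & 0xFFFF === 0x61

// fromCodePoint accepts the full code point range and produces the
// surrogate pair for supplementary characters:
String.fromCodePoint(0x10000); // "\ud800\udc00"

// ...but throws for values above the Unicode range:
// String.fromCodePoint(0x110000); // RangeError
```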

> 
>> codeUnitAt is clearly a better name than charAt (especially in a language that doesn't have a char data type), but since we can't get rid of charAt, I'm not sure it's worth adding codeUnitAt.
> 
> I still think it helps.

Note that charAt returns a string value, not a numeric code unit value; charCodeAt is what retrieves the numeric code unit value at a specific string index position. The proposed codePointAt likewise returns a numeric code point. What is missing from this discussion is a method that returns a string value and correctly interprets surrogate pairs.

...

Allen


