New full Unicode for ES6 idea

Gavin Barraclough barraclough at apple.com
Sun Feb 19 22:46:48 PST 2012


On Feb 19, 2012, at 10:05 PM, Allen Wirfs-Brock wrote:

>> Great post. I agree 3 is not good. I was thinking based on today's exchanges that the BRS being set to "full Unicode" *could* mean that "\uXXXX" is illegal and you *must* use "\u{...}" to write Unicode *code points* (not code units).
>> 
>> Last year we dispensed with the binary data hacking in strings use-case. I don't see the hardship. But rather than throw exceptions on concatenation I would simply eliminate the ability to spell code units with "\uXXXX" escapes. Who's with me?
> 
> I think we need to be careful not to equate the syntax of ES string literals with the actual encoding space of string elements. Whether you say "\ud800" or "\u{00d800}", or call a function that does full-unicode to UTF-16 encoding, or simply create a string from file contents you may end up with string elements containing upper or lower half surrogates. Eliminating the "\uXXXX" syntax really doesn't change anything regarding actual string processing.
> 
> What it might do, however, is eliminate the ambiguity about the intended meaning of "\uD800\uDc00" in legacy code. If "full unicode string mode" only supported \u{} escapes then existing code that uses \uXXXX would have to be updated before it could be used in that mode. That might be a good thing.

Ah, this is a good point.  I was going to ask whether it would be inconsistent to deprecate \uXXXX but not \xXX, since both could just be considered shorthand for \u{...}, but this is a good practical reason why it matters more for \uXXXX (and I can imagine there may be complaints if we take \xXX away!).

So, just to clarify,
	var s1 = "\u{0d800}\u{0dc00}";
	var s2 = String.fromCharCode(0xd800) + String.fromCharCode(0xdc00);
	s1.length === 2; // true
	s2.length === 2; // true
	s1 === s2; // true
Does this sound like the expected behavior?
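
To make the distinction behind this question concrete, here is a rough sketch that counts code points rather than string elements, assuming strings are sequences of 16-bit code units. The helper name countCodePoints is made up for illustration and is not part of any proposal in this thread.

```javascript
// Hypothetical helper: count code points in a string of 16-bit code units.
// A high surrogate immediately followed by a low surrogate counts once;
// an unpaired surrogate counts as one code point on its own.
function countCodePoints(s) {
  var count = 0;
  for (var i = 0; i < s.length; i++) {
    var unit = s.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < s.length) {
      var next = s.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // skip the low half of the pair
      }
    }
    count++;
  }
  return count;
}

var pair = String.fromCharCode(0xd800) + String.fromCharCode(0xdc00);
var elements = pair.length;             // number of string elements
var codePoints = countCodePoints(pair); // number of code points
```

With code-unit strings, the pair has two elements but stands for one code point, which is why it matters whether the two escapes are treated as two elements or combined into one.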

Also, what would happen to String.fromCharCode?

1) Leave this unchanged, so that it continues to truncate the input with ToUint16?
2) Change its behavior to allow any code point (maybe switch to ToUint32, or ToInteger, and throw a RangeError for input > 0x10FFFF?).
3) Make it sensitive to the state of the corresponding global object's BRS.
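
The difference between options 1 and 2 can be sketched roughly as below. This is only an illustration: toUint16, fromCharCodeTruncating and fromCodePointChecked are made-up names, and the sketch assumes an engine whose strings hold 16-bit code units, so a non-BMP result is spelled as a surrogate pair here, whereas in a full Unicode mode it would presumably be a single element.

```javascript
// Models ToUint16 for finite numbers by keeping the low 16 bits.
function toUint16(n) {
  return n & 0xFFFF;
}

// Option 1 (assumed to match existing behavior): truncate, then build the string.
function fromCharCodeTruncating(n) {
  return String.fromCharCode(toUint16(n));
}

// Option 2 (hypothetical): accept any code point, reject anything else.
function fromCodePointChecked(n) {
  if (typeof n !== "number" || Math.floor(n) !== n || n < 0 || n > 0x10FFFF) {
    throw new RangeError("Invalid code point: " + n);
  }
  if (n < 0x10000) {
    return String.fromCharCode(n);
  }
  // Assumption: code-unit strings, so encode as a UTF-16 surrogate pair.
  var offset = n - 0x10000;
  return String.fromCharCode(0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF));
}

var truncated = fromCharCodeTruncating(0x1F600); // same as "\uF600"
var checked = fromCodePointChecked(0x1F600);     // "\uD83D\uDE00" in this sketch
```

Under option 1 the input 0x1F600 silently becomes U+F600, while under option 2 it either yields the intended character or fails loudly for values above 0x10FFFF.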

If we were to leave it unchanged, using ToUint16, then I guess we would need a new String.fromCodePoint function, to be able to create strings for non-BMP characters?  Presumably we would then want a new String.codePointAt function, for symmetry?  This would also raise the question of what String.charCodeAt should return for code points outside of the Uint16 range – should it return the actual value, or ToUint16 of the code point, to mirror the truncation performed by fromCharCode?
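
As a rough illustration of the charCodeAt question, the sketch below computes both candidate answers for a non-BMP character. It assumes code-unit strings where the character is stored as a surrogate pair; codePointAtSketch is a made-up helper, not a proposed signature.

```javascript
// Hypothetical helper: read the code point starting at index i,
// combining a high/low surrogate pair when one is present.
function codePointAtSketch(s, i) {
  var hi = s.charCodeAt(i);
  if (hi >= 0xD800 && hi <= 0xDBFF && i + 1 < s.length) {
    var lo = s.charCodeAt(i + 1);
    if (lo >= 0xDC00 && lo <= 0xDFFF) {
      return ((hi - 0xD800) << 10) + (lo - 0xDC00) + 0x10000;
    }
  }
  return hi; // BMP character or unpaired surrogate
}

var face = "\uD83D\uDE00";                       // U+1F600 as a surrogate pair
var actualValue = codePointAtSketch(face, 0);    // candidate answer 1: 0x1F600
var truncatedValue = actualValue & 0xFFFF;       // candidate answer 2: ToUint16, 0xF600
```

The truncated answer round-trips through a ToUint16-based fromCharCode but names a different character (U+F600), so the two choices are not interchangeable.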

I guess my preference here would be to go with option 3 – tie the potentially breaking change to the BRS, with no need for a new interface.

cheers,
G.
