New full Unicode for ES6 idea

Brendan Eich brendan at mozilla.com
Mon Feb 20 08:37:24 PST 2012


Gavin Barraclough wrote:
>>
>> What it might do, however, is eliminate the ambiguity about the 
>> intended meaning of  "\uD800\uDc00" in legacy code.  If "full unicode 
>> string mode" only supported \u{} escapes then existing code that uses 
>> \uXXXX would have to be updated before it could be used in that mode. 
>>  That might be a good thing.
>
> Ah, this is a good point.  I was going to ask whether it would be 
> inconsistent to deprecate \uXXXX but not \xXX, since both could just 
> be considered shorthand for \u{...}, but this is a good practical 
> reason why it matters more for \uXXXX (and I can imagine there may be 
> complaints if we take \xXX away!).

Yes. "\xXX" is innocuous, since ISO 8859-1 is a proper subset of Unicode 
and can't be used to forge surrogate pair halves.

> So, just to clarify,
> var s1 = "\u{0d800}\u{0dc00}";
> var s2 = String.fromCharCode(0xd800) + String.fromCharCode(0xdc00);
> s1.length === 2; // true
> s2.length === 2; // true
> s1 === s2; // true
> Does this sound like the expected behavior?

Rather, I'd statically reject the invalid code points.

> Also, what would happen to String.fromCharCode?

The BRS ("Big Red Switch") makes chars 21 bits wide, so just as 
String.prototype.charCodeAt returns a code point, String.fromCharCode 
takes actual code point arguments.

Again I'd reject (dynamically in the case of String.fromCharCode) any 
code point in [0xd800, 0xdfff]. Other code points that are not characters 
I'd let through to future-proof, but not these reserved ones. I'd also 
reject any value > 0x10ffff.
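
A minimal sketch of the argument checks described above, for illustration 
only. The names checkedCodePoint and fromCodePointsStrict are hypothetical 
and not part of any proposal in this thread. Rejecting non-integer input is 
an extra assumption, since the message does not say how such values should 
be coerced. Today's engines still store strings as UTF-16, so the sketch 
models only the validation and leans on String.fromCodePoint to build the 
result.

```javascript
// Hypothetical helper: validates one argument the way the message suggests
// a BRS-enabled String.fromCharCode might.
function checkedCodePoint(value) {
  const cp = Number(value);
  // Assumption: non-integers are rejected rather than coerced.
  if (!Number.isInteger(cp) || cp < 0 || cp > 0x10ffff) {
    throw new RangeError("not a Unicode code point: " + String(value));
  }
  // Reserved surrogate range [0xD800, 0xDFFF] is rejected.
  if (cp >= 0xd800 && cp <= 0xdfff) {
    throw new RangeError("surrogate code point rejected: 0x" + cp.toString(16));
  }
  // Other non-character code points (for example 0xFFFE) are let through.
  return cp;
}

// Hypothetical stand-in for String.fromCharCode with the switch thrown.
function fromCodePointsStrict(...values) {
  return String.fromCodePoint(...values.map(checkedCodePoint));
}
```

Under these assumptions, fromCodePointsStrict(0x41, 0x1f600) succeeds, 
while 0xd800, 0xdfff and 0x110000 each raise a RangeError.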

> 1) Leave this unchanged, it would continue to truncate the input with 
> ToUint16?

No, that violates the BRS intent.

> 2) Change its behavior to allow any code point (maybe switch to 
> ToUint32, or ToInteger, and throw a RangeError for input > 0x10FFFF?).

The last.

> 3) Make it sensitive to the state of the corresponding global object's 
> BRS.

In any event, yes: this. The BRS is a switch, you can think of it as 
swapping in the other String implementation, or as a flag tested within 
one shared String implementation whose methods use if statements (which 
could be messy but would work).
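
The "flag tested within one shared implementation" option can be sketched 
roughly as follows. This is only a model: makeFromCharCode and its 
fullUnicode parameter are hypothetical stand-ins for a per-global BRS, 
which no engine exposes, and the BRS-on branch uses String.fromCodePoint 
because today's strings are still UTF-16.

```javascript
// Hypothetical factory: returns a fromCharCode whose behavior depends on a
// flag standing in for the global's BRS setting.
function makeFromCharCode(fullUnicode) {
  return function fromCharCode(...codes) {
    if (!fullUnicode) {
      // BRS off: legacy behavior, each argument is truncated with ToUint16.
      return String.fromCharCode(...codes);
    }
    // BRS on: arguments are code points; surrogates and out-of-range
    // values are rejected dynamically.
    for (const cp of codes) {
      const inRange = Number.isInteger(cp) && cp >= 0 && cp <= 0x10ffff;
      if (!inRange || (cp >= 0xd800 && cp <= 0xdfff)) {
        throw new RangeError("invalid code point: " + String(cp));
      }
    }
    return String.fromCodePoint(...codes);
  };
}

const legacyFromCharCode = makeFromCharCode(false);
const fullFromCharCode = makeFromCharCode(true);

const truncated = legacyFromCharCode(0x1f600); // same as "\uF600"
const astral = fullFromCharCode(0x1f600);      // U+1F600
```

The same call yields different strings depending on the flag, which is why 
the change is tied to the switch rather than made unconditionally.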

We should specify carefully the identity or lack of identity of 
myGlobal.String and otherGlobalWithPossiblyDifferentBRSSetting.String, 
etc. Consider this one-line .html file:

<iframe src="javascript:alert(parent.String === String)"/>

I get false from Chrome, Firefox and Safari, as expected. So the BRS 
could swap in another String, or simply mutate hidden state associated 
with the global in question (as mentioned in my previous post, globals 
keep track of the original values of their built-ins' prototypes, so 
implementations could put the BRS in String or String.prototype too, and 
use random logic instead of separate objects).

> If we were to leave it unchanged, using ToUint16, then I guess we 
> would need a new String.fromCodePoint function, to be able to 
> create strings for non-BMP characters?

This goes against the BRS design and falls down the Java slippery slope. 
We want one set of standard methods, extended from 16- to 21-bit chars, 
er, code points.

> I guess my preference here would be to go with option 3 – tie the 
> potentially breaking change to the BRS, but no need for new interface.

Definitely! That's perhaps unclear in my o.p. but I made a to-do out of 
rejecting Java and keeping the duplicate methods or hidden if statements 
under the "implementation hood" ("bonnet" for you ;-).

/be

