New full Unicode for ES6 idea

Gavin Barraclough barraclough at
Mon Feb 20 12:21:54 PST 2012

On Feb 20, 2012, at 8:37 AM, Brendan Eich wrote:
> BRS makes 21-bit chars, so just as String.prototype.charCodeAt returns a code point, String.fromCharCode takes actual code point arguments.
> Again I'd reject (dynamically in the case of String.fromCharCode) any in [0xd800, 0xdfff]. Other code points that are not characters I'd let through to future-proof, but not these reserved ones. Also any > 0x10ffff.

Okay, gotcha. So to clarify: once the BRS is thrown, it should be impossible to create a string in which any individual element is an unassigned code point (e.g. an unpaired UTF-16 surrogate), and all string elements should be valid Unicode characters, right? (Or maybe a slightly weaker form of this: all string elements must be code points in the ranges 0...0xD7FF or 0xE000...0x10FFFF?)
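
To make the weaker form concrete, here is a minimal sketch of the range check it implies. The names isScalarValue and checkedCodePoints are hypothetical and not part of any proposal; the second function only stands in for a BRS-on String.fromCharCode that rejects bad arguments dynamically, and it returns the checked code points rather than building a string.

```javascript
// Hypothetical helper, not part of any proposal: true if cp is an integer
// in the ranges 0...0xD7FF or 0xE000...0x10FFFF.
function isScalarValue(cp) {
  return Number.isInteger(cp) &&
    ((cp >= 0 && cp <= 0xD7FF) || (cp >= 0xE000 && cp <= 0x10FFFF));
}

// Hypothetical stand-in for a BRS-on String.fromCharCode: throws on any
// surrogate or out-of-range argument, otherwise returns the code points.
function checkedCodePoints(...cps) {
  for (const cp of cps) {
    if (!isScalarValue(cp)) {
      throw new RangeError("invalid code point: " + cp);
    }
  }
  return cps;
}
```

Under this sketch, checkedCodePoints(0x41, 0x10000) is accepted, while checkedCodePoints(0xD800) and checkedCodePoints(0x110000) both throw.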

> Implementations that use uint16 vectors as the character data representation type for both "UCS-2" and "UTF-16" string variants would probably want another flag bit per string header indicating whether, for the UTF-16 case, the string indeed contained any non-BMP characters. If not, no proxy/copy needed.

If I understand your original proposal, you propose that UCS-2 strings coming from other sources be proxied to be iterated by Unicode characters (e.g. if the DOM returns a string containing the code units "\uD800\uDC00" then JS code executing in a context with the BRS set will see this as having length 1, right?) If so, do you propose any special handling for access to unassigned Unicode code points in UCS-2 strings returned from the DOM (or accessed from another global object, where the BRS is not set)?

	var ucs2d800 = foo(); // get a string containing "\uD800" from the DOM, or another global object in BRS=off mode;
	var ucs2dc00 = bar(); // get a string containing "\uDC00" from the DOM, or another global object in BRS=off mode;
	var a = ucs2d800[0];
	var b = ucs2d800.charCodeAt(0);
	var c = ucs2d800 + ucs2dc00;
	var c0 = c.charCodeAt(0);
	var c1 = c.charCodeAt(1);

If the proxy is to behave as if the UCS-2 string has been converted to a valid Unicode string, then I'm guessing that the conversion should have turned the unmatched surrogates in the UCS-2 string into Unicode replacement characters? If so, the length of c in the above example would be 2, and the values of c0 and c1 would both be 0xFFFD?
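
A rough sketch of the conversion I have in mind, assuming strings are modelled as plain arrays of 16-bit code units (toCodePoints is a hypothetical name, not anything from the proposal). It also shows why the order matters: converting each operand before concatenating gives two replacement characters, whereas concatenating the raw code units first would give a single valid surrogate pair.

```javascript
// Hypothetical conversion, for illustration only: decode an array of 16-bit
// code units into an array of code points, pairing surrogates where possible
// and mapping each unmatched surrogate to U+FFFD.
function toCodePoints(units) {
  const out = [];
  for (let i = 0; i < units.length; i++) {
    const u = units[i];
    const next = units[i + 1]; // undefined at the end, so the pair test fails
    if (u >= 0xD800 && u <= 0xDBFF && next >= 0xDC00 && next <= 0xDFFF) {
      // Lead surrogate followed by a trail surrogate: combine into one code point.
      out.push(0x10000 + ((u - 0xD800) << 10) + (next - 0xDC00));
      i++;
    } else if (u >= 0xD800 && u <= 0xDFFF) {
      // Unmatched surrogate: substitute the replacement character.
      out.push(0xFFFD);
    } else {
      out.push(u);
    }
  }
  return out;
}

const ucs2d800 = [0xD800];
const ucs2dc00 = [0xDC00];

// Convert each operand first, then concatenate (the reading guessed at above).
const c = toCodePoints(ucs2d800).concat(toCodePoints(ucs2dc00));
// c is [0xFFFD, 0xFFFD]

// Concatenate the raw code units first, then convert.
const d = toCodePoints(ucs2d800.concat(ucs2dc00));
// d is [0x10000]
```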



More information about the es-discuss mailing list