New full Unicode for ES6 idea
barraclough at apple.com
Mon Feb 20 12:21:54 PST 2012
On Feb 20, 2012, at 8:37 AM, Brendan Eich wrote:
> BRS makes 21-bit chars, so just as String.prototype.charCodeAt returns a code point, String.fromCharCode takes actual code point arguments.
> Again I'd reject (dynamically in the case of String.fromCharCode) any in [0xd800, 0xdfff]. Other code points that are not characters I'd let through to future-proof, but not these reserved ones. Also any > 0x10ffff.
Okay, gotcha – so to clarify, once the BRS is thrown, it should be impossible to create a string in which any individual element is an unassigned code point (e.g. an unpaired UTF-16 surrogate) – all string elements should be valid Unicode characters, right? (Or maybe a slightly weaker form of this: all string elements must be code points in the ranges 0...0xD7FF or 0xE000...0x10FFFF?)
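To make the weaker form concrete, here is a minimal sketch of the range check it implies. The names isAllowedCodePoint and strictFromCodePoints are hypothetical and not part of any proposal. The sketch assumes an engine that provides String.fromCodePoint, so the result is an ordinary UTF-16 string rather than a BRS-style string of 21-bit elements.

```javascript
// Hypothetical helper: true only for code points in 0...0xD7FF or
// 0xE000...0x10FFFF, i.e. everything except surrogates and values
// above 0x10FFFF.
function isAllowedCodePoint(cp) {
  return Number.isInteger(cp) &&
    ((cp >= 0 && cp <= 0xD7FF) || (cp >= 0xE000 && cp <= 0x10FFFF));
}

// Hypothetical strict variant of String.fromCharCode: rejects dynamically
// (by throwing) instead of letting a reserved value into the string.
function strictFromCodePoints(...cps) {
  for (const cp of cps) {
    if (!isAllowedCodePoint(cp)) {
      throw new RangeError("rejected code point: " + String(cp));
    }
  }
  return String.fromCodePoint(...cps);
}
```

Under these assumptions, strictFromCodePoints(0x41, 0x10000) succeeds, while strictFromCodePoints(0xD800) and strictFromCodePoints(0x110000) both throw.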
> Implementations that use uint16 vectors as the character data representation type for both "UCS-2" and "UTF-16" string variants would probably want another flag bit per string header indicating whether, for the UTF-16 case, the string indeed contained any non-BMP characters. If not, no proxy/copy needed.
If I understand your original proposal, you propose that UCS-2 strings coming from other sources be proxied to be iterated by Unicode characters (e.g. if the DOM returns a string containing the code units "\uD800\uDC00" then JS code executing in a context with the BRS set will see this as having length 1, right?) If so, do you propose any special handling for access to unassigned Unicode code points in UCS-2 strings returned from the DOM (or accessed from another global object, where the BRS is not set)?
var ucs2d800 = foo(); // get a string containing "\uD800" from the DOM, or another global object in BRS=off mode;
var ucs2dc00 = bar(); // get a string containing "\uDC00" from the DOM, or another global object in BRS=off mode;
var a = ucs2d800;
var b = ucs2d800.charCodeAt(0);
var c = ucs2d800 + ucs2dc00;
var c0 = c.charCodeAt(0);
var c1 = c.charCodeAt(1);
If the proxy is to behave as if the UCS-2 string has been converted to a valid Unicode string, then I'm guessing that conversion should have converted the unmatched surrogates in the UCS-2 string into Unicode replacement characters? – if so, the length of c in the above example would be 2, and the values c0 & c1 would both be 0xFFFD?
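A rough sketch of the conversion I am guessing at, with toCodePoints as a hypothetical stand-in for whatever the proxy would do (it is not part of the proposal). It decodes 16-bit code units into an array of code points and substitutes 0xFFFD for each unmatched surrogate. It also shows why the order of operations matters for c: the two lone surrogates form a valid pair if the raw code units are concatenated before conversion.

```javascript
// Hypothetical conversion: UTF-16 code units in, array of code points out.
// Matched surrogate pairs are combined; unmatched surrogates become U+FFFD.
function toCodePoints(ucs2) {
  const out = [];
  for (let i = 0; i < ucs2.length; i++) {
    const unit = ucs2.charCodeAt(i);
    const isLead = unit >= 0xD800 && unit <= 0xDBFF;
    const isTrail = unit >= 0xDC00 && unit <= 0xDFFF;
    if (isLead && i + 1 < ucs2.length) {
      const next = ucs2.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        // Matched pair: combine into a single supplementary code point.
        out.push(0x10000 + ((unit - 0xD800) << 10) + (next - 0xDC00));
        i++; // skip the trail surrogate just consumed
        continue;
      }
    }
    out.push(isLead || isTrail ? 0xFFFD : unit);
  }
  return out;
}

// Stand-ins for the strings returned by foo() and bar() above.
const ucs2d800 = "\uD800";
const ucs2dc00 = "\uDC00";

// Convert each operand first, then concatenate: two replacement characters.
const convertedFirst = toCodePoints(ucs2d800).concat(toCodePoints(ucs2dc00));

// Concatenate the raw code units first, then convert: one code point, 0x10000.
const concatenatedFirst = toCodePoints(ucs2d800 + ucs2dc00);
```

With conversion applied per string, convertedFirst has length 2 and both elements are 0xFFFD, matching the guess above. With conversion applied after concatenation, concatenatedFirst has length 1 and holds 0x10000, so the two readings are observably different.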