New full Unicode for ES6 idea

Brendan Eich brendan at mozilla.com
Sun Feb 19 14:44:25 PST 2012


Allen Wirfs-Brock wrote:
> On Feb 19, 2012, at 2:15 PM, Brendan Eich wrote:
>> I'm not a Unicode expert but I believe the latter is called "character". 
>
> Me neither, but I believe the correct term is "code point" which refers to the full 21-bit code while "Unicode character" is the logical entity corresponding to that code point. That usage of "character" is different from the current usage within ECMAScript where "character" is what we call the elements of the vector of 16-bit numbers that are used to represent a String value. You can access them as string values of length 1 via [ ] or as numeric values via the charCodeAt method.

Thanks. We have a confusing transposition of terms between Unicode and 
ECMA-262, it seems. Should we fix it?
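
To make the mismatch concrete, here is a minimal sketch using only the 
ES5 string operations mentioned above. The sample string is an 
illustration chosen for this note: U+1D11E (MUSICAL SYMBOL G CLEF) is one 
code point in Unicode terms, but two "characters" in ECMA-262 terms.

```javascript
// U+1D11E written as two uint16 escapes (a surrogate pair).
var clef = "\uD834\uDD1E";

// ECMA-262 "characters" are the 16-bit elements, so length is 2.
var units = clef.length;

// [ ] yields a string of length 1 holding one 16-bit element.
var first = clef[0];

// charCodeAt yields the same elements as numbers.
var hi = clef.charCodeAt(0);   // 0xD834, high surrogate
var lo = clef.charCodeAt(1);   // 0xDD1E, low surrogate
```

Neither half is a Unicode character on its own, which is where the two 
vocabularies part ways.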

>> JS must keep the "\uXXXX" notation for uint16 storage units, and one can create invalid Unicode strings already. This hazard does not go away, we keep compatibility, but the BRS adds no new hazards and in practice, if well-used, should reduce the incidence of invalid-Unicode-string bugs.
>>
>> The "\u{...}" notation is independent and should work whatever the BRS setting, IMHO. In "UCS-2" (default) setting, "\u{...}" can make pairs. In "UTF-16" setting, it makes only characters. And of course in the latter case indexing and length count characters.
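
A sketch of what "can make pairs" would mean in the default setting, 
assuming the standard UTF-16 encoding arithmetic. Both helper names 
(toUTF16, hasLoneSurrogate) are hypothetical and only for illustration, 
not proposed API. The second helper shows the existing hazard: "\uXXXX" 
can spell an unpaired surrogate.

```javascript
// Hypothetical helper: the uint16 units that "\u{cp}" would denote
// in the default (16-bit) setting.
function toUTF16(cp) {
  if (cp < 0 || cp > 0x10FFFF) throw new RangeError("not a code point");
  if (cp < 0x10000) return String.fromCharCode(cp);   // BMP: one unit
  var v = cp - 0x10000;                               // 20 bits remain
  return String.fromCharCode(0xD800 + (v >> 10),      // high surrogate
                             0xDC00 + (v & 0x3FF));   // low surrogate
}

// Hypothetical check: does the string contain an unpaired surrogate,
// i.e. is it an invalid Unicode string?
function hasLoneSurrogate(s) {
  for (var i = 0; i < s.length; i++) {
    var c = s.charCodeAt(i);
    if (c >= 0xD800 && c <= 0xDBFF) {
      var next = s.charCodeAt(i + 1);       // NaN past the end
      if (next >= 0xDC00 && next <= 0xDFFF) { i++; continue; }
      return true;                          // high without low
    }
    if (c >= 0xDC00 && c <= 0xDFFF) return true;  // low without high
  }
  return false;
}
```

For example, toUTF16(0x1D11E) yields the two units "\uD834\uDD1E", while 
hasLoneSurrogate("\uD834") is true.
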
>
> I think your names for the BRS modes are misleading.

You got me; in fact I used "full Unicode" for the BRS-thrown setting 
elsewhere.

My implementor's bias is showing, because I expect many engines would 
use UTF-16 internally and have non-O(1) indexing for strings with the 
contains-non-BMP-and-BRS-set-to-full-Unicode flag bit.
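
To show why indexing stops being O(1) in that case, here is a sketch that 
assumes UTF-16 storage and a plain linear scan with no side index. The 
function name is hypothetical, and a real engine could well cache offsets 
or do something smarter.

```javascript
// Hypothetical: return the code point at character index n of a
// string stored as UTF-16, scanning from the start. O(n), not O(1).
function codePointAtIndex(s, n) {
  var i = 0;                       // position in 16-bit units
  while (i < s.length) {
    var c = s.charCodeAt(i), width = 1;
    if (c >= 0xD800 && c <= 0xDBFF && i + 1 < s.length) {
      var d = s.charCodeAt(i + 1);
      if (d >= 0xDC00 && d <= 0xDFFF) {
        c = 0x10000 + ((c - 0xD800) << 10) + (d - 0xDC00);
        width = 2;                 // a pair counts as one character
      }
    }
    if (n-- === 0) return c;
    i += width;
  }
  return undefined;                // index out of range
}
```

A string without the flag bit set has no pairs to step over, so it can 
skip the scan and index its storage directly.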

/be

