New full Unicode for ES6 idea

Brendan Eich brendan at mozilla.org
Tue Feb 21 07:30:04 PST 2012


Andrew Oakley wrote:
> On 02/20/12 16:47, Brendan Eich wrote:
>> Andrew Oakley wrote:
>>> Issues only arise in code that tries to treat a string as an array of
>>> 16-bit integers, and I don't think we should be particularly bothered by
>>> performance of code which misuses strings in this fashion (but clearly
>>> this should still work without opt-in to new string handling).
>>
>> This is all strings in JS and the DOM, today.
>>
>> That is, we do not have any measure of code that treats strings as
>> uint16s, forges strings using "\uXXXX", etc. but the ES and DOM specs
>> have allowed this for > 14 years. Based on bitter experience, it's
>> likely that if we change by fiat to 21-bit code points from 16-bit code
>> units, some code on the Web will break.
>
> Sorry, I don't think I was particularly clear.  The point I was trying
> to make is that we can *pretend* that code points are 16-bit but
> actually use a 21-bit representation internally.

So far, that's like Allen's proposal from last year 
(http://wiki.ecmascript.org/doku.php?id=strawman:support_full_unicode_in_strings). 
But you didn't say how iteration (indexing and .length) works.

> If content requests
> proper Unicode support we simply switch to allowing 21-bit code-points
> and stop encoding characters outside the BMP using surrogate pairs
> (because the characters now fit in a single code point).

How does content request proper Unicode support? Whatever that gesture 
is, it's big and red ;-). But we don't have such a switch or button to 
press like that, yet.

If a .js or .html file as fetched from a server has a UTF-8 encoding, 
indeed non-BMP characters in string literals will be transcoded in 
open-source browsers and JS engines that use uint16 vectors internally, 
but each part of the surrogate pair will take up one element in the 
uint16 vector. Let's take this now as a "content request" to use full 
Unicode. But the .js file was developed 8 years ago and assumes two code 
units, not one. It hardcodes for that assumption, somehow (indexing, 
.length exact value, indexOf('\ud800'), etc.). It is now broken.
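To make the kind of hardcoded assumption concrete, here is a minimal sketch of what such legacy code might look like. The variable and function names (clef, describeCodeUnits, leadIndex) are hypothetical and chosen only for illustration; the results noted in the comments are those of today's 16-bit code unit semantics.

```javascript
// U+1D11E MUSICAL SYMBOL G CLEF, written the only way ES5 allows:
// as a surrogate pair of two 16-bit code units.
var clef = "\uD834\uDD1E";

// Hypothetical legacy helper: walks the string one code unit at a time
// and returns each unit as a hex string.
function describeCodeUnits(s) {
  var units = [];
  for (var i = 0; i < s.length; i++) {
    units.push(s.charCodeAt(i).toString(16));
  }
  return units;
}

// With 16-bit code units, one non-BMP character occupies two elements.
var units = describeCodeUnits(clef); // two entries: "d834" and "dd1e"

// Legacy code may also search for a lead surrogate directly.
var leadIndex = ("a" + clef).indexOf("\uD834"); // 1 today

// If strings silently switched to 21-bit code points, clef.length would
// presumably become 1 and the indexOf search above would find nothing,
// so any code depending on these values would change behavior.
```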

And non-literal non-BMP characters won't be helped by transcoding 
differently when the .js or .html file is fetched. They'll just change 
"size" at runtime.
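As an illustration of the non-literal case, the sketch below builds a non-BMP character at runtime rather than from a string literal, so no transcoding at fetch time ever sees it. The helper name toSurrogatePair is hypothetical; the arithmetic is the standard UTF-16 surrogate encoding.

```javascript
// Hypothetical helper: encode a code point above U+FFFF as a UTF-16
// surrogate pair and return it as a two-code-unit string.
function toSurrogatePair(codePoint) {
  var offset = codePoint - 0x10000;
  var lead = 0xD800 + (offset >> 10);    // high (lead) surrogate
  var trail = 0xDC00 + (offset & 0x3FF); // low (trail) surrogate
  return String.fromCharCode(lead, trail);
}

// Built at runtime from a number, not from a literal in the source file.
var built = toSurrogatePair(0x1D11E);

// Under today's semantics this has length 2 and equals the literal
// "\uD834\uDD1E". Under a 21-bit regime, whether the two units stay
// separate or fuse into one element is exactly the open question.
var builtLength = built.length;
```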

/be


