New full Unicode for ES6 idea

Andrew Oakley andrew at ado.is-a-geek.net
Tue Feb 21 05:11:49 PST 2012


On 02/20/12 16:47, Brendan Eich wrote:
> Andrew Oakley wrote:
>> Issues only arise in code that tries to treat a string as an array of
>> 16-bit integers, and I don't think we should be particularly bothered by
>> performance of code which misuses strings in this fashion (but clearly
>> this should still work without opt-in to new string handling).
> 
> This is all strings in JS and the DOM, today.
> 
> That is, we do not have any measure of code that treats strings as
> uint16s, forges strings using "\uXXXX", etc. but the ES and DOM specs
> have allowed this for > 14 years. Based on bitter experience, it's
> likely that if we change by fiat to 21-bit code points from 16-bit code
> units, some code on the Web will break.

Sorry, I don't think I was particularly clear.  The point I was trying
to make is that we can *pretend* that strings are sequences of 16-bit
code units while actually using a 21-bit representation internally.  If
content requests proper Unicode support, we simply switch to exposing
21-bit code points and stop encoding characters outside the BMP as
surrogate pairs (because each character now fits in a single string
element).
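
As a rough illustration of the idea (a sketch only, not how any engine
actually does it): the string is held as an array of code points, and the
view handed to content depends on whether full Unicode was requested.
The names toCodePoints, elements and fullUnicode are made up for this
example.

```javascript
// Hypothetical sketch: one internal representation, two external views.

// Internal form: one array entry per code point (up to 21 bits each).
function toCodePoints(text) {
  // Array.from iterates a string by code point, so a well-formed
  // surrogate pair arrives here as a single two-unit string.
  return Array.from(text, (ch) => ch.codePointAt(0));
}

// External form: code points if full Unicode was requested,
// otherwise the 16-bit code units that existing content expects.
function elements(codePoints, fullUnicode) {
  if (fullUnicode) {
    return codePoints.slice();
  }
  const units = [];
  for (const cp of codePoints) {
    if (cp > 0xffff) {
      // Outside the BMP: split into a high and a low surrogate.
      const offset = cp - 0x10000;
      units.push(0xd800 + (offset >> 10), 0xdc00 + (offset & 0x3ff));
    } else {
      units.push(cp);
    }
  }
  return units;
}

const storedText = toCodePoints("a\u{1D306}");
const legacyView = elements(storedText, false); // 3 elements: a, D834, DF06
const fullView = elements(storedText, true);    // 2 elements: a, 1D306
```

Content that never opts in sees the same lengths and surrogate values
it sees today; only the opted-in view reports U+1D306 as one element.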

> And as noted in the o.p. and in the thread based on Allen's proposal
> last year, browser implementations definitely count on representation
> via array of 16-bit integers, with length property or method counting same.
> 
> Breaking the Web is off the table. Breaking implementations, less so.
> I'm not sure why you bring up UTF-8. It's good for encoding and decoding
> but for JS, unlike C, we want string to be a high level "full Unicode"
> abstraction. Not bytes with bits optionally set indicating more bytes
> follow to spell code points.

Yes, I probably shouldn't have brought up UTF-8 (we do store strings
using UTF-8; I was thinking about our own implementation).  The intention
was not to "break the web": my comments about issues when strings are
misused were purely *performance* concerns.  Behaviour would otherwise
remain unchanged (unless full Unicode support had been enabled).
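
To make the performance point concrete, here is a minimal sketch,
assuming the string is stored as an array of code points and no index
cache is kept.  codeUnitAt is a hypothetical helper, not a real API.
Reading the code unit at a given index gives the same answer as
charCodeAt does today, but it has to walk the stored code points to find
the right position.

```javascript
// Hypothetical helper: read the 16-bit code unit at `index` from a
// string stored as code points. Also reports how many stored code
// points were visited, to show the cost of the lookup.
function codeUnitAt(codePoints, index) {
  let pos = 0;   // code-unit position of the current code point
  let steps = 0; // code points visited so far
  if (index < 0) {
    return { unit: NaN, steps };
  }
  for (const cp of codePoints) {
    steps++;
    const width = cp > 0xffff ? 2 : 1; // code units this code point occupies
    if (index < pos + width) {
      if (width === 1) {
        return { unit: cp, steps };
      }
      const offset = cp - 0x10000;
      const unit = index === pos
        ? 0xd800 + (offset >> 10)     // high surrogate
        : 0xdc00 + (offset & 0x3ff);  // low surrogate
      return { unit, steps };
    }
    pos += width;
  }
  return { unit: NaN, steps }; // out of range, like charCodeAt
}

const sample = "x\u{1F600}y";
const internal = Array.from(sample, (ch) => ch.codePointAt(0));

// Behaviour matches today's strings at every code-unit index.
const sameAsToday = [0, 1, 2, 3].every(
  (i) => codeUnitAt(internal, i).unit === sample.charCodeAt(i)
);

// The last index needs a walk over all three stored code points.
const lastLookup = codeUnitAt(internal, 3);
```

The result is identical to current behaviour; only the lookup cost
differs, and only for code that indexes strings as 16-bit integers.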

-- 
Andrew Oakley


More information about the es-discuss mailing list