New full Unicode for ES6 idea
brendan at mozilla.com
Sun Feb 19 14:28:52 PST 2012
Phillips, Addison wrote:
> Why would converting the existing UCS-2 support to be UTF-16 not be a good idea? There is nothing intrinsically wrong that I can see with that approach and it would be the most compatible with existing scripts, with no special "modes", "flags", or interactions.
Allen proposed essentially this last year (the discussion was confused
by mixing observable-in-language semantics with
encoding/format/serialization issues, leading to talk of 32-bit
characters). As I wrote in the original post, this led to two
objections: a big implementation hit, and an incompatible change.
I tackled the second with the BRS and (in detail) mediation across DOM
window boundaries. This I believe takes the sting out of the first (a
lesser implementation change in light of the existing mediation at those
boundaries).
> Yes, the complexity of supplementary characters (i.e. non-BMP characters) represented as surrogate pairs must still be dealt with.
I'm not sure what you mean. JS today allows such surrogates (ignoring
invalid pairs), but they count as two indexes and add two to length, not
one. That is the first problem to fix (literal escape-notation is a
separate matter).
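To make the counting problem concrete, here is a minimal ES5-era sketch: a single non-BMP character, written as a surrogate pair, reports a length of 2 and is indexed as two separate uint16 units.

```javascript
// U+1D306 (TETRAGRAM FOR CENTRE) lies outside the BMP, so in today's
// semantics it is stored and counted as two UTF-16 code units.
var s = "\uD834\uDF06"; // surrogate pair encoding U+1D306

s.length;        // 2, not 1
s.charCodeAt(0); // 0xD834 (high surrogate)
s.charCodeAt(1); // 0xDF06 (low surrogate)
```

This is exactly the indexing-and-length accounting by uint16 storage units that full-Unicode strings would change.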
> It would also expose the possibility of invalid strings (with unpaired surrogates).
That problem exists today.
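A short illustration of that existing problem: ES5 strings happily hold a lone surrogate, which is not valid UTF-16.

```javascript
// An unpaired (lone) high surrogate is already representable today:
var lone = "\uD800"; // no trailing low surrogate follows

lone.length;        // 1 -- a well-formed JS string, but invalid UTF-16
lone.charCodeAt(0); // 0xD800
```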
> But this would not be unlike other programming languages--or even ES as it exists today.
Right! We should do better. As I noted, Node.js heavy hitters (mranney
of Voxer) testify that they want full Unicode, not what's specified
today with indexing and length-accounting by uint16 storage units.
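As a sketch of what full-Unicode accounting would report, here is a hypothetical helper (the name `codePointLength` is mine, not anything specified) that counts code points by pairing surrogates manually:

```javascript
// Count code points rather than uint16 storage units, treating a valid
// high/low surrogate pair as one character.
function codePointLength(s) {
  var n = 0;
  for (var i = 0; i < s.length; i++) {
    var c = s.charCodeAt(i);
    // If this is a high surrogate followed by a low surrogate,
    // skip the low half: the pair is a single code point.
    if (c >= 0xD800 && c <= 0xDBFF && i + 1 < s.length) {
      var d = s.charCodeAt(i + 1);
      if (d >= 0xDC00 && d <= 0xDFFF) i++;
    }
    n++;
  }
  return n;
}

codePointLength("a\uD834\uDF06b"); // 3, even though .length is 4
```

Full-Unicode strings would make this the behavior of .length and indexing directly, with no user-level pairing logic.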
> The purity of a "Unicode string" would be watered down, but perhaps not fatally. The Java language went through this (yeah, I know, I know...) and seems to have emerged unscathed.
Java's dead on the client. It is used by botnets (bugzilla.mozilla.org
recently suffered a DDoS from one; the bad guys didn't even bother
changing the user-agent from the Java runtime's default). See Brian
Krebs' blog.
> Norbert has a lovely doc here about the choices that led to this, which seems useful to consider: . W3C I18N Core WG has a wiki page shared with TC39 awhile ago here: .
> To me, switching to UTF-16 seems like a relatively small, containable, non-destructive change to allow supplementary character support.
I still don't know what you mean. How would what you call "switching to
UTF-16" differ from today, where one can inject surrogates into literals
by transcoding from an HTML document's or .js file's character set
encoding?
In particular, what do string indexing and .length count: uint16 storage
units or characters?