Full Unicode strings strawman

Shawn Steele Shawn.Steele at microsoft.com
Thu May 19 11:01:16 PDT 2011


>>> The crucial win of Allen's proposal comes down the road, when someone in a certain locale *can* do s.indexOf(nonBMPChar) and win.
>> s.indexOf("\U+10000"),

> Ok, but "\U+..." does not work today.

Yes, IMO that would be worth adding as a convenience, regardless of whether the backend is UTF-16 or UTF-32, though requiring 6 digits is annoying.  I'd prefer allowing something like \U+ffff, \U+10000, or \U+10FFFF, but then you'd have to do something interesting when additional 0-9a-f characters follow the escape, since the parser can't tell where the number ends.  So \U+{ffff} could be the explicit form when necessary.
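To make the braced form concrete, here is a rough sketch of what it would expand to. The function expandUPlus is hypothetical and only rewrites ordinary strings at run time; it is not a language feature, and the \U+{...} syntax itself is just the suggestion above.

```javascript
// Hypothetical helper, not part of any spec: expands "\U+{hex}" sequences
// found in an ordinary string into UTF-16 code units.
function expandUPlus(src) {
  return src.replace(/\\U\+\{([0-9a-fA-F]{1,6})\}/g, function (match, hex) {
    const cp = parseInt(hex, 16);
    if (cp > 0x10FFFF) return match;                 // out of range: leave as-is
    if (cp < 0x10000) return String.fromCharCode(cp); // BMP: one code unit
    const offset = cp - 0x10000;                      // supplementary: two units
    return String.fromCharCode(0xD800 + (offset >> 10),
                               0xDC00 + (offset & 0x3FF));
  });
}

// The braces mark where the number ends, so a trailing "0" stays a plain digit.
const bmp = expandUPlus("\\U+{ffff}0");
const astral = expandUPlus("\\U+{10000}");
```

Without the braces, "\U+ffff0" could mean either U+FFFF followed by "0" or the single code point U+FFFF0, which is the ambiguity the braces avoid.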

>> who cares that it ends up as UTF-16?  You can already do it, today, with s.indexOf("𐀀"). It happens that 𐀀 looks like d800 + dc00, but it still works.  Today.  This is no different than most other languages.
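A small sketch of that point, with U+10000 written as its surrogate pair so the source stays ASCII. The variable names are only for illustration.

```javascript
// U+10000 spelled as its two UTF-16 code units; same string as the literal char.
const linearB = "\uD800\uDC00";
const s = "ab" + linearB + "cd";

// indexOf matches the two code units in sequence, so the search succeeds.
const idx = s.indexOf(linearB);

// Positions and length are counted in 16-bit code units, not characters.
const len = s.length;
const lead = s.charCodeAt(idx);
const trail = s.charCodeAt(idx + 1);
```

Here idx is 2 and len is 6: the search works, but the supplementary character occupies two index positions.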

> My example was unclear. I meant something like a one-char indexOf where the result would be used to slice that char.
> That doesn't work today. That's the point.

I wonder if we could allow "char" to have 21 bits in number context, and be a surrogate pair in string contexts.
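For reference, the slicing problem and one possible workaround look roughly like this. sliceCharAt is a hypothetical helper written for this message, not an existing API.

```javascript
// Hypothetical helper: returns the whole character that starts at code-unit
// index i, taking two units when i begins a valid surrogate pair.
function sliceCharAt(str, i) {
  const lead = str.charCodeAt(i);
  if (lead >= 0xD800 && lead <= 0xDBFF && i + 1 < str.length) {
    const trail = str.charCodeAt(i + 1);
    if (trail >= 0xDC00 && trail <= 0xDFFF) {
      return str.slice(i, i + 2);   // lead + trail surrogate
    }
  }
  return str.slice(i, i + 1);       // BMP character or unpaired surrogate
}

const text = "x\uD800\uDC00y";
const i = text.indexOf("\uD800\uDC00");

// charAt only sees one code unit, so it returns half of the pair.
const naive = text.charAt(i);

// The helper widens the slice to cover the full pair.
const whole = sliceCharAt(text, i);
```

The caller has to know about surrogates to get the right slice, which is the inconvenience being discussed.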

> But hey, if JS does not need to change then we can avoid trouble and keep on using 16-bit indexing and length. Is this really the best outcome?

IMO we get 99% of what's needed by just changing to UTF-16 from UCS-2, although I'd like to see helpers like the U+10000 thing.

I think there are only 2 "tricky" parts with UTF-16 instead of UCS-2:
* Fixing the URL encode/decode functions so that they use UTF-8 instead of CESU-8.  (Actually, just encode, since decode would be obvious, I think.)
* Optionally, for convenience, getting a 21-bit number from a surrogate pair in a string, because the existing API can't know whether you want just the D800 or the 10000 represented by the D800, DC00 pair.  That could be useful for finding out whether the pair is, say, one of the math bold forms: you could just test 1D400 <= x <= 1D433 instead of trying to figure out the pairs.
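
Both items can be sketched in a few lines. codePointFromPair and isMathBold are hypothetical names made up for this illustration; the arithmetic is the standard UTF-16 decoding formula, and the encodeURIComponent line shows the four-byte UTF-8 form for U+10000 rather than two three-byte CESU-8 sequences.

```javascript
// Hypothetical helper: returns the 21-bit code point starting at index i,
// or the single code unit when there is no valid surrogate pair there.
function codePointFromPair(str, i) {
  const lead = str.charCodeAt(i);
  if (lead >= 0xD800 && lead <= 0xDBFF && i + 1 < str.length) {
    const trail = str.charCodeAt(i + 1);
    if (trail >= 0xDC00 && trail <= 0xDFFF) {
      return (lead - 0xD800) * 0x400 + (trail - 0xDC00) + 0x10000;
    }
  }
  return lead;
}

// Hypothetical range test for the math bold letters mentioned above.
function isMathBold(str, i) {
  const x = codePointFromPair(str, i);
  return 0x1D400 <= x && x <= 0x1D433;
}

const boldA = "\uD835\uDC00";                        // U+1D400 as a pair
const cp = codePointFromPair(boldA, 0);
const bold = isMathBold(boldA, 0);

// UTF-8 for U+10000 is the four bytes F0 90 80 80.
const encoded = encodeURIComponent("\uD800\uDC00");
```

With the number in hand, the range check is a single comparison instead of separate tests on the lead and trail units.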

-Shawn

> /be
Big-endian? ;-)

