New full Unicode for ES6 idea
Brendan Eich
brendan at mozilla.com
Sun Feb 19 19:52:23 PST 2012
Gavin Barraclough wrote:
> One way in which the proposal under discussion seems to differ from
> the previous strawman is in the behavior arising from concatenation of
> strings ending/beginning with a surrogate hi and lo element.
> How do we want to handle unpaired UTF-16
> surrogates in a full-unicode string? I can see three options:
>
> 1) Prohibit values from strings that do not map to valid unicode
> characters (either throw an exception, or replace with the unicode
> replacement character).
> 2) Allow invalid unicode characters in strings, and preserve them over
> concatenation – ("\uD800" + "\uDC00").length == 2.
> 3) Allow invalid unicode characters in strings, but allow surrogate
> pairs to fuse over concatenation – ("\uD800" + "\uDC00").length == 1.
>
> It seems desirable for full-unicode strings to logically be a sequence
> of unicode characters, stored and processed in an encoding-agnostic
> manner. Option 3 would seem to violate that, exposing the underlying
> UTF-16 implementation. It also loses a distributive property of
> .length over concatenation that I believe is true in ES5 for strings,
> in that currently for all strings s1 & s2:
> s1.length + s2.length == (s1 + s2).length
> However if we allow concatenation to fuse surrogate pairs into a
> single character (e.g. s1 = "\uD800", s2 = "\uDC00") this will no
> longer be true.
>
> I guess I wonder if it's worth considering either option 1) or 2) –
> either prohibiting invalid unicode characters in strings, or considering
> something closer to the previous strawman, where string storage is
> defined to be 32-bit (with a BRS that instead of changing iteration
> would change string creation, introducing an implicit UTF16-UTF32
> conversion).
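To make the quoted length argument concrete, here is a minimal sketch. It assumes code-unit strings as in ES5, plus the string iterator and Array.from that were later standardized in ES2015; it does not model the BRS proposal itself. The helper codePointCount is a hypothetical name used only for illustration.

```javascript
// Strings here are sequences of 16-bit code units (option 2 behavior).
const hi = "\uD800"; // lone high surrogate
const lo = "\uDC00"; // lone low surrogate
const joined = hi + lo;

// .length counts code units, so it distributes over concatenation:
// s1.length + s2.length == (s1 + s2).length
const lengthIsAdditive = hi.length + lo.length === joined.length;

// Hypothetical helper: count code points by iterating the string.
// The ES2015 string iterator yields an adjacent high+low surrogate
// pair as a single element and a lone surrogate as its own element.
function codePointCount(s) {
  return Array.from(s).length;
}

// Counted by code point, the two halves fuse on concatenation,
// which is the option 3 behavior and breaks the additive property.
const fusedCount = codePointCount(joined);
const separateCount = codePointCount(hi) + codePointCount(lo);

console.log(joined.length, lengthIsAdditive, fusedCount, separateCount);
```

Counting by code unit keeps the sum intact, while counting by code point gives a smaller total for the concatenation than for the two parts, which is exactly the loss described for option 3.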
Great post. I agree 3 is not good. I was thinking based on today's
exchanges that the BRS being set to "full Unicode" *could* mean that
"\uXXXX" is illegal and you *must* use "\u{...}" to write Unicode *code
points* (not code units).
Last year we dispensed with the binary data hacking in strings use-case.
I don't see the hardship. But rather than throw exceptions on
concatenation I would simply eliminate the ability to spell code units
with "\uXXXX" escapes. Who's with me?
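As a rough illustration of the two spellings, the sketch below uses the "\u{...}" escape, codePointAt, and String.fromCodePoint as they were later standardized in ES2015. It is an assumption-laden comparison, not the BRS semantics: in ES2015 as shipped, "\uXXXX" stays legal and .length still counts code units, whereas under the "full Unicode" idea above the same string would presumably have length 1.

```javascript
// One character, U+1F600, spelled two ways.
const viaCodePoint = "\u{1F600}";    // code point escape
const viaCodeUnits = "\uD83D\uDE00"; // the same character as a surrogate pair

// Both spellings produce the identical string value.
const sameString = viaCodePoint === viaCodeUnits;

// As shipped, .length reports UTF-16 code units, not code points.
const codeUnitLength = viaCodePoint.length;

// codePointAt reads the full code point starting at a given index,
// and String.fromCodePoint builds a string from a code point.
const codePoint = viaCodePoint.codePointAt(0);
const rebuilt = String.fromCodePoint(0x1F600);

console.log(sameString, codeUnitLength, codePoint.toString(16), rebuilt === viaCodePoint);
```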
/be