Full Unicode strings strawman
Shawn.Steele at microsoft.com
Thu May 19 15:35:55 PDT 2011
I'm still not at all convinced :) I don't buy that the linguistic case isn't interesting (though granted JS is really bad about that to date), and I don't buy that non-linguistic uses have any trouble with UTF-16. \UD800\UDC00 == \UD800\UDC00 just as easily in UTF-16. The "only" advantage is that supplementary characters would "sort" > 0xFFFF in UTF-32 ordinally, but since the order is sort of meaningless anyway, it doesn't really matter that they sort in the surrogate range instead.
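A minimal sketch of the comparison point above, assuming an engine whose strings are sequences of 16-bit code units. The helper `codePointAtIndex` is hypothetical and exists only for illustration; it is not part of the language or of the strawman.

```javascript
// U+10000 written as a UTF-16 surrogate pair, and U+FFFF, the last BMP code point.
var supplementary = "\uD800\uDC00";
var lastBmp = "\uFFFF";

// Equality compares code unit by code unit, so the pair equals itself.
var sameString = supplementary === "\uD800\uDC00";

// Relational comparison is also by code unit: 0xD800 < 0xFFFF, so the
// supplementary character sorts inside the surrogate range, below U+FFFF.
var sortsBelow = supplementary < lastBmp;

// Hypothetical helper: decode the code point that starts at index i.
// A well-formed surrogate pair is combined; anything else is returned as-is.
function codePointAtIndex(s, i) {
  var hi = s.charCodeAt(i);
  if (hi >= 0xD800 && hi <= 0xDBFF && i + 1 < s.length) {
    var lo = s.charCodeAt(i + 1);
    if (lo >= 0xDC00 && lo <= 0xDFFF) {
      return (hi - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000;
    }
  }
  return hi;
}

// In code point (UTF-32) order the same character sorts above U+FFFF.
var sortsAboveByCodePoint =
  codePointAtIndex(supplementary, 0) > codePointAtIndex(lastBmp, 0);
```

The two orderings disagree only about where supplementary characters land relative to U+E000..U+FFFF; equality is unaffected.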
From: Allen Wirfs-Brock [mailto:allen at wirfs-brock.com]
Sent: Thursday, May 19, 2011 3:00 PM
To: Shawn Steele
Cc: Waldemar Horwat; es-discuss at mozilla.org; Peter Constable
Subject: Re: Full Unicode strings strawman
On May 19, 2011, at 2:06 PM, Shawn Steele wrote:
> There are several sequences in Unicode which are meaningless if you have only one character and not the other. Eg: any of the variation selectors by themselves are meaningless. So if you break a modified character from its variation selector you've damaged the string. That's pretty much identical to splitting a high surrogate from its low surrogate.
> Things that UTF-32 works for without special cases:
> * Ordinal collation/sorting (eg: non-linguistic (so why is it a string?))
This is exactly my point. The string data type in ECMAScript is non-linguistic. There is nothing Unicode-specific about the fundamental ECMAScript string data type, nor about any of the language operations (concatenation, comparison, length determination, character access) upon strings. Similarly, the majority of String methods also have no specific Unicode semantic dependencies (the exceptions are the toUpper/LowerCase methods, and they don't treat surrogate pairs as a unit). The string data type can be used for many purposes that have nothing to do with the linguistic semantics of Unicode. That is why linguistic-based arguments seem to be missing the point.
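To illustrate, here is a small sketch of the basic string operations acting on 16-bit character codes with no Unicode interpretation. The variable names are made up for the example.

```javascript
// "a", then U+1D538 spelled as a surrogate pair, then "b".
var text = "a" + "\uD835\uDD38" + "b";

// Length counts 16-bit character codes, not Unicode characters.
var unitCount = text.length; // 4 code units for 3 characters

// Character access returns one character code; here it is half of a pair.
var secondUnit = text.charCodeAt(1); // 0xD835, a lone high surrogate

// Nothing requires the codes to be meaningful text at all: a string can
// carry arbitrary 16-bit values, including an unpaired surrogate.
var packed = String.fromCharCode(0x0001, 0xFFFE, 0xD800);
var packedLength = packed.length;
var lastPacked = packed.charCodeAt(2);
```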
Where there is a potential connection between Unicode semantics and the string data type is in the interpretation of ECMAScript string literals as constructors of string values. ECMAScript is biased towards Unicode in the sense that it only supports a Unicode interpretation of string literals. However, ECMAScript literals can currently contain only BMP characters and escape sequences that produce BMP code points, and these are directly represented in string values as 16-bit character codes. Given that level of Unicode bias in the language, there is obvious utility in allowing literals to contain any Unicode character and in supporting the generation of string values that use Unicode UTF-16 (and possibly, as an alternative, UTF-8) encoding semantics. The utility of such features seems independent of the underlying size of the string type's character codes.
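As a rough sketch of what generating UTF-16 string values involves with today's 16-bit character codes: a `\u` escape yields one BMP code, so a supplementary character has to be spelled as two escapes or computed. The function `toUtf16String` below is a hypothetical helper, not an existing or proposed API.

```javascript
// A \u escape produces exactly one 16-bit character code.
var bmpLiteral = "\u0041"; // same value as "A"

// Hypothetical helper: build the UTF-16 string value for any code point.
function toUtf16String(codePoint) {
  if (codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError("not a Unicode code point: " + codePoint);
  }
  if (codePoint < 0x10000) {
    return String.fromCharCode(codePoint); // one code unit
  }
  var offset = codePoint - 0x10000;         // 20 significant bits
  var high = 0xD800 + (offset >> 10);       // top 10 bits
  var low = 0xDC00 + (offset & 0x3FF);      // bottom 10 bits
  return String.fromCharCode(high, low);    // two code units
}

// U+1D538 comes out as the same value as the hand-written escape pair.
var computed = toUtf16String(0x1D538);
var handWritten = "\uD835\uDD38";
```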