Full Unicode strings strawman

Allen Wirfs-Brock allen at wirfs-brock.com
Thu May 19 15:00:07 PDT 2011


On May 19, 2011, at 2:06 PM, Shawn Steele wrote:

> There are several sequences in Unicode which are meaningless if you have only one character and not the other.  Eg: any of the variation selectors by themselves are meaningless.  So if you break a modified character from its variation selector you've damaged the string.  That's pretty much identical to splitting a high surrogate from its low surrogate.  
> ...
> Things that UTF-32 works for without special cases:
> * Ordinal collation/sorting (eg: non-linguistic (so why is it a string?))

This is exactly my point.  The string data type in ECMAScript is non-linguistic.  There is nothing Unicode-specific about the fundamental ECMAScript string data type, nor about any of the language operations (concatenation, comparison, length determination, character access) on strings. Similarly, the majority of String methods have no specific Unicode semantic dependencies (the exceptions are the toUpperCase/toLowerCase methods, and they don't treat surrogate pairs as a unit).  The string data type can be used for many purposes that have nothing to do with the linguistic semantics of Unicode. That is why linguistically based arguments seem to be missing the point.
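
A minimal sketch of the point above, assuming an engine that stores string values as sequences of 16-bit code units, as ES5 specifies. The variable names are illustrative only; the operations shown are the built-in length, charCodeAt, charAt, concatenation, and relational comparison.

```javascript
// U+1D11E (a supplementary character) written as two BMP escapes.
// The string value is simply two 16-bit code units.
const pair = "\uD834\uDD1E";

// Length determination counts code units, not characters.
const lengthInUnits = pair.length;        // 2

// Character access returns individual code units.
const firstUnit = pair.charCodeAt(0);     // 0xD834
const secondUnit = pair.charCodeAt(1);    // 0xDD1E

// A string value need not be well-formed UTF-16:
// a lone high surrogate is a perfectly valid one-element string.
const lone = "\uD834";
const loneLength = lone.length;           // 1

// Comparison is ordinal over code units. U+FF5E sorts after the
// pair because 0xFF5E > 0xD834, although U+FF5E < U+1D11E as code points.
const ordinalResult = "\uFF5E" > pair;    // true

// Concatenation just joins the code-unit sequences.
const rejoined = pair.charAt(0) + pair.charAt(1) === pair;  // true
```

None of these operations consults Unicode character properties; they would behave the same way on any sequence of 16-bit values.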

Where there is a potential connection between Unicode semantics and the string data type is in the interpretation of ECMAScript string literals as constructors of string values. ECMAScript is biased towards Unicode in the sense that it only supports a Unicode interpretation of string literals. However, ECMAScript literals can currently only contain BMP characters and escape sequences that produce BMP code points, and these are directly represented in string values as 16-bit character codes.  Given that level of Unicode bias in the language, there is obvious utility in allowing literals to contain any Unicode character and in supporting the generation of string values that use Unicode UTF-16 (and possibly, as an alternative, UTF-8) encoding semantics.  The utility of such features seems independent of the underlying size of the string type's character codes.
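
As a rough sketch of what generating a string value with UTF-16 encoding semantics could look like on top of 16-bit character codes: the helper name utf16FromCodePoint below is hypothetical and not part of any proposal in this thread. It relies only on the existing String.fromCharCode and the standard UTF-16 surrogate arithmetic.

```javascript
// Hypothetical helper: build a string value holding the UTF-16
// encoding of a single Unicode code point.
function utf16FromCodePoint(codePoint) {
  if (codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError("Not a Unicode code point: " + codePoint);
  }
  // BMP code points are represented directly as one 16-bit code unit.
  // (No check is made here for lone surrogate values.)
  if (codePoint <= 0xFFFF) {
    return String.fromCharCode(codePoint);
  }
  // Supplementary code points become a high/low surrogate pair.
  const offset = codePoint - 0x10000;
  const high = 0xD800 + (offset >> 10);    // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);   // bottom 10 bits
  return String.fromCharCode(high, low);
}

// U+1D11E yields the same value as the literal "\uD834\uDD1E".
const encoded = utf16FromCodePoint(0x1D11E);
const sameAsLiteral = encoded === "\uD834\uDD1E";  // true
```

The resulting value is an ordinary string of two 16-bit character codes, so the sketch works regardless of whether the language later widens its character codes.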

Allen





