Full Unicode strings strawman

Shawn Steele Shawn.Steele at microsoft.com
Thu May 19 14:06:44 PDT 2011


There are several sequences in Unicode that are meaningless if you have one character without the other.  E.g., any of the variation selectors by themselves are meaningless.  So if you separate a modified character from its variation selector, you've damaged the string.  That's pretty much identical to splitting a high surrogate from its low surrogate.

The fallacy that surrogate pairs are somehow more special than other Unicode character sequences is a huge problem.  Developers think they've solved "the problem" by detecting surrogate pairs, when in truth they haven't even scratched the surface of "the problem".
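To make that concrete, here's a rough sketch in ES terms (the particular characters are just illustrative):

    // U+2764 HEAVY BLACK HEART followed by U+FE0F VARIATION SELECTOR-16.
    const heart = "\u2764\uFE0F";
    heart.substring(1);              // "\uFE0F" -- a variation selector with nothing left to modify

    // ...which is no better or worse than splitting a surrogate pair:
    const gothic = "\uD800\uDF30";   // U+10330 GOTHIC LETTER AHSA as a UTF-16 pair
    gothic.substring(0, 1);          // "\uD800" -- an unpaired high surrogate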

In the Unicode community it is well known and understood that there is little, if any, practical advantage to using UTF-32 over UTF-16.  I realize this seems counterintuitive on the surface.

Microsoft is firmly in favor of the UTF-16 representation, though, as I stated before, we'd be happy if there were convenience functions, like for entering string literals.  (I can type 10000 and press Alt-X in Outlook to get 𐀀, which never gets stored as UTF-32, but it sure is convenient.)
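For instance, a literal-level convenience might look something like the sketch below; the \u{...} escape shown is one proposed spelling, used purely for illustration, and either way the stored value is the same pair of 16-bit units:

    const viaSurrogates = "\uD800\uDC00";   // U+10000 spelled as an explicit surrogate pair
    const viaCodePoint  = "\u{10000}";      // proposed code point escape (illustrative only)
    viaSurrogates === viaCodePoint;          // true -- both are the same two 16-bit code units
    viaSurrogates.length;                    // 2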

If there weren't an existing body of applications using UTF-16 already, I'd be a little more open to changing, though UTF-32 would be inconvenient for us.  Since ES already, effectively, uses UTF-16, changing now just opens a can of incompatibility and special-case worms.

Things that UTF-32 works for without special cases:
* Ordinal collation/sorting (eg: non-linguistic (so why is it a string?))

Things where surrogates don't really add complexity for UTF-16:
* Linguistic sorting.
   * You want Ä (U+00C4) == Ä (U+0041, U+0308) anyway, so you have to understand sequences.
   * Or maybe you want both of them equal to AE (as in German sorting).
   * Many code points have no weight at all.
   * Compressions (dz) and double compressions (ddz) are far more complex.
   * etc.
* String searching, eg: you can search for A in Ä, but do you really want it to match?  
* SubString.  I found A in Äpfle, but is it really smart to break that into "A" and "̈pfle"?  (There's a U+0308 before the pfle, which is pretty meaningless on its own and probably doesn't render well in everyone's font; it makes my quote mark look funny.)  Probably this isn't very desirable; see the sketch after this list.
* And that just scratches the surface
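A short sketch of the searching and SubString points above (localeCompare behavior is implementation-dependent, so treat the 0 as illustrative):

    const precomposed = "\u00C4pfle";    // "Äpfle" with U+00C4
    const decomposed  = "A\u0308pfle";   // "Äpfle" with U+0041 + U+0308
    precomposed === decomposed;              // false -- plain code unit comparison
    precomposed.localeCompare(decomposed);   // 0 in implementations that compare canonically

    decomposed.indexOf("A");    // 0 -- "found" it, but should it have matched?
    decomposed.substring(1);    // "\u0308pfle" -- starts with a dangling combining mark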

Things that don't change with UTF-16:
* People shoving binary data into UTF-16 and pretending it's a string aren't broken (assuming we allow irregular sequences for compatibility).

Things that make UTF-32 complicated:
* We already support UTF-16.  There will be an amazing number of problems around whether U+D800, U+DC00 == U+10000 if the same character can be represented two different ways in the same string.  Some of those will be security problems.
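To illustrate (again using the proposed \u{...} escape purely for readability):

    // With 16-bit string elements there is exactly one stored form for U+10000:
    "\uD800\uDC00" === "\u{10000}";   // true -- both spellings produce the same two code units
    // With 21-bit elements, a one-element string <U+10000> and a two-element string
    // <U+D800, U+DC00> could both exist, compare unequal, and hash differently,
    // and every comparison, lookup, and security check would have to pick a policy.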

-Shawn

-----Original Message-----
From: es-discuss-bounces at mozilla.org [mailto:es-discuss-bounces at mozilla.org] On Behalf Of Allen Wirfs-Brock
Sent: Thursday, May 19, 2011 12:19 PM
To: Waldemar Horwat
Cc: es-discuss at mozilla.org
Subject: Re: Full Unicode strings strawman


On May 18, 2011, at 3:46 PM, Waldemar Horwat wrote:

> 
> 
> 2. Widening characters to 21 bits doesn't really help much.  As stated earlier in this thread, you still want to treat clumps of combining characters together with the character to which they combine, worry about various normalized forms, etc.  All of these require the machinery to deal with clumps of code units as though they were single characters/graphemes/etc., and once you have that machinery you can reuse it to support non-BMP characters while keeping string code units 16 bits.


Others have also made this argument, but I don't find it particularly persuasive.  It seems to be conflating the programming language concept of a "string" with various application specific interpretations of string values, and then saying that there is no reason to generalize the ECMAScript string type in one specific way (expanding the allowed code range of individual string elements) because that specific generalization does not, by itself, address certain application specific use cases (combining diacritic marks, grapheme selection, etc.).

Interpreting sequences of "characters" ("characters" meaning individual string elements) within a string requires applying some sort of semantic interpretation to individual "character" values and how they relate to each other when placed in sequence. These semantics are generally application specific, or at least domain specific. In addition to the Unicode-specific semantics that have been mentioned on this thread, other semantic interpretations of "character" sequences include things like recognizing English words and sentences, lexical parsing of ECMAScript or XML tokens, processing DNA nucleotide sequence notations, etc.

I contend that none of these or any other application specific semantics should be hardwired into a language's fundamental string data type.  The purpose of the string data type is to provide a foundation abstraction that can be used to implement any higher level application semantics.

In the absence of any semantic interpretation of "characters", language string types are most useful if they present a simple, uniform model of "character" sequences. The key issue in designing a language-level string abstraction is the selection of the number of states that can be encoded by each "character".  In plainer words, the size of a "character".  The "character" size limits the number of individual states that can be processed by built-in string operations.  As soon as an application requires more states per unit than can be represented by a "character", the application must apply its own specific multi-"character" encoding semantics to the processing of its logical units. Note that this encoding semantics is an additional semantic layer that lies between the basic string data type semantics and the actual application semantics regarding the interpretation of sequences of its own state units.  Such layered semantics can add significant application complexity.
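Concretely, the intermediate layer I'm describing looks roughly like the hand-rolled decoder below, which an application must write (or obtain) before it can work in terms of its own logical units; the function name is just illustrative:

    // Recover a 21-bit code point from the 16-bit "characters" at index i.
    function codePointAtIndex(s: string, i: number): number {
      const first = s.charCodeAt(i);
      if (first >= 0xD800 && first <= 0xDBFF && i + 1 < s.length) {
        const second = s.charCodeAt(i + 1);
        if (second >= 0xDC00 && second <= 0xDFFF) {
          return (first - 0xD800) * 0x400 + (second - 0xDC00) + 0x10000;
        }
      }
      return first;   // BMP "character" or unpaired surrogate
    }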

So one way to look at my proposal is that it is addressing the question of whether 16-bit "characters" are enough to allow the vast majority of applications to do application domain specific semantic string processing without having to work with the complexity of intermediate encodings that exist solely to expand the logical character size.

The processing of Unicode text is a key use case of the ECMAScript string datatype.  The fundamental unit of the Unicode code space is pretty obviously the 21-bit code point. Unicode provides domain-specific encodings to use when 21-bit characters are not available, but all semantics relating to character sequences seem to be expressed in terms of these 21-bit code points.  This strongly suggests that 16-bit "characters" are not enough.  Whether an application chooses to internally use a UTF-8 or UTF-16 encoding is an independent question that is probably best left to the application designer.  But if ECMAScript strings are limited to 16-bit "characters", that designer doesn't have the option of straightforwardly processing unencoded Unicode code points.
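For comparison, this is the shape of processing I have in mind when I say "straightforwardly processing unencoded Unicode code points" (the iteration and escape forms shown are illustrative sketches, not part of today's language):

    const s = "A\u{10330}B";        // one BMP character, one supplementary character, one BMP character
    const units = Array.from(s);    // ["A", "\u{10330}", "B"] -- 3 logical elements
    units.length;                   // 3, the code point count
    s.length;                       // 4 today, because U+10330 occupies two 16-bit units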

Allen


_______________________________________________
es-discuss mailing list
es-discuss at mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


