Full Unicode strings strawman

Allen Wirfs-Brock allen at wirfs-brock.com
Thu May 19 12:19:15 PDT 2011

On May 18, 2011, at 3:46 PM, Waldemar Horwat wrote:

> 2. Widening characters to 21 bits doesn't really help much.  As stated earlier in this thread, you still want to treat clumps of combining characters together with the character to which they combine, worry about various normalized forms, etc.  All of these require the machinery to deal with clumps of code units as though they were single characters/graphemes/etc., and once you have that machinery you can reuse it to support non-BMP characters while keeping string code units 16 bits.

Others have also made this argument, but I don't find it particularly persuasive.  It seems to be conflating the programming language concept of a "string" with various application specific interpretations of string values and then saying that there is no reason to generalize the ECMAScript string type in one specific way (expanding the allowed code range of individual string elements) because that specific generalization does not address certain application specific use cases (combining diacritic marks, grapheme selection, etc.).

Interpreting sequences of "characters" ("characters" meaning individual string elements) within a string requires applying some sort of semantic interpretation to individual "character" values and how they relate to each other when placed in sequence. This semantics is generally application or at least domain specific. In addition to the Unicode specific semantics that have been mentioned on this thread, other semantic interpretations of "character" sequences include things like recognizing English words and sentences, lexical parsing of ECMAScript or XML tokens, processing DNA nucleotide sequence notations, etc.

I contend that none of these or any other application specific semantics should be hardwired into a language's fundamental string data type.  The purpose of the string data type is to provide a foundation abstraction that can be used to implement any higher level application semantics.

In the absence of any semantic interpretation of "characters", language string types are most useful if they present a simple uniform model of "character" sequences. The key issue in designing a language level string abstraction is the selection of the number of states that can be encoded by each "character".  In plainer words, the size of a "character".  The "character" size limits the number of individual states that can be processed by built-in string operations.  As soon as an application requires more states per unit than can be represented by a "character", the application must apply its own specific multi-"character" encoding semantics to the processing of its logical units. Note that this encoding semantics is an additional semantic layer that lies between the basic string data type semantics and the actual application semantics regarding the interpretation of sequences of its own state units.  Such layered semantics can add significant application complexity.
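To make the "additional semantic layer" concrete, here is a minimal sketch of what an application has to write today if its logical units are Unicode code points but its string elements are 16 bits.  The helper name toCodePoints is hypothetical, not part of any ECMAScript edition; it assumes the application has chosen UTF-16 surrogate pairs as its multi-"character" encoding and uses only charCodeAt.

```javascript
// Hypothetical helper (an illustration, not a built-in): decode a string of
// 16-bit elements into an array of code points, assuming UTF-16 surrogate pairs.
function toCodePoints(str) {
  var result = [];
  for (var i = 0; i < str.length; i++) {
    var unit = str.charCodeAt(i);
    // A high surrogate followed by a low surrogate encodes one code point.
    if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < str.length) {
      var next = str.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        result.push((unit - 0xD800) * 0x400 + (next - 0xDC00) + 0x10000);
        i++; // skip the low surrogate that was just consumed
        continue;
      }
    }
    // BMP element or unpaired surrogate: pass through unchanged.
    result.push(unit);
  }
  return result;
}

var sample = "a\uD834\uDD1Eb"; // "a", U+1D11E MUSICAL SYMBOL G CLEF, "b"
var points = toCodePoints(sample);
// sample has four string elements, but only three logical units.
```

Every operation that cares about logical units (indexing, slicing, counting) has to go through a layer like this before the application's own semantics can even begin.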

So one way to look at my proposal is that it is addressing the question of whether 16-bit "characters" are enough to allow the vast majority of applications to do application domain specific semantic string processing without having to work with the complexity of intermediate encodings that exist solely to expand the logical character size.

The processing of Unicode text is a key use case of the ECMAScript string datatype.  The fundamental unit of the Unicode code space is pretty obviously the 21-bit code point. Unicode provides domain specific encodings to use when 21-bit characters are not available, but all semantics relating to character sequences seem to be expressed in terms of these 21-bit code points.  This seems to strongly suggest that 16-bit "characters" are not enough.  Whether an application chooses to internally use a UTF-8 or UTF-16 encoding is an independent question that is probably best left to the application designer.  But if ECMAScript strings are limited to 16-bit "characters", that designer doesn't have the option of straightforwardly processing unencoded Unicode code points.
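As a small illustration of the size mismatch, the following sketch uses only operations that exist on strings with 16-bit elements.  The variable names are mine, chosen for the example; the only facts assumed are that the Unicode code space ends at U+10FFFF and that a supplementary character is held as two string elements.

```javascript
// The largest Unicode code point and the number of bits needed to hold it.
var MAX_CODE_POINT = 0x10FFFF;
var bitsNeeded = MAX_CODE_POINT.toString(2).length; // count of binary digits
var fitsIn16Bits = MAX_CODE_POINT <= 0xFFFF;

// One code point outside the BMP, as a string with 16-bit elements must hold it.
var clef = "\uD834\uDD1E"; // U+1D11E MUSICAL SYMBOL G CLEF
var elementCount = clef.length;       // counts string elements, not code points
var firstElement = clef.charCodeAt(0); // a surrogate, not the code point itself

// A built-in style operation applied per element splits the encoded pair.
var reversed = clef.split("").reverse().join("");
var pairSurvived = reversed === clef;
```

Here bitsNeeded is 21, elementCount is 2 for a single code point, and the element-wise reversal leaves the two surrogates in the wrong order, so the result no longer encodes U+1D11E.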

