Full Unicode strings strawman

Allen Wirfs-Brock allen at wirfs-brock.com
Tue May 17 09:43:41 PDT 2011


On May 17, 2011, at 1:15 AM, Norbert Lindenberg wrote:

> I have read the discussion so far, but would like to come back to the strawman itself because I believe that it starts with a problem statement that's incorrect and misleading the discussion. Correctly describing the current situation would help in the discussion of possible changes, in particular their compatibility impact.

Since I was the editor of the ES5 specification and the person who wrote or updated this language, I can tell you the actual intent.

Clause 6 of Edition 3 referenced Unicode version 2.1 and used the words "codepoint" and "character" interchangeably to refer to "a 16-bit unsigned value used to represent a single 16-bit unit of UTF-16 text". It also defined "Unicode character" to mean "... a single Unicode scalar value (which may be longer than 16 bits) ..." and defined the production
   SourceCharacter ::
       any Unicode character

Regardless of these definitions, the rest of the Edition 3 specification generally treated SourceCharacter as being equivalent to a 16-bit "character". For example, the elements of a StringLiteral are SourceCharacters, but in defining how string values are constructed from these literals, each SourceCharacter maps to a single 16-bit character element of the string value. There is no discussion of how to handle SourceCharacters whose Unicode scalar value is larger than 0xFFFF.

In practice, browser JavaScript implementations of this era processed source input as if it were UCS-2. They did not recognize a surrogate pair as a single Unicode-character SourceCharacter.
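
The practical effect on string values can be seen directly; here is a minimal sketch (the character and values are just illustrative):

    // U+1D11E MUSICAL SYMBOL G CLEF written as an explicit surrogate pair
    var clef = "\uD834\uDD1E";
    clef.length;        // 2 -- two 16-bit elements, not one character
    clef.charCodeAt(0); // 0xD834 (high surrogate)
    clef.charCodeAt(1); // 0xDD1E (low surrogate)

Nothing in the language treats the pair as a single character; it is carried through as two independent string elements.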

In drafting the ES5 spec, TC39 had two goals WRT character encoding. We wanted to allow the occurrence of (BMP) characters defined in Unicode versions beyond 2.1, and we wanted to update the specification to reflect the actual implementation reality that source was processed as if it were UCS-2. We updated the Unicode reference to "3.0 or later". More importantly, we changed the definition of SourceCharacter to
   SourceCharacter ::
       any Unicode code unit
and generally changed all uses of the term "codepoint" to "code unit". Other editorial changes were also made to clarify that the alphabet of ES5 is 16-bit code units.

This editorial update was probably incomplete, and some vestiges of the old language still remain. In other cases the language may not be technically correct according to the official Unicode vocabulary. Regardless, that was the intent behind these changes: to make the ES5 spec match the current web reality that JavaScript implementations process source code as if it existed in a UCS-2 world.

The ES5 specification language clearly still has issues WRT Unicode encoding of programs and strings. These need to be fixed in the next edition. However, interpreting the current language as allowing supplementary characters to occur in program text, and particularly in string literals, doesn't match either reality or the intent of the ES5 spec. Changing the specification to allow such occurrences and propagating them as surrogate pairs would seem to raise backwards-compatibility concerns similar to those that have been raised about my proposal. The status quo would be to simply further clarify that the ES language processes programs as if they are UCS-2 encoded. I would prefer to move the language in a direction that doesn't have this restriction. The primary area of concern will be the handling of supplementary characters in string literals. Interpreting them either as single elements of a UTF-32-encoded string or as UTF-16 pairs in existing ES strings would be a change from current practice. The impact of both approaches needs to be better understood.
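
A small sketch of why that choice matters for existing code (illustrative values only):

    var s = "\uD834\uDD1E!";   // U+1D11E followed by "!"
    s.length;                  // 3 today: two code units plus one
    s.charAt(2);               // "!" under the current 16-bit model
    // Under a code-point-based (UTF-32-like) interpretation the same literal
    // would give s.length === 2 and s.charAt(1) === "!", so code written
    // against the current 16-bit model could observe different results.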


> 
> 
> The relevant portion of the problem statement:
> 
> "ECMAScript currently only directly supports the 16-bit basic multilingual plane (BMP) subset of Unicode which is all that existed when ECMAScript was first designed. [...] As currently defined, characters in this expanded character set cannot be used in the source code of ECMAScript programs and cannot be directly included in runtime ECMAScript string values."
> 
> 
> My reading of the ECMAScript Language Specification, edition 5.1 (January 2011), is:
> 
> 1) ECMAScript allows, but does not require, implementations to support the full Unicode character set.
> 
> 2) ECMAScript allows source code of ECMAScript programs to contain characters from the full Unicode character set.
> 
> 3) ECMAScript requires implementations to treat String values as sequences of UTF-16 code units, and defines key functionality based on an interpretation of String values as sequences of UTF-16 code units, not based on an interpretation as sequences of Unicode code points.
> 
> 4) ECMAScript prohibits implementations from conforming to the Unicode standard with regards to case conversions.
> 
> 
> The relevant text portions leading to these statements are:
> 
> 1) Section 2, Conformance: "A conforming implementation of this Standard shall interpret characters in conformance with the Unicode Standard, Version 3.0 or later and ISO/IEC 10646-1 with either UCS-2 or UTF-16 as the adopted encoding form, implementation level 3. If the adopted ISO/IEC 10646-1 subset is not otherwise specified, it is presumed to be the BMP subset, collection 300. If the adopted encoding form is not otherwise specified, it presumed to be the UTF-16 encoding form."
> 
> To interpret this, note that the Unicode Standard, Version 3.1 was the first one to encode actual supplementary characters [1], and that the only difference between UCS-2 and UTF-16 is that UTF-16 supports supplementary characters while UCS-2 does not [2].

Other than changing "2.1" to "3.0", this text was not updated from Edition 3. It probably should have been. The original text was written at a time when, in practice, UCS-2 and UTF-16 meant the same thing.

The ES5 motivation for changing "2.1" to "3.0" was not to allow for the use of supplementary characters.  As described  above, ES5 was attempting to clarify that supplementary characters are not recognized.  


> 
> 2) Section 6, Source Text: "ECMAScript source text is represented as a sequence of characters in the Unicode character encoding, version 3.0 or later. [...] ECMAScript source text is assumed to be a sequence of 16-bit code units for the purposes of this specification. [...] If an actual source text is encoded in a form other than 16-bit code units it must be processed as if it was first converted to UTF-16."
> 
> To interpret this, note again that the Unicode Standard, Version 3.1 was the first one to encode actual supplementary characters, and that the conversion requirement enables the use of supplementary characters represented as 4-byte UTF-8 characters in source text. As UTF-8 is now the most commonly used character encoding on the web [3], the 4-byte UTF-8 representation, not Unicode escape sequences, should be seen as the normal representation of supplementary characters in ECMAScript source text.
This was not the intent. Note that SourceCharacter was redefined as "code unit".


> 
> 3) Section 6, Source Text: "If an actual source text is encoded in a form other than 16-bit code units it must be processed as if it was first converted to UTF-16. [...] Throughout the rest of this document, the phrase “code unit” and the word “character” will be used to refer to a 16-bit unsigned value used to represent a single 16-bit unit of text." Section 15.5.4.4, String.prototype.charCodeAt(pos): "Returns a Number (a nonnegative integer less than 2**16) representing the code unit value of the character at position pos in the String resulting from converting this object to a String." Section 15.5.5.1 length: "The number of characters in the String value represented by this String object."
> 
> I don't like that the specification redefines a commonly used term such as "character" to mean something quite different ("code unit"), and hides that redefinition in a section on source text while applying it primarily to runtime behavior. But there it is: Thanks to the redefinition, it's clear that charCodeAt() returns UTF-16 code units, and that the length property holds the number of UTF-16 code units in the string.


Clause 6 defines "code unit", not "UTF-16 code unit". 15.5.4.4 charCodeAt returns the integer encoding of a "code unit". Nothing is said about it having any correspondence to UTF-16.
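
For example, recovering a supplementary code point from a string requires combining the two units by hand; charCodeAt itself never does this. The helper below is purely illustrative and is not part of ES5:

    function codePointFromUnits(str, i) {
      var hi = str.charCodeAt(i);
      if (hi >= 0xD800 && hi <= 0xDBFF && i + 1 < str.length) {
        var lo = str.charCodeAt(i + 1);
        if (lo >= 0xDC00 && lo <= 0xDFFF) {
          // combine a surrogate pair into a supplementary code point
          return (hi - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000;
        }
      }
      return hi; // a lone 16-bit unit is reported as-is
    }
    codePointFromUnits("\uD834\uDD1E", 0); // 0x1D11E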

> 
> 4) Section 15.5.4.16, String.prototype.toLowerCase(): "For the purposes of this operation, the 16-bit code units of the Strings are treated as code points in the Unicode Basic Multilingual Plane. Surrogate code points are directly transferred from S to L without any mapping."
> 
> This does not meet Conformance Requirement C8 of the Unicode Standard, Version 6.0 [4]: "When a process interprets a code unit sequence which purports to be in a Unicode character encoding form, it shall interpret that code unit sequence according to the corresponding code point sequence."

This is an explicit reflection that all character processing in ES5 (except for the URI encode/decode functions) exclusively processes 16-bit code units and does not recognize or process UTF-16-encoded surrogate pairs.
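
A minimal model of that per-code-unit behavior (this sketches the ES5 algorithm as quoted above; it is not necessarily what later editions or every engine will do):

    function toLowerCasePerUnit(str) {
      var out = "";
      for (var i = 0; i < str.length; i++) {
        var unit = str.charCodeAt(i);
        if (unit >= 0xD800 && unit <= 0xDFFF) {
          out += str.charAt(i);               // surrogate unit: copied unmapped
        } else {
          out += str.charAt(i).toLowerCase(); // BMP unit: normal case mapping
        }
      }
      return out;
    }
    // "\uD801\uDC00" (U+10400 DESERET CAPITAL LETTER LONG I) comes back
    // unchanged under this model instead of mapping to U+10428.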

