Full Unicode strings strawman

Norbert Lindenberg ecmascript at norbertlindenberg.com
Tue May 17 01:15:34 PDT 2011


I have read the discussion so far, but would like to come back to the
strawman itself, because I believe it starts with a problem statement
that is incorrect and that has misled the discussion. Correctly
describing the current situation would help the discussion of possible
changes, in particular their compatibility impact.


The relevant portion of the problem statement:

"ECMAScript currently only directly supports the 16-bit basic  
multilingual plane (BMP) subset of Unicode which is all that existed  
when ECMAScript was first designed. [...] As currently defined,  
characters in this expanded character set cannot be used in the source  
code of ECMAScript programs and cannot be directly included in runtime  
ECMAScript string values."


My reading of the ECMAScript Language Specification, edition 5.1  
(January 2011), is:

1) ECMAScript allows, but does not require, implementations to support  
the full Unicode character set.

2) ECMAScript allows source code of ECMAScript programs to contain  
characters from the full Unicode character set.

3) ECMAScript requires implementations to treat String values as  
sequences of UTF-16 code units, and defines key functionality based on  
an interpretation of String values as sequences of UTF-16 code units,  
not based on an interpretation as sequences of Unicode code points.

4) ECMAScript prohibits implementations from conforming to the Unicode
standard with regard to case conversions.


The relevant text portions leading to these statements are:

1) Section 2, Conformance: "A conforming implementation of this  
Standard shall interpret characters in conformance with the Unicode  
Standard, Version 3.0 or later and ISO/IEC 10646-1 with either UCS-2  
or UTF-16 as the adopted encoding form, implementation level 3. If the  
adopted ISO/IEC 10646-1 subset is not otherwise specified, it is  
presumed to be the BMP subset, collection 300. If the adopted encoding  
form is not otherwise specified, it presumed to be the UTF-16 encoding  
form."

To interpret this, note that the Unicode Standard, Version 3.1 was the  
first one to encode actual supplementary characters [1], and that the  
only difference between UCS-2 and UTF-16 is that UTF-16 supports  
supplementary characters while UCS-2 does not [2].
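
To make the difference between UCS-2 and UTF-16 concrete, here is the
surrogate pair arithmetic for one supplementary character, U+1D11E
MUSICAL SYMBOL G CLEF. The function is a hypothetical helper, written
only to illustrate the UTF-16 definition [2]; it is not part of the
language:

    // Illustration only: encode a supplementary code point
    // (U+10000..U+10FFFF) as its UTF-16 surrogate pair.
    function toSurrogatePair(cp) {
        var offset = cp - 0x10000;           // 20-bit offset
        var high = 0xD800 + (offset >> 10);  // high (lead) surrogate
        var low = 0xDC00 + (offset & 0x3FF); // low (trail) surrogate
        return String.fromCharCode(high, low);
    }
    toSurrogatePair(0x1D11E);  // "\uD834\uDD1E" - representable in
                               // UTF-16; not expressible in UCS-2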

2) Section 6, Source Text: "ECMAScript source text is represented as a  
sequence of characters in the Unicode character encoding, version 3.0  
or later. [...] ECMAScript source text is assumed to be a sequence of  
16-bit code units for the purposes of this specification. [...] If an  
actual source text is encoded in a form other than 16-bit code units  
it must be processed as if it was first converted to UTF-16."

To interpret this, note again that the Unicode Standard, Version 3.1  
was the first one to encode actual supplementary characters, and that  
the conversion requirement enables the use of supplementary characters  
represented as 4-byte UTF-8 sequences in source text. As UTF-8 is now  
the most commonly used character encoding on the web [3], the 4-byte  
UTF-8 representation, not Unicode escape sequences, should be seen as  
the normal representation of supplementary characters in ECMAScript  
source text.
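
For example, a UTF-8 encoded source file can contain U+1D11E MUSICAL
SYMBOL G CLEF directly in a string literal. A sketch of what an engine
sees, with the byte values following from the UTF-8 and UTF-16
definitions:

    // The literal's content is stored as the 4 bytes F0 9D 84 9E
    // in the UTF-8 source file.
    var clef = "𝄞";
    // Per section 6, the source is processed as if first converted
    // to UTF-16, so the String holds the two code units D834 DD1E.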

3) Section 6, Source Text: "If an actual source text is encoded in a  
form other than 16-bit code units it must be processed as if it was  
first converted to UTF-16. [...] Throughout the rest of this document,  
the phrase “code unit” and the word “character” will be used to  
refer to a 16-bit unsigned value used to represent a single 16-bit  
unit of text." Section 15.5.4.4, String.prototype.charCodeAt(pos):  
"Returns a Number (a nonnegative integer less than 2**16) representing  
the code unit value of the character at position pos in the String  
resulting from converting this object to a String." Section 15.5.5.1  
length: "The number of characters in the String value represented by  
this String object."

I don't like that the specification redefines a commonly used term  
such as "character" to mean something quite different ("code unit"),  
and hides that redefinition in a section on source text while applying  
it primarily to runtime behavior. But there it is: Thanks to the  
redefinition, it's clear that charCodeAt() returns UTF-16 code units,  
and that the length property holds the number of UTF-16 code units in  
the string.
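
A short example of the consequences, again using U+1D11E; the
codePointCount function is a hypothetical helper added for
illustration, not part of the language:

    var clef = "\uD834\uDD1E";        // one code point, U+1D11E
    clef.length;                      // 2 - counts UTF-16 code units
    clef.charCodeAt(0).toString(16);  // "d834" - the high surrogate
    clef.charCodeAt(1).toString(16);  // "dd1e" - the low surrogate

    // Hypothetical helper: counts code points rather than code units
    // by treating well-formed surrogate pairs as single characters.
    function codePointCount(s) {
        var count = 0;
        for (var i = 0; i < s.length; i++) {
            var c = s.charCodeAt(i);
            if (c >= 0xD800 && c <= 0xDBFF && i + 1 < s.length &&
                s.charCodeAt(i + 1) >= 0xDC00 &&
                s.charCodeAt(i + 1) <= 0xDFFF) {
                i++;  // skip the low surrogate of the pair
            }
            count++;
        }
        return count;
    }
    codePointCount(clef);             // 1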

4) Section 15.5.4.16, String.prototype.toLowerCase(): "For the  
purposes of this operation, the 16-bit code units of the Strings are  
treated as code points in the Unicode Basic Multilingual Plane.  
Surrogate code points are directly transferred from S to L without any  
mapping."

This does not meet Conformance Requirement C8 of the Unicode Standard,  
Version 6.0 [4]: "When a process interprets a code unit sequence which  
purports to be in a Unicode character encoding form, it shall  
interpret that code unit sequence according to the corresponding code  
point sequence."
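
A concrete case, using DESERET CAPITAL LETTER LONG I (U+10400), whose
lowercase mapping in the Unicode Character Database is U+10428. The
expected results below follow directly from the quoted specification
text:

    var deseret = "\uD801\uDC00";       // U+10400 as a surrogate pair
    deseret.toLowerCase() === deseret;  // true - section 15.5.4.16
    // passes the surrogate code points through unmapped, whereas a
    // Unicode-conformant UTF-16 case conversion would produce
    // "\uD801\uDC28", i.e. U+10428 DESERET SMALL LETTER LONG I.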


References:

[1] http://www.unicode.org/reports/tr27/tr27-4.html
[2] http://www.unicode.org/glossary/#U
[3] as Mark Davis reported at the Unicode Conference 2010
[4] http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf


Best regards,
Norbert



On May 16, 2011, at 11:11, Allen Wirfs-Brock wrote:

> I tried to post a pointer to this strawman on this list a few weeks  
> ago, but apparently it didn't reach the list for some reason.
>
> Feedback would be appreciated:
>
> http://wiki.ecmascript.org/doku.php?id=strawman:support_full_unicode_in_strings
>
> Allen


