New full Unicode for ES6 idea

Allen Wirfs-Brock allen at
Mon Feb 20 09:15:31 PST 2012

On Feb 20, 2012, at 8:20 AM, Brendan Eich wrote:

> Allen Wirfs-Brock wrote:
>>> Last year we dispensed with the binary data hacking in strings use-case. I don't see the hardship. But rather than throw exceptions on concatenation I would simply eliminate the ability to spell code units with "\uXXXX" escapes. Who's with me?
>> I think we need to be careful not to equate the syntax of ES string literals with the actual encoding space of string elements.
> I agree, which is why I'm saying with the BRS set, we should forbid "\uXXXX" since that is not a code point rather a code unit.
>>   Whether you say "\ud800" or "\u{00d800}", or call a function that does full-unicode to UTF-16 encoding, or simply create a string from file contents you may end up with string elements containing upper or lower half surrogates.
> I don't agree in the case of "\u{00d800}". That's simply an illegal code point, not a code unit (upper or lower half). We can reject it statically.


On Feb 20, 2012, at 4:19 AM, Wes Garland wrote:

> I think so, too -- especially as any sequence of Unicode code points -- including invalid and reserved code points -- constitutes a valid Unicode string, according to my recollection of the Unicode specification.

For the moment, I'll simply take Wes' word for the above, as it logically makes sense.  For some uses, you want to process all possible code points (for example, when validating data from an external source).  At this lowest level you don't want to impose higher level Unicode semantic constraints:

       if (stringFromElseWhere.indexOf("\u{d800}") !== -1) ....
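
A minimal sketch of that kind of lowest-level check, assuming string elements are 16-bit values as they are today. The helper name findLoneSurrogates is hypothetical, and it uses charCodeAt so it does not depend on which escape syntax ends up being legal:

```javascript
// Hypothetical helper (not part of any proposal): returns the indices of
// string elements that are surrogate halves without a matching partner.
function findLoneSurrogates(str) {
  const indices = [];
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      // Upper half: well formed only if a lower half follows immediately.
      const next = i + 1 < str.length ? str.charCodeAt(i + 1) : 0;
      if (next >= 0xdc00 && next <= 0xdfff) {
        i++; // skip the lower half of a well-formed pair
      } else {
        indices.push(i);
      }
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      // Lower half that was not preceded by an upper half.
      indices.push(i);
    }
  }
  return indices;
}

// Example input standing in for data that arrived from an external source.
const stringFromElseWhere = "ok" + String.fromCharCode(0xd800) + "!";
const loneIndices = findLoneSurrogates(stringFromElseWhere);
```

A validator like this can only be written against ES strings if the string type is allowed to hold the values it is looking for.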

>>     Eliminating the "\uXXXX" syntax really doesn't change anything regarding actual string processing.
> True, but not my point!

But elsewhere you said you would reject String.fromCharCode(0xd800), so it sounds to me like you are trying to actually ban the occurrence of 0xd800 as the value of a string element.

>> What it might do, however, is eliminate the ambiguity about the intended meaning of  "\uD800\uDc00" in legacy code.
> And arising from concatenations, avoiding the loss of Gavin's distributive .length property.

These aren't the same thing.

   "\ud800\udc00" is a specific syntactic construct where there must have been a specific user intent in writing it. Our legacy problem is that the intent becomes ambiguous when that same sequence might be interpreted under different BRS settings.

   str1 + str2 is much less specific, and all we know at runtime (assuming either str1 or str2 is a string) is that the user wants to concatenate them.   The values might be:
       str1 = String.fromCharCode(0xd800);

and the user might be intentionally constructing a string containing an explicit UTF-16 encoding that is going to be passed off to an external agent that specifically requires UTF-16.
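
To make that scenario concrete, here is a sketch using the existing 16-bit string elements. The value of str2 and the helper toUtf16BE are my assumptions for illustration; big-endian byte order is an arbitrary choice for the hypothetical external agent:

```javascript
// Two strings, each holding one half of a surrogate pair.
const str1 = String.fromCharCode(0xd800); // upper half
const str2 = String.fromCharCode(0xdc00); // lower half (assumed value)

// Plain concatenation: the result simply has both elements.
const joined = str1 + str2;
const lengthIsDistributive = joined.length === str1.length + str2.length;

// Hypothetical helper: writes each string element out as two bytes,
// big-endian, for a consumer that specifically requires UTF-16.
function toUtf16BE(str) {
  const bytes = new Uint8Array(str.length * 2);
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    bytes[2 * i] = unit >> 8;       // high byte of the element
    bytes[2 * i + 1] = unit & 0xff; // low byte of the element
  }
  return bytes;
}

const wireBytes = toUtf16BE(joined); // bytes D8 00 DC 00
```

If String.fromCharCode(0xd800) were rejected, the first line would already fail and none of this could be expressed with strings.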

Another way to express what I see as the problem with what you are proposing about imposing such string semantics:

Could the revised ECMAScript be used to implement a language that had similar but not identical semantic rules to those you are suggesting for ES strings?  My sense is that if we went down the path you are suggesting, such an implementation would have to use binary data arrays for all of its internal string processing and could not use ES string functions to process them.
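
A rough sketch of what such an implementation might be forced into, assuming typed arrays are the binary data arrays available. The class HostedString and its methods are hypothetical names, not anything proposed in this thread:

```javascript
// Hypothetical string type for a hosted language whose strings may hold
// arbitrary 16-bit values, including unpaired surrogate halves.
class HostedString {
  constructor(units) {
    // Copy the code units into a typed array instead of an ES string.
    this.units = Uint16Array.from(units);
  }

  get length() {
    return this.units.length;
  }

  // Concatenation has to be reimplemented; ES "+" is not usable here.
  concat(other) {
    const out = new Uint16Array(this.length + other.length);
    out.set(this.units, 0);
    out.set(other.units, this.length);
    return new HostedString(out);
  }

  // So does searching, and every other String.prototype function.
  indexOfUnit(unit) {
    return this.units.indexOf(unit);
  }
}

const upper = new HostedString([0xd800]);
const lower = new HostedString([0xdc00]);
const pair = upper.concat(lower);
```

Every operation the hosted language needs would have to be rebuilt this way, which is the cost I am worried about.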

