Unicode support in new ES6 spec draft

Norbert Lindenberg ecmascript at norbertlindenberg.com
Wed Jul 11 12:31:19 PDT 2012


I haven't reviewed the new spec draft in detail yet, but have some comments on the comments from Rich and Allen - see below.

Norbert


On Jul 10, 2012, at 20:53 , Allen Wirfs-Brock wrote:

> 
> On Jul 10, 2012, at 7:50 PM, Gillam, Richard wrote:
> 
>> Allen--
>> 
>> A few comments on the i18n/Unicode-related stuff in the latest draft:
>> 
>> - p. 1, §2: It seems a little weird here to be specifying a particular version of the Unicode standard but not of ISO 10646.  Down in section 3, you _do_ nail down the version of 10646 and it's long, so I can see why you don't want all this verbiage in section 2 as well, but maybe you want more than you have?
> 
> I'm not sure.  This was Norbert's recommendation.  I liked what we did in ES5, where we specified version 3.0 as the minimum version for which there was guaranteed interoperability but allowed use of more recent versions.  Input on what would make the most sense is appreciated.

Rich's comment was on the lack of any version number for ISO 10646, not on the Unicode version number. We can simplify the statement in clause 2 to "A conforming implementation of this Standard shall interpret characters in conformance with the Unicode Standard and ISO/IEC 10646, both in the versions referenced in clause 3."

>> - p. 14 §6: More substantively, do you really need to go into this level of detail as to what a "Unicode character" is?  I would think you could say something like "ECMAScript source text is a sequence of Unicode abstract code point values (or, in this spec, "Unicode characters").  The actual representation of those characters in bits (e.g., UTF-16 or UTF-32 or even a non-Unicode encoding) is implementation-dependent, but a conforming implementation must process source text as if it were an equivalent sequence of SourceCharacter values."  I think that for the purposes of this spec, how "Unicode code point" maps to a normal human's idea of "character" is irrelevant; you can define "character" to mean the same thing as Unicode means when it says "code point" and be done with it.  (This probably means you can either get rid of the next paragraph, or at least that that paragraph is entirely informative.)
> 
> First, I have to say that there will probably be some controversy about this section in TC39.  Norbert's proposal was that we specify SourceCharacter as always being UTF-16 encoded, while I went in the direction of essentially defining it as abstract characters identified by code points. No doubt there will be additional discussion about this.
> 
> There is enough confusion concerning ECMAScript source code (code point vs code unit, UTF-16 or not, etc.) in previous editions that I wanted to be as clear as possible.  The key point is just that the ECMAScript specification assigns meanings to certain Unicode characters/character sequences and that those meanings are independent of any file encoding processing that may take place within an implementation.

I think basing the specification on UTF-16 code units as source code would be easier, but using Unicode code points as the basis isn't wrong either.

We should stay away from the terms "character", "Unicode character", or "Unicode scalar value" however.

For "character", people have different ideas what the term means, and redefining it, as ES5 did, would just add to the confusion.

"Unicode character" is not defined in the Unicode standard, as far as I can tell, but seems to be used in the sense of "code point assigned to abstract character" or possibly "designated code point". With either definition, it would exclude code points reserved for future assignment, such as characters that were added in Unicode 6.1 if your implementation was based on Unicode 5.1. Such a restriction would be a constant source of interoperability problems.

"Unicode scalar value" is defined in the Unicode standard as "Any Unicode code point except high-surrogate and low-surrogate code points." We cannot exclude surrogate code points from source code, as this would break compatibility with existing code.

"Unicode code point" and "UTF-16 code unit" are the terms we have to use most of the time.
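
A minimal sketch of the code point / code unit / scalar value distinction, assuming an engine that implements codePointAt and code-point-wise string iteration via Array.from (both as later standardized, which may differ from the draft under discussion). The helper isScalarValue is hypothetical and only restates the Unicode definition quoted above.

```javascript
// U+1D306 lies outside the BMP, so UTF-16 needs two code units (a surrogate pair) for it.
const s = "\uD834\uDF06";

// length and charCodeAt operate on UTF-16 code units.
const codeUnits = [];
for (let i = 0; i < s.length; i++) {
  codeUnits.push(s.charCodeAt(i).toString(16));
}

// String iteration (used by Array.from) and codePointAt operate on code points.
const codePoints = Array.from(s, (ch) => ch.codePointAt(0).toString(16));

// Hypothetical helper: a scalar value is any code point that is not a surrogate.
function isScalarValue(cp) {
  return cp >= 0 && cp <= 0x10ffff && !(cp >= 0xd800 && cp <= 0xdfff);
}

console.log(codeUnits);             // two entries: d834, df06
console.log(codePoints);            // one entry: 1d306
console.log(isScalarValue(0xd834)); // false: a surrogate is a code point but not a scalar value
```

The surrogate code point 0xD834 can appear in source text and in strings, which is why "Unicode scalar value" is too narrow a term here.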

I agree with Rich that we should limit the discussion to what's relevant to the spec.

>> - p. 19, §7.6: I tend to agree with your comment here-- since this was nailed to Unicode 3.0 before, it seems better to stick with that when we're talking about "portability" (although a note explaining why it's not Unicode 5.1 might be helpful).

I disagree. Unicode 5.1 support is part of ES6 just like the "let" and "class" keywords. I assume we're not going to tell programmers to stay away from "let" and "class". Why should we tell them to stay away from Unicode 5.1?

This paragraph is really about the fact that some implementations will support Unicode 6.1 or later by the time ES6 becomes a standard, while others will be stuck at Unicode 5.1. Using characters that were introduced in Unicode 6.1 in identifiers would mean that the application only runs on implementations based on Unicode 6.1 or higher, not on those based on Unicode 6.0 or lower.

>> - p. 24, §7.8.4: In earlier versions of ECMAScript, I could often specify a supplementary-plane character by using two Unicode escape sequences in a row, each representing a surrogate code unit value.  Can I still do that?  It seems like you'd have to support this for backward compatibility, but you're not really supposed to see bare surrogates in any context except for UTF-16 (I don't think they're strictly illegal, except in UTF-8, but the code point sequence <D800 DC00> isn't equivalent to U+10000, either).  I think you want some verbiage here clarifying how this is supposed to work. A \uNNNN escape is a BMP code point so it will always contribute exactly one element to the string value.
> 
> I believe that the algorithmic text of the spec. is clear in this regard.  But informative text could be added.
> 
> What actually happens depends upon contextual details that your 4th item refers to.
> 
> In a string literal, each non-escape sequence character contributes one code point to the literal.  String values are made up of 16-bit elements, with non-BMP code points being UTF-16 encoded as two elements.   A \u{ } escape also represents one code point that, depending upon its value, will contribute one or two elements to the string value.  A \uNNNN escape represents a BMP code point so it is always represented as one string element.  If you have existing code like "\ud800\udc00" you will get the same two element string value that you would get if you wrote "\u{10000}" or "\u{d800}\u{dc00}".  They are all alternative ways of expressing the same two element string value.  The first form must be supported for backwards compatibility.
> 
> Outside of string literals (and friends, such as quasis; I'm actually glossing over a couple of other exceptions) we don't have to worry about UTF-16 encoding because we are dealing with more abstract concepts, such as "identifiers", that we can deal with at the level of Unicode characters.  We also don't have backwards compat. issues because in those contexts current implementations look at surrogate pairs as two distinct "characters", either of which results in a syntax error in all non-literal contexts.  E.g., current implementations reject identifiers containing supplementary characters that are (according to Unicode) legal identifier characters.

Careful here. I think we have to treat \uDxxx\uDyyy, where 0x800 ≤ xxx < 0xC00 ≤ yyy ≤ 0xFFF, as a single code point in all situations. There are tools around that convert any non-ASCII characters into (old-style) Unicode escapes.
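
To make the arithmetic concrete, here is a sketch of how a pair of old-style escapes maps to a single code point. It only exercises string literals, and it assumes an engine that accepts \u{...} escapes; whether the same combining applies in identifiers and other contexts is exactly the open question. The helpers isSurrogatePair and combineSurrogates are hypothetical names, not part of any draft.

```javascript
// Hypothetical helper: true when hi and lo satisfy
// 0xD800 <= hi < 0xDC00 <= lo <= 0xDFFF, i.e. they form a surrogate pair.
function isSurrogatePair(hi, lo) {
  return hi >= 0xd800 && hi < 0xdc00 && lo >= 0xdc00 && lo <= 0xdfff;
}

// Hypothetical helper: standard UTF-16 decoding of a surrogate pair.
function combineSurrogates(hi, lo) {
  return 0x10000 + ((hi - 0xd800) << 10) + (lo - 0xdc00);
}

// The kind of output a tool produces when it escapes all non-ASCII characters.
const oldStyle = "\uD800\uDC00";
// The same supplementary code point written with a code point escape.
const newStyle = "\u{10000}";

console.log(oldStyle === newStyle);                          // true
console.log(isSurrogatePair(0xd800, 0xdc00));                // true
console.log(combineSurrogates(0xd800, 0xdc00).toString(16)); // 10000
```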

>> - p. 210, §15.5.3.2: I like the idea of introducing fromCodeUnit(), making this function an alias of that one, and marking this function as obsolete.  But I'm also wondering if it would make more sense for this function to be called fromCodeUnits(), since you can specify a whole list of code units, and they all contribute to the string.
> 
> singular form is following the convention established by fromCharCode.  It could change if nobody objects.

fromCodeUnit seems rather redundant. Note that any code unit sequence it accepts would be equally accepted, with the same result, by fromCodePoints, as that function accepts surrogate code points and then, in the conversion to UTF-16, erases the distinction between surrogate code points and surrogate code units.
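
A sketch of that redundancy, using String.fromCharCode as a stand-in for the proposed fromCodeUnit and assuming String.fromCodePoint (the name that was later standardized; the draft's name may differ) behaves as described above:

```javascript
// "A" followed by the surrogate pair for U+1F600, given as UTF-16 code units.
const units = [0x0041, 0xd83d, 0xde00];

// Code-unit-based construction.
const fromUnits = String.fromCharCode(...units);

// Code-point-based construction accepts the same values as surrogate code points;
// the conversion to UTF-16 erases the distinction.
const fromPoints = String.fromCodePoint(...units);

// The same string built from the supplementary code point directly.
const direct = String.fromCodePoint(0x41, 0x1f600);

console.log(fromUnits === fromPoints); // true
console.log(fromPoints === direct);    // true
```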

>> - p. 210, §15.5.3.3: Same thing: Maybe call this fromCodePoints()?  [Note also you have a copy-and-paste problem on the first line, where it still says "String.fromCharCode()".]  

fromCodePoints would be fine. There are a few more copy-and-paste references to codeUnits, and a "codePoint" missing an "s".

>> - p. 212, §15.5.4.4: I like the idea of adding a new name for this function, but I'm thinking maybe codeUnitAt().  Or do what Java did (IIRC): Add a new function called char32At(), which behaves like this one, except that if the index you give it points to one half of a surrogate pair, you return the code point value represented by the surrogate pair.  (If you don't do some sort of char32At() function, you're probably going to need a function that takes a sequence of UTF-16 code unit values and returns a sequence of Unicode code point values.)
> 
> I also have thought about unicodeCharAt() or perhaps uCharAt().  

codeUnitAt is clearly a better name than charAt (especially in a language that doesn't have a char data type), but since we can't get rid of charAt, I'm not sure it's worth adding codeUnitAt.

I'm not aware of any char32At function in Java. Do you mean codePointAt? That's in both Java and the ES6 draft.

>> - p. 220, §§15.5.4.17 and 15.5.4.19: Maybe this is a question for Norbert: Are we allowing somewhere for versions of toLocaleUpperCase() and toLocaleLowerCase() that let you specify the locale as a parameter instead of just using the host environment's default locale?
> 
> this is covered by the I18N API spec. Right?

It's not in the Internationalization API edition 1, but seems a prime candidate for edition 2.

>> - p. 223, §15.5.4.5: First, did something go haywire with the numbering here?  Second, this sort of addresses my comment above, but if you can't put this and charCodeAt() (or whatever we called it) together in the spec, can you include a pointer in charCodeAt()'s description to here?  Third, it looks like this only works right with surrogate pairs if you specify the position of the first surrogate in the pair.  I think you want it to work right if you specify the position of either element in the pair.  (I think you may have a typo in step 11 as well: shouldn't that be "…or second > 0xDFFF"?)
> 
> It's supposed to be 15.5.4.25. Also, yes about the step 11 typo.  Norbert proposed this function so we should get his thoughts on the addressing issue.  As I wrote this I did think a bit about whether or not we need to provide some support for backward iteration over strings.

At some point we have to give chapter 15 a logical structure again, rather than just offering sediment layers.

Requiring the correct position is intentional; it's the same in java.lang.String.codePointAt. If we want to support backwards iteration, we could add codePointBefore.
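
A sketch of both behaviors, assuming codePointAt as later standardized. The codePointBefore function below is hypothetical: it is modeled loosely on java.lang.String.codePointBefore, except that it returns undefined for an out-of-range index (mirroring codePointAt) where Java would throw.

```javascript
const text = "a\uD83D\uDE00b"; // "a", U+1F600 as a surrogate pair, "b"

// Pointing at the first surrogate yields the full code point;
// pointing at the second yields only the trailing surrogate.
console.log(text.codePointAt(1).toString(16)); // 1f600
console.log(text.codePointAt(2).toString(16)); // de00

// Hypothetical: returns the code point that ends just before `index`.
function codePointBefore(str, index) {
  if (index < 1 || index > str.length) return undefined;
  const lo = str.charCodeAt(index - 1);
  if (lo >= 0xdc00 && lo <= 0xdfff && index >= 2) {
    const hi = str.charCodeAt(index - 2);
    if (hi >= 0xd800 && hi <= 0xdbff) {
      return 0x10000 + ((hi - 0xd800) << 10) + (lo - 0xdc00);
    }
  }
  return lo; // BMP code unit or unpaired surrogate
}

// Backward iteration: step by two code units after a supplementary code point.
function codePointsBackward(str) {
  const out = [];
  let i = str.length;
  while (i > 0) {
    const cp = codePointBefore(str, i);
    out.push(cp);
    i -= cp > 0xffff ? 2 : 1;
  }
  return out;
}

console.log(codePointsBackward(text).map((cp) => cp.toString(16))); // 62, 1f600, 61
```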

There are more issues with this function, which I'll comment on separately.


