Unicode support in new ES6 spec draft

Gillam, Richard gillam at lab126.com
Mon Jul 16 14:54:07 PDT 2012


Commenting on Norbert's comments…

> Rich's comment was on the lack of any version number for ISO 10646, not on the Unicode version number. We can simplify the statement in clause 2 to "A conforming implementation of this Standard shall interpret characters in conformance with the Unicode Standard and ISO/IEC 10646, both in the versions referenced in clause 3."

I like it.

> We should stay away from the terms "character", "Unicode character", or "Unicode scalar value" however.

Points taken.  I'll withdraw my suggestion of "Unicode scalar value."  As for "character," I understand where you're coming from, but I don't think it's really all that bad to use "character," assuming we define clearly and precisely what we mean by it.  (The fact that Unicode itself doesn't have a formal definition of "character" helps.)

"Unicode code point" and "UTF-16 code unit" are more precise, with full definitions in the Unicode standard, but "character" may make reading easier for the non-Unicode geeks in the audience, at least in situations where the precise meaning isn't important.

> This paragraph is really about the fact that some implementations will support Unicode 6.1 or later by the time ES6 becomes a standard, while others will be stuck at Unicode 5.1. Using characters that were introduced in Unicode 6.1 in identifiers would mean that the application only runs on implementations based on Unicode 6.1 or higher, not on those based on Unicode 6.0 or lower.

Fair enough.

> Careful here. I think we have to treat \uDxxx\uDyyy, where 0x800 ≤ xxx < 0xC00 ≤ yyy ≤ 0xFFF, as a single code point in all situations. There are tools around that convert any non-ASCII characters into (old-style) Unicode escapes.

What I was getting, perhaps erroneously, from Allen's comments is that \ud800\udc00, \u{d800}\u{dc00}, and \u{10000} are all equivalent, all the time, and in all situations where the implementation assigns meaning to the characters, they all refer to the character U+10000.  This seems like the right way to go.
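
To make the equivalence concrete, here is a minimal sketch for string literals only, assuming an engine that implements the \u{...} escape as it was eventually standardized. The helper combineSurrogates is a hypothetical name introduced for illustration; it applies the standard UTF-16 arithmetic to a lead/trail pair.

```javascript
// Three spellings of U+10000 inside a string literal.
const viaSurrogateEscapes = "\ud800\udc00";
const viaBracedSurrogates = "\u{d800}\u{dc00}";
const viaBracedCodePoint = "\u{10000}";

// Hypothetical helper (not from the thread): combine a surrogate pair
// into the code point it encodes, using the UTF-16 decoding formula.
function combineSurrogates(lead, trail) {
  return (lead - 0xd800) * 0x400 + (trail - 0xdc00) + 0x10000;
}

const allEqual =
  viaSurrogateEscapes === viaBracedSurrogates &&
  viaBracedSurrogates === viaBracedCodePoint;

console.log(allEqual);                                       // true
console.log(viaBracedCodePoint.length);                      // 2 (UTF-16 code units)
console.log(combineSurrogates(0xd800, 0xdc00).toString(16)); // 10000
```

This only demonstrates the string-literal case; whether the same equivalence holds in identifiers or regular expressions is a separate question that this sketch does not settle.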

> fromCodeUnit seems rather redundant. Note that any code unit sequence it accepts would be equally accepted, with the same result, by fromCodePoints, as that function accepts surrogate code points and then, in the conversion to UTF-16, erases the distinction between surrogate code points and surrogate code units.

But fromCodeUnit() wouldn't let you pass in values above 0xFFFF, would it?
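
A rough illustration of the difference, using the functions that actually exist in the language rather than the names proposed in this thread: String.fromCharCode is assumed here to be the closest analogue of the proposed fromCodeUnit, and String.fromCodePoint the analogue of fromCodePoints. Note that fromCharCode does not reject a value above 0xFFFF; it truncates it to 16 bits.

```javascript
// Code-unit based: each argument is truncated to 16 bits,
// so 0x10000 wraps around to 0x0000 instead of raising an error.
const truncated = String.fromCharCode(0x10000);

// Code-point based: values up to 0x10FFFF are accepted and
// encoded as a surrogate pair where necessary.
const supplementary = String.fromCodePoint(0x10000);

// Surrogate code points passed separately end up as the same two
// code units, which is the redundancy described in the quoted comment.
const fromSurrogates = String.fromCodePoint(0xd800, 0xdc00);

console.log(truncated === "\u0000");            // true
console.log(supplementary === "\ud800\udc00");  // true
console.log(fromSurrogates === supplementary);  // true
```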

> codeUnitAt is clearly a better name than charAt (especially in a language that doesn't have a char data type), but since we can't get rid of charAt, I'm not sure it's worth adding codeUnitAt.

I still think it helps.

> I'm not aware of any char32At function in Java. Do you mean codePointAt? That's in both Java and the ES6 draft.

I'd have to go back and look at the doc, but yeah: I probably mean codePointAt().

>>> - p. 220, §§15.5.4.17 and 15.5.4.19: Maybe this is a question for Norbert: Are we allowing somewhere for versions of toLocaleUpperCase() and toLocaleLowerCase() that let you specify the locale as a parameter instead of just using the host environment's default locale?
>> 
>> this is covered by the I18N API spec. Right?
> 
> It's not in the Internationalization API edition 1, but seems a prime candidate for edition 2.

I agree.
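
For illustration only: later editions of the Internationalization API did define an optional locale argument for these methods. The sketch below assumes an engine that implements that argument; upperCaseFor is a hypothetical wrapper name, and the Turkish result depends on the locale data the engine ships with.

```javascript
// Hypothetical wrapper: upper-case text for an explicitly named locale
// instead of the host environment's default locale.
function upperCaseFor(locale, text) {
  return text.toLocaleUpperCase(locale);
}

console.log(upperCaseFor("en-US", "title")); // TITLE

// With Turkish locale data available, the lowercase "i" is expected to
// map to a dotted capital I rather than the ASCII "I".
console.log(upperCaseFor("tr", "title"));
```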

>>> [I]t looks like [String.prototype.codePointAt()] only works right with surrogate pairs if you specify the position of the first surrogate in the pair.  I think you want it to work right if you specify the position of either element in the pair.
>> 
>> Norbert proposed this function so we should get his thoughts on the addressing issue.  As I wrote this I did think a bit about whether or not we need to provide some support for  backward iteration over strings.
> 
> Requiring the correct position is intentional; it's the same in java.lang.String.codePointAt. If we want to support backwards iteration, we could add codePointBefore.

Why is it intentional?  I don't see the value in restricting it.  You've mentioned you're optimizing for the forward-iteration case and want to have a separate API for the backward-iteration case.  What about the random-access case?  Is there no such case?  Worse, it seems like if you use this API for backward iteration or random access, you don't get an error; you just get *the wrong answer*, and that seems dangerous.  [I guess the "wrong answer" is an unpaired surrogate value, which would tip the caller off that he's doing something wrong, but that still seems like extra code he'd need to write.]
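
A small sketch of the behavior in question, using codePointAt as it was eventually standardized. The helper codePointAtEither is hypothetical (it is not in the draft or in Java); it shows roughly the extra code a caller has to write to get the full code point from the position of either half of a pair.

```javascript
const text = "a\u{10000}b"; // code units: a, D800, DC00, b

// Index of the lead surrogate: the full code point.
console.log(text.codePointAt(1).toString(16)); // 10000
// Index of the trail surrogate: only the unpaired trail value, no error.
console.log(text.codePointAt(2).toString(16)); // dc00

// Hypothetical helper: accept the index of either half of a surrogate pair.
function codePointAtEither(str, index) {
  const unit = str.charCodeAt(index);
  const isTrail = unit >= 0xdc00 && unit <= 0xdfff;
  if (isTrail && index > 0) {
    const prev = str.charCodeAt(index - 1);
    if (prev >= 0xd800 && prev <= 0xdbff) {
      // Step back to the lead surrogate and decode the whole pair.
      return str.codePointAt(index - 1);
    }
  }
  return str.codePointAt(index);
}

console.log(codePointAtEither(text, 2).toString(16)); // 10000
```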

--Rich


> There are more issues with this function, which I'll comment on separately.
> 


