Unicode support in new ES6 spec draft

Norbert Lindenberg ecmascript at norbertlindenberg.com
Tue Jul 17 13:05:57 PDT 2012

And more comments…

On Jul 16, 2012, at 14:54 , Gillam, Richard wrote:

> Commenting on Norbert's comments…


>> We should stay away from the terms "character", "Unicode character", or "Unicode scalar value" however.
> Points taken.  I'll withdraw my suggestion of "Unicode scalar value."  As for "character," I understand where you're coming from, but I don't think it's really all that bad to use "character," as long as we define precisely what we mean by it.  (The fact that Unicode itself doesn't have a formal definition of "character" helps.)
> "Unicode code point" and "UTF-16 code unit" are more precise, with full definitions in the Unicode standard, but "character" may make reading easier for the non-Unicode geeks in the audience, at least in situations where the precise meaning isn't important.

In bug 524, I softened this a bit: "The term "Unicode character" can be used when only assigned characters are meant, e.g., when referring to individual characters such as "comma" or "reverse solidus", or to the characters that can be used in identifiers."

>> fromCodeUnit seems rather redundant. Note that any code unit sequence it accepts would be equally accepted, with the same result, by fromCodePoints, as that function accepts surrogate code points and then, in the conversion to UTF-16, erases the distinction between surrogate code points and surrogate code units.
> But fromCodeUnit() wouldn't let you pass in values above 0xFFFF, would it?

True. Is this important enough to application developers to warrant a separate method?
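
A minimal sketch of the redundancy argument above. It assumes that String.fromCodePoint and String.fromCharCode, as they exist in current engines, can stand in for the "fromCodePoints" and "fromCodeUnit" functions named in this thread; that mapping is an assumption, not something the draft under discussion confirms.

```javascript
// Assumption: String.fromCodePoint stands in for "fromCodePoints" and
// String.fromCharCode for "fromCodeUnit" as discussed in this thread.

// U+1F600 is encoded in UTF-16 as the surrogate pair 0xD83D 0xDE00.
const fromSurrogateCodePoints = String.fromCodePoint(0xD83D, 0xDE00);
const fromCodeUnits = String.fromCharCode(0xD83D, 0xDE00);
const fromSupplementary = String.fromCodePoint(0x1F600);

// All three produce the same two-code-unit string, because the conversion
// to UTF-16 erases the distinction between surrogate code points and
// surrogate code units.
const sameResult =
  fromSurrogateCodePoints === fromCodeUnits &&
  fromCodeUnits === fromSupplementary;

// The one difference: the code-unit function does not reject values above
// 0xFFFF, it keeps only the low 16 bits of each argument.
const truncatedUnit = String.fromCharCode(0x1F600).charCodeAt(0);

console.log(sameResult, truncatedUnit.toString(16));
```

In this sketch the code-unit function silently truncates a value above 0xFFFF rather than reporting it, so it offers no range checking that the code-point function lacks.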

>>>> [I]t looks like [String.prototype.codePointAt()] only works right with surrogate pairs if you specify the position of the first surrogate in the pair.  I think you want it to work right if you specify the position of either element in the pair.
>>> Norbert proposed this function so we should get his thoughts on the addressing issue.  As I wrote this I did think a bit about whether or not we need to provide some support for backward iteration over strings.
>> Requiring the correct position is intentional; it's the same in java.lang.String.codePointAt. If we want to support backwards iteration, we could add codePointBefore.
> Why is it intentional?  I don't see the value in restricting it.  You've mentioned you're optimizing for the forward-iteration case and want to have a separate API for the backward-iteration case.  What about the random-access case?  Is there no such case?  Worse, it seems like if you use this API for backward iteration or random access, you don't get an error; you just get *the wrong answer*, and that seems dangerous.  [I guess the "wrong answer" is an unpaired surrogate value, which would tip the caller off that he's doing something wrong, but that still seems like extra code he'd need to write.]

I think the question is the other way round: Is there a valid and common use case that requires random access?
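
The positions discussed above can be illustrated with a short sketch. It assumes codePointAt behaves as described in this thread (the full code point only at the position of the first surrogate). The helpers codePointBefore and codePointContaining are hypothetical: the first is modelled on java.lang.String.codePointBefore, the second is one possible random-access wrapper, and neither is part of the draft.

```javascript
const text = "a\u{1F600}b"; // code units: 0x61, 0xD83D, 0xDE00, 0x62

const atLead = text.codePointAt(1);  // position of the first surrogate
const atTrail = text.codePointAt(2); // position of the second surrogate

// Forward iteration: step by two code units after a supplementary code point.
function codePointsForward(s) {
  const result = [];
  for (let i = 0; i < s.length; ) {
    const cp = s.codePointAt(i);
    result.push(cp);
    i += cp > 0xFFFF ? 2 : 1;
  }
  return result;
}

// Hypothetical helper for backward iteration: the code point that ends
// just before `index`. Assumes 1 <= index <= s.length.
function codePointBefore(s, index) {
  const trail = s.charCodeAt(index - 1);
  if (trail >= 0xDC00 && trail <= 0xDFFF && index >= 2) {
    const lead = s.charCodeAt(index - 2);
    if (lead >= 0xD800 && lead <= 0xDBFF) {
      return (lead - 0xD800) * 0x400 + (trail - 0xDC00) + 0x10000;
    }
  }
  return trail;
}

// Hypothetical random-access wrapper: accepts the position of either
// element of a surrogate pair.
function codePointContaining(s, index) {
  const unit = s.charCodeAt(index);
  if (unit >= 0xDC00 && unit <= 0xDFFF && index > 0) {
    const prev = s.charCodeAt(index - 1);
    if (prev >= 0xD800 && prev <= 0xDBFF) return s.codePointAt(index - 1);
  }
  return s.codePointAt(index);
}

console.log(atLead.toString(16), atTrail.toString(16));
console.log(codePointsForward(text).map((cp) => cp.toString(16)));
```

Here the call at the second surrogate's position yields the unpaired surrogate value 0xDE00, which is the "wrong answer" case raised in the quoted text; the wrapper shows roughly how much extra code a caller would need for random access.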
