Unicode support in new ES6 spec draft

Norbert Lindenberg ecmascript at norbertlindenberg.com
Thu Jul 19 16:16:13 PDT 2012


On Jul 18, 2012, at 19:42 , Gillam, Richard wrote:

>>> But fromCodeUnit() wouldn't let you pass in values above \xffff, would it?
>> 
>> True. Is this important enough to application developers to warrant a separate method?
> 
> I tend to think so.  It seems like I ought to be able to pass some function the value 0x1d15f and get back a string containing the quarter-note character; otherwise, we're still privileging BMP characters.

Well, the function that does that is String.fromCodePoint(). The question is, is there enough value to warrant a separate fromCodeUnit() method.
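
For illustration, a minimal sketch of the quarter-note case, assuming String.fromCodePoint behaves as proposed in the draft (the variable names are only for this example):

```javascript
// Assumes String.fromCodePoint as proposed for ES6.
const quarterNote = String.fromCodePoint(0x1D15F);

// The result is one supplementary character stored as two code units
// (a surrogate pair).
const units = [quarterNote.charCodeAt(0), quarterNote.charCodeAt(1)];
console.log(quarterNote.length, units.map(u => u.toString(16)));

// For contrast, the existing String.fromCharCode converts its argument
// with ToUint16, so only the low 16 bits of 0x1D15F survive.
const truncated = String.fromCharCode(0x1D15F);
console.log(truncated.length, truncated.charCodeAt(0).toString(16));
```

A code-unit-based function can still build the same string, but only if the caller passes the two surrogate values separately.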

>>>> Requiring the correct position is intentional; it's the same in java.lang.String.codePointAt. If we want to support backwards iteration, we could add codePointBefore.
>>> 
>>> Why is it intentional?  I don't see the value in restricting it.  You've mentioned you're optimizing for the forward-iteration case and want to have a separate API for the backward-iteration case.  What about the random-access case?  Is there no such case?  Worse, it seems like if you use this API for backward iteration or random access, you don't get an error; you just get *the wrong answer*, and that seems dangerous.  [I guess the "wrong answer" is an unpaired surrogate value, which would tip the caller off that he's doing something wrong, but that still seems like extra code he'd need to write.]
>> 
>> I think the question is the other way round: Is there a valid and common use case that requires random access?
> 
> If there isn't, then I don't think we should have an API for random access; we should just have iterators.  This API looks like it supports random access, but it really doesn't.

I misunderstood what you meant by random access - I thought you meant truly random positions, where you don't know whether a position refers to the first or the second code unit of a code point. It's this kind of randomness that I don't expect to see in practice.

You meant random as opposed to fully sequential, that is, functions that operate on a small section of a larger string. That does happen a lot - examples: a function to find the next token in program source code; a function that has to find word breaks or hyphenation points around the position where a line overflows; a subexpression matcher of a regular expression. But in these cases it's reasonable to require all code involved to maintain the invariant that positions passed around are always those of the first code unit of a code point, never the second of a supplementary code point.
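
A small sketch of what that invariant means in practice, assuming codePointAt as proposed; isCodePointStart is a hypothetical helper written for this example, not part of the proposal:

```javascript
const s = "a\uD834\uDD5Fb"; // "a", U+1D15F, "b": four code units

// Position 1 is the first code unit of the supplementary code point.
const atFirstUnit = s.codePointAt(1);
// Position 2 is its second code unit; the result is the lone low surrogate.
const atSecondUnit = s.codePointAt(2);

// Hypothetical helper: true unless pos is the second code unit of a
// surrogate pair. Assumes 0 <= pos < str.length.
function isCodePointStart(str, pos) {
  const unit = str.charCodeAt(pos);
  const isLowSurrogate = unit >= 0xDC00 && unit <= 0xDFFF;
  if (!isLowSurrogate || pos === 0) return true;
  const prev = str.charCodeAt(pos - 1);
  return !(prev >= 0xD800 && prev <= 0xDBFF);
}

console.log(atFirstUnit.toString(16), atSecondUnit.toString(16));
console.log(isCodePointStart(s, 1), isCodePointStart(s, 2));
```

Code that honors the invariant never needs such a check at run time; it would mainly be useful in assertions while debugging.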

> Come to think of it, I'm not quite sure how to use it for forward iteration-- wouldn't you want some kind of "codePointAfter" or "indexOfNextChar" function?  What am I missing here?

Iteration has to increment the position by 2 if the code point is greater than 0xFFFF - that's what the iterator does internally:
http://norbertlindenberg.com/2012/05/ecmascript-supplementary-characters/index.html#String
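
Roughly, forward iteration can be sketched like this, assuming codePointAt as proposed; the function name codePoints is made up for this example, and this is not the iterator code from the proposal itself:

```javascript
// Collects the code points of a string by walking it front to back.
function codePoints(str) {
  const result = [];
  let pos = 0;
  while (pos < str.length) {
    const cp = str.codePointAt(pos);
    result.push(cp);
    // Supplementary code points occupy two code units, all others one.
    pos += cp > 0xFFFF ? 2 : 1;
  }
  return result;
}

console.log(codePoints("a\uD834\uDD5Fb").map(cp => cp.toString(16)));
```

Because the loop starts at 0 and always advances by the width of the code point just read, every position it passes to codePointAt is the first code unit of a code point (or an unpaired surrogate).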


