Unicode support in new ES6 spec draft

Gillam, Richard gillam at lab126.com
Thu Jul 19 16:26:35 PDT 2012


Norbert--

>> I tend to think so.  It seems like I ought to be able to pass some function the value 0x1d15f and get back a string containing the quarter-note character; otherwise, we're still privileging BMP characters.
> 
> Well, the function that does that is String.fromCodePoint(). The question is, is there enough value to warrant a separate fromCodeUnit() method.

Sorry, my mistake.  In that case, yes, you're right-- there probably isn't much value in a separate fromCodeUnit() method.
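
To make the quarter-note case concrete, here is a minimal sketch. It assumes String.fromCodePoint() behaves as in the proposal under discussion (accepting any code point up to 0x10FFFF); the variable names are only illustrative.

```javascript
// U+1D15F MUSICAL SYMBOL QUARTER NOTE lies outside the BMP.
const quarterNote = String.fromCodePoint(0x1d15f);

// In UTF-16 it occupies two code units, a surrogate pair.
const units = [quarterNote.charCodeAt(0), quarterNote.charCodeAt(1)];

// The existing String.fromCharCode() already builds strings from raw
// code units, so it yields the same string from the two surrogates.
const fromUnits = String.fromCharCode(0xd834, 0xdd5f);

console.log(quarterNote.length);        // number of code units
console.log(fromUnits === quarterNote); // same string either way
```

Since String.fromCharCode() already covers the code-unit case, this is consistent with a separate fromCodeUnit() adding little.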

> I misunderstood how you meant random access - I thought you meant truly random positions for which you don't know whether they're for the first or the second code unit of a code point. It's this kind of randomness that I don't expect to see in practice.
> 
> You meant random as opposed to fully sequential, that is, functions that operate on a small section of a larger string. That does happen a lot - examples: a function to find the next token in program source code; a function that has to find word breaks or hyphenation points around the position where a line overflows; a subexpression matcher of a regular expression. But in these cases it's reasonable to require all code involved to maintain the invariant that positions passed around are always those of the first code unit of a code point, never the second of a supplementary code point.

No, we're talking past each other.  I did mean truly "random" access.  And while there may indeed not be a use case for it, you're proposing an API that lets the caller pass an arbitrary offset into a string as input and get back the code value at that position, but it doesn't actually do that.  If the position I happen to pick is a trailing surrogate, I just get that value back, not the value of the underlying character.  This feels wrong to me.

If I know what's going on, I can use the unpaired surrogate value as an indication I should back up one space and try again, but that requires I know what's going on.  I guess what I'm questioning is whether most developers will understand the encoding well enough to know to do this.  If we think anyone using this API is likely to understand Unicode well enough to know this, and that it's also highly unlikely anyone would be using this API that way in the first place, then maybe it's okay, but it still kind of feels like a hole to me.
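
A rough sketch of the "back up one space and try again" workaround, assuming codePointAt() returns the bare trailing surrogate when the index lands on one. The helper codePointContaining() is hypothetical and not part of any proposal.

```javascript
// Hypothetical helper: returns the code point that contains the code unit
// at `pos`, backing up one position when `pos` lands on the trailing half
// of a well-formed surrogate pair.
function codePointContaining(str, pos) {
  const unit = str.charCodeAt(pos);
  const isTrail = unit >= 0xdc00 && unit <= 0xdfff;
  if (isTrail && pos > 0) {
    const prev = str.charCodeAt(pos - 1);
    const isLead = prev >= 0xd800 && prev <= 0xdbff;
    if (isLead) {
      return str.codePointAt(pos - 1);
    }
  }
  // Either a BMP character, a leading surrogate, or an unpaired surrogate.
  return str.codePointAt(pos);
}

// "a", then the quarter note (two code units), then "b".
const s = "a\u{1d15f}b";
const raw = s.codePointAt(2);            // the trailing surrogate by itself
const fixed = codePointContaining(s, 2); // the full supplementary code point
```

The caller still has to know that such a helper is needed, which is the gap described above.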

>> Come to think of it, I'm not quite sure how to use it for forward iteration-- wouldn't you want some kind of "codePointAfter" or "indexOfNextChar" function?  What am I missing here?
> 
> Iteration has to increment the position by 2 if the code point is greater than 0xFFFF - that's what the iterator does internally:

That's what I thought.  Again, this requires that the caller know how the encoding works.  And again, maybe that's okay, but it makes me a little uncomfortable.
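
For reference, manual forward iteration along the lines described would look roughly like this. It is a sketch only, not the code from the quoted message, and codePointsOf() is a made-up name.

```javascript
// Hypothetical helper: collects the code points of `str` by walking its
// code units, stepping 2 past supplementary characters and 1 otherwise.
function codePointsOf(str) {
  const result = [];
  let i = 0;
  while (i < str.length) {
    const cp = str.codePointAt(i);
    result.push(cp);
    // Code points above 0xFFFF occupy two code units in UTF-16.
    i += cp > 0xffff ? 2 : 1;
  }
  return result;
}

const points = codePointsOf("a\u{1d15f}b");
console.log(points.map((cp) => cp.toString(16)));
```

The "increment by 2" step is exactly the piece of encoding knowledge the caller has to supply.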

--Rich


