Unicode support in new ES6 spec draft

Norbert Lindenberg ecmascript at norbertlindenberg.com
Mon Jul 23 20:10:03 PDT 2012


On Jul 19, 2012, at 16:26 , Gillam, Richard wrote:

>> I misunderstood how you meant random access - I thought you meant truly random positions for which you don't know whether they're for the first or the second code unit of a code point. It's this kind of randomness that I don't expect to see in practice.
>> 
>> You meant random as opposed to fully sequential, that is, functions that operate on a small section of a larger string. That does happen a lot - examples: a function to find the next token in program source code; a function that has to find word breaks or hyphenation points around the position where a line overflows; a subexpression matcher of a regular expression. But in these cases it's reasonable to require all code involved to maintain the invariant that positions passed around are always those of the first code unit of a code point, never the second of a supplementary code point.
> 
> No, we're talking past each other.  I did mean truly "random" access.  And while there may indeed not be a use case for it, you're proposing an API that lets the caller pass an arbitrary offset into a string as input and get back the code value at that position, but it doesn't actually do that.  If the position I happen to pick is a trailing surrogate, I just get that value back, not the value of the underlying character.  This feels wrong to me.
> 
> If I know what's going on, I can use the unpaired surrogate value as an indication I should back up one space and try again, but that requires I know what's going on.  I guess what I'm questioning is whether most developers will understand the encoding well enough to know to do this.  If we think anyone using this API is likely to understand Unicode well enough to know this, and that it's also highly unlikely anyone would be using this API that way in the first place, then maybe it's okay, but it still kind of feels like a hole to me.
> 
>>> Come to think of it, I'm not quite sure how to use it for forward iteration-- wouldn't you want some kind of "codePointAfter" or "indexOfNextChar" function?  What am I missing here?
>> 
>> Iteration has to increment the position by 2 if the code point is greater than 0xFFFF - that's what the iterator does internally:
> 
> That's what I thought.  Again, this requires that the caller know how the encoding works.  And again, maybe that's okay, but it makes me a little uncomfortable.

On both of the above: If an application uses indices into UTF-16 strings, its developers have to understand UTF-16. That's why Allen was pushing for UTF-32, but in the choice between making things easier for developers and maintaining compatibility with existing code, compatibility won.
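
To make the two cases above concrete, here is a minimal sketch of what index-based code has to do, assuming codePointAt behaves as discussed in this thread (it returns the bare surrogate value when the index points at a trailing surrogate). The helper names isSecondHalfOfPair, codePointStart, and codePoints are hypothetical, made up for illustration, and are not part of the spec draft.

```javascript
// Hypothetical helpers for illustration only; not part of the spec draft.

// True if the code unit at `index` is a trailing (low) surrogate that
// directly follows a leading (high) surrogate, i.e. the second half of a pair.
function isSecondHalfOfPair(str, index) {
  if (index <= 0 || index >= str.length) return false;
  const unit = str.charCodeAt(index);
  const prev = str.charCodeAt(index - 1);
  return unit >= 0xDC00 && unit <= 0xDFFF &&
         prev >= 0xD800 && prev <= 0xDBFF;
}

// "Random" access: back up one code unit when `index` lands in the
// middle of a surrogate pair, otherwise return it unchanged.
function codePointStart(str, index) {
  return isSecondHalfOfPair(str, index) ? index - 1 : index;
}

// Forward iteration: advance by 2 code units for supplementary
// code points (greater than 0xFFFF), otherwise by 1.
function codePoints(str) {
  const result = [];
  let i = 0;
  while (i < str.length) {
    const cp = str.codePointAt(i);
    result.push(cp);
    i += cp > 0xFFFF ? 2 : 1;
  }
  return result;
}
```

With a string such as "a\uD83D\uDE00b", codePointAt(2) on its own yields the trailing surrogate 0xDE00, while codePointAt(codePointStart(str, 2)) yields the full code point 0x1F600. Either way, the caller has to know that positions are counted in UTF-16 code units.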

Hopefully we can simplify things at a higher level of abstraction, e.g., by providing a richer API for iterators.

Norbert


