Code points vs Unicode scalar values

Brendan Eich brendan at mozilla.com
Wed Sep 4 09:34:09 PDT 2013


> Anne van Kesteren <mailto:annevk at annevk.nl>
> September 4, 2013 9:06 AM
> On Wed, Sep 4, 2013 at 4:58 PM, Brendan Eich<brendan at mozilla.com>  wrote:
>> String.fromCodePoint, rather.
>
> Oops. Any reason this is not just String.from() btw? Give the better
> method a nice short name?

Because of String.fromCharCode precedent. Balanced names with noun 
phrases that distinguish the "from" domains are better than 
longAndPortly vs. tiny.


>>> I'm not sure I'm a big fan of having all three concepts around.
>> You can't avoid it: UTF-8 is a transfer format that can be observed via
>> serialization.
>
> Yes, but it cannot encode lone surrogates. It can only deal in Unicode
> scalar values.

Sure, but you wanted to reduce "three concepts" and I don't see how to 
do that. Most developers can ignore UTF-8, for sure.

Probably I just misunderstood what you meant, and you were simply 
pointing out that lone surrogates arise only from legacy APIs?
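
For concreteness, a minimal sketch of the scalar-value restriction being discussed. It assumes an environment that provides the WHATWG TextEncoder and TextDecoder globals (browsers, recent Node.js); the sample strings are illustrations only:

```javascript
// A string holding one lone trail surrogate (a single 16-bit code unit).
const lone = "\udfff";

// Code-unit view: the legacy API reports the surrogate as-is.
const unit = lone.charCodeAt(0); // 0xDFFF

// UTF-8 view: the encoder deals only in Unicode scalar values, so the
// lone surrogate is replaced by U+FFFD before encoding (bytes EF BF BD).
const bytes = Array.from(new TextEncoder().encode(lone));

// A well-formed surrogate pair (U+1F600) survives a UTF-8 round trip.
const pair = "\ud83d\ude00";
const roundTrip = new TextDecoder().decode(new TextEncoder().encode(pair));
```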
>
>
>> String.prototype.charCodeAt and String.fromCharCode are
>> required for backward compatibility. And ES6 wants to expose code points as
>> well, so three.
>
> Unicode scalar values are code points sans surrogates, i.e. completely
> compatible with what a utf-8 encoder/decoder pair can handle.
>
> Why do you want to expose surrogates?

I'm not sure I do! Sounds scandalous. :-P

Here, from the latest ES6 draft, is 15.5.2.3
String.fromCodePoint ( ...codePoints):

The String.fromCodePoint function may be called with a variable number
of arguments which form the rest parameter codePoints. The following
steps are taken:
1. Assert: codePoints is a well-formed rest parameter object.
2. Let length be the result of Get(codePoints, "length").
3. Let elements be a new List.
4. Let nextIndex be 0.
5. Repeat while nextIndex < length
   a. Let next be the result of Get(codePoints, ToString(nextIndex)).
   b. Let nextCP be ToNumber(next).
   c. ReturnIfAbrupt(nextCP).
   d. If SameValue(nextCP, ToInteger(nextCP)) is false, then throw a
      RangeError exception.
   e. If nextCP < 0 or nextCP > 0x10FFFF, then throw a RangeError
      exception.
   f. Append the elements of the UTF-16 Encoding (clause 6) of nextCP
      to the end of elements.
   g. Let nextIndex be nextIndex + 1.
6. Return the String value whose elements are, in order, the elements
   in the List elements. If length is 0, the empty string is returned.
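
The steps above can be sketched in plain JavaScript roughly as follows. This is an illustration, not the draft's text: the name fromCodePointSketch is hypothetical, unary plus stands in for ToNumber, and Number.isInteger stands in for the SameValue/ToInteger check in step d.

```javascript
// Hypothetical helper that mirrors the draft steps for illustration.
function fromCodePointSketch(...codePoints) {
  const elements = []; // step 3: list of UTF-16 code units
  for (const next of codePoints) { // step 5
    const nextCP = +next; // step b: approximates ToNumber
    // Step d: reject non-integers (NaN, 1.5, ...).
    if (!Number.isInteger(nextCP)) {
      throw new RangeError("Invalid code point " + String(next));
    }
    // Step e: reject values outside the code point range.
    if (nextCP < 0 || nextCP > 0x10FFFF) {
      throw new RangeError("Invalid code point " + String(next));
    }
    // Step f: UTF-16 encoding of nextCP.
    if (nextCP <= 0xFFFF) {
      elements.push(nextCP);
    } else {
      const offset = nextCP - 0x10000;
      elements.push(0xD800 + (offset >> 10)); // lead surrogate
      elements.push(0xDC00 + (offset & 0x3FF)); // trail surrogate
    }
  }
  // Step 6: build the string from the code units ("" when there are none).
  return elements.map((u) => String.fromCharCode(u)).join("");
}
```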


No exposed surrogates here!

Here's the spec for String.prototype.codePointAt:

When the codePointAt method is called with one argument pos, the
following steps are taken:
1. Let O be CheckObjectCoercible(this value).
2. Let S be ToString(O).
3. ReturnIfAbrupt(S).
4. Let position be ToInteger(pos).
5. ReturnIfAbrupt(position).
6. Let size be the number of elements in S.
7. If position < 0 or position ≥ size, return undefined.
8. Let first be the code unit value of the element at index position
   in the String S.
9. If first < 0xD800 or first > 0xDBFF or position+1 = size, then
   return first.
10. Let second be the code unit value of the element at index
    position+1 in the String S.
11. If second < 0xDC00 or second > 0xDFFF, then return first.
12. Return ((first – 0xD800) × 1024) + (second – 0xDC00) + 0x10000.

NOTE The codePointAt function is intentionally generic; it does not
require that its this value be a String object. Therefore it can be
transferred to other kinds of objects for use as a method.
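
As a rough sketch, the codePointAt steps translate to the following standalone function. The name codePointAtSketch and the explicit str parameter (in place of the this value) are assumptions made for illustration; Math.trunc with a NaN fallback approximates ToInteger.

```javascript
// Hypothetical standalone version of the draft's codePointAt steps.
function codePointAtSketch(str, pos) {
  // Step 1: CheckObjectCoercible.
  if (str === null || str === undefined) {
    throw new TypeError("Cannot convert null or undefined to a string");
  }
  const S = String(str); // step 2
  const position = Math.trunc(Number(pos)) || 0; // step 4 (NaN becomes 0)
  const size = S.length; // step 6
  if (position < 0 || position >= size) return undefined; // step 7
  const first = S.charCodeAt(position); // step 8
  // Step 9: not a lead surrogate, or nothing follows it.
  if (first < 0xD800 || first > 0xDBFF || position + 1 === size) {
    return first;
  }
  const second = S.charCodeAt(position + 1); // step 10
  // Step 11: a lead surrogate without a trail surrogate is returned as-is.
  if (second < 0xDC00 || second > 0xDFFF) return first;
  // Step 12: combine the surrogate pair into one code point.
  return (first - 0xD800) * 1024 + (second - 0xDC00) + 0x10000;
}
```

With this sketch, a lone surrogate such as "\udfff" comes back as the number 0xDFFF from steps 9 and 11, while a well-formed pair comes back as a single value above 0xFFFF.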


I take it you are objecting to step 11?


>> Sorry, I missed this: how else (other than the charCodeAt/fromCharCode
>> legacy) are lone surrogates exposed?
>
> "\udfff".codePointAt(0) == "\udfff"
>
> It seems better if that returns "\ufffd", as you'd get with utf-8
> (assuming it accepts code points as input rather than just Unicode
> scalar values, in which case it'd throw).

Maybe. Allen and Norbert should weigh in.
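
To make the suggestion concrete, here is a sketch of what a scalar-value accessor might look like if layered on top of codePointAt. The name scalarValueAtSketch is hypothetical and nothing like it is in the draft; the choice to map every surrogate result to U+FFFD, including the trail half of a well-formed pair addressed directly, is an assumption of this sketch.

```javascript
// Hypothetical accessor: like codePointAt, but never returns a surrogate.
function scalarValueAtSketch(s, pos) {
  const cp = s.codePointAt(pos);
  if (cp === undefined) return undefined; // position out of range
  // Any value in the surrogate range is replaced by U+FFFD.
  return cp >= 0xD800 && cp <= 0xDFFF ? 0xFFFD : cp;
}
```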
>
> The indexing of codePointAt() is also kind of sad as it just passes
> through to charCodeAt(),

I don't see that in the spec cited above.
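
For reference, pos in the steps cited above is a code unit index, so a caller walking a string by code point has to advance by one or two units per step. A minimal sketch, with the hypothetical helper name codePointsOf:

```javascript
// Hypothetical helper: collect the code points of s using codePointAt.
function codePointsOf(s) {
  const out = [];
  for (let i = 0; i < s.length; ) {
    const cp = s.codePointAt(i);
    out.push(cp);
    // A value above 0xFFFF was read from a surrogate pair: skip two units.
    i += cp > 0xFFFF ? 2 : 1;
  }
  return out;
}
```

Calling codePointAt at an index that lands on the trail half of a pair returns just that trail surrogate, per step 9.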

/be

>   which means for any serious usage you need to
> use the iterator anyway. What's the reason codePointAt() exists?
>
>
> Brendan Eich <mailto:brendan at mozilla.com>
> September 4, 2013 8:58 AM
>
>
>> Anne van Kesteren <mailto:annevk at annevk.nl>
>> September 4, 2013 7:48 AM
>> ES6 introduces String.prototype.codePointAt() and
>> String.codePointFrom()
>
> String.fromCodePoint, rather.
>
>> as well as an iterator (not defined). It struck
>> me this is the only place in the platform where we'd expose code point
>> as a concept to developers.
>>
>> Nowadays strings are either 16-bit code units (JavaScript, DOM, etc.)
>> or Unicode scalar values (anytime you hit the network and use utf-8).
>>
>> I'm not sure I'm a big fan of having all three concepts around.
>
> You can't avoid it: UTF-8 is a transfer format that can be observed 
> via serialization. String.prototype.charCodeAt and String.fromCharCode 
> are required for backward compatibility. And ES6 wants to expose code 
> points as well, so three.
>
>> We
>> could have String.prototype.unicodeAt() and String.unicodeFrom()
>> instead, and have them translate lone surrogates into U+FFFD. Lone
>> surrogates are a bug and I don't see a reason to expose them in more
>> places than just the 16-bit code units.
>
> Sorry, I missed this: how else (other than the charCodeAt/fromCharCode 
> legacy) are lone surrogates exposed?
>
> /be
>>
>>


More information about the es-discuss mailing list