Code points vs Unicode scalar values

Brendan Eich brendan at mozilla.com
Thu Sep 5 14:08:29 PDT 2013


Thanks for the reminders -- we've been over this.

/be

> Norbert Lindenberg <mailto:ecmascript at lindenbergsoftware.com>
> September 5, 2013 12:07 PM
>
> Previous discussion of allowing surrogate code points:
> https://mail.mozilla.org/pipermail/es-discuss/2012-December/thread.html#27057 
>
> https://mail.mozilla.org/pipermail/es-discuss/2013-January/thread.html#28086 
>
> http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/thread.html#29
>
> Essentially, ECMAScript strings are Unicode strings as defined in The 
> Unicode Standard section 2.7, and thus may contain unpaired surrogate 
> code units in their 16-bit form or surrogate code points when 
> interpreted as 32-bit sequences. String.fromCodePoint and 
> String.prototype.codePointAt just convert between 16-bit and 32-bit 
> forms; they're not meant to interpret the code points beyond that, and 
> some processing (such as test cases) may depend on them being 
> preserved. This is different from encoding for communication over 
> networks, where the use of valid UTF-8 or UTF-16 (which cannot contain 
> surrogate code points) is generally required.
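
To make that distinction concrete, here is a minimal sketch (not from the thread). It assumes an engine that implements String.fromCodePoint and codePointAt as drafted, and it uses TextEncoder from the Encoding Standard purely as a stand-in for "encoding for communication over networks"; that choice is an assumption of this illustration.

```javascript
// A lone trail surrogate survives the 16-bit <-> 32-bit conversion unchanged.
const lone = String.fromCodePoint(0xDFFF);
const roundTripped = lone.codePointAt(0); // 0xDFFF

// A supplementary code point becomes a surrogate pair and reads back whole.
const pair = String.fromCodePoint(0x1F600);
const pairUnits = [pair.charCodeAt(0), pair.charCodeAt(1)]; // [0xD83D, 0xDE00]
const pairCodePoint = pair.codePointAt(0); // 0x1F600

// Assumption: TextEncoder plays the role of the network-facing encoder.
// UTF-8 cannot represent a lone surrogate, so U+FFFD (EF BF BD) is substituted.
const utf8 = Array.from(new TextEncoder().encode(lone));
```

The conversion pair preserves the surrogate, while the encoder replaces it, which appears to be the difference the paragraph above is drawing.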
>
> The indexing issue was first discussed in the form "why can't we just 
> use UTF-32"? See
> http://norbertlindenberg.com/2012/05/ecmascript-supplementary-characters/index.html#UTF32 
>
> for pointers to that. It would have been great to use UTF-8, but it's 
> unfortunately not compatible with the past and the DOM.
>
> Adding code point indexing to 16-bit code unit strings would add 
> significant performance overhead. In reality, whether an index is for 
> 16-bit or 32-bit units matters only for some relatively low-level 
> software that needs to process code point by code point. A lot of 
> software deals with complete strings without ever looking inside, or 
> is fine processing code unit by code unit (e.g., 
> String.prototype.indexOf).
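
A rough sketch of why code point indexing costs more on 16-bit strings (not from the thread): without an auxiliary index, finding the n-th code point means scanning from the start. The helper name below is hypothetical, and the sketch assumes the string iterator as it was later standardized in ES2015.

```javascript
// Hypothetical helper (not an ES6 API): look up by code *point* index.
// The scan is linear in the index, unlike a code unit lookup.
function codePointAtCodePointIndex(str, cpIndex) {
  let i = 0;
  for (const ch of str) { // the string iterator steps by code point
    if (i === cpIndex) return ch.codePointAt(0);
    i += 1;
  }
  return undefined;
}

const s = "a\u{1F600}b"; // 3 code points, 4 code units

const byCodePointIndex = codePointAtCodePointIndex(s, 2); // 0x62, 'b'
const byCodeUnitIndex = s.codePointAt(2); // 0xDE00, the trail surrogate
const unitIndexOfB = s.indexOf("b"); // 3, a code unit index
```

Code that only calls indexOf and slices with the result never notices the difference, since both work in code units.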
>
> Norbert
> Brendan Eich <mailto:brendan at mozilla.com>
> September 4, 2013 2:28 PM
>
>
> 8. Let first be the code unit value of the element at index position 
> in the String S.
>
> This does not "[pass] through to charCodeAt()" literally, which would 
> mean a call to S.charCodeAt(position). I thought that's what you meant.
>
> So you want a code point index, not a code unit index. That would not 
> be useful for the lower-level purposes Allen identified. Again it 
> seems you're trying to abstract away from all the details that 
> probably will matter for string hackers using these APIs. But I summon 
> Norbert at this point!
>
> /be
> _______________________________________________
> es-discuss mailing list
> es-discuss at mozilla.org
> https://mail.mozilla.org/listinfo/es-discuss
>
> Anne van Kesteren <mailto:annevk at annevk.nl>
> September 4, 2013 12:51 PM
> On Wed, Sep 4, 2013 at 5:34 PM, Brendan Eich <brendan at mozilla.com> wrote:
>> Because of String.fromCharCode precedent. Balanced names with noun
>> phrases that distinguish the "from" domains are better than
>> longAndPortly vs. tiny.
>
> I kinda liked it as analogue to what exists for Array and because
> developers should probably move away from fromCharCode so the
> precedent does not matter that much.
>
>
>> Sure, but you wanted to reduce "three concepts" and I don't see how
>> to do that. Most developers can ignore UTF-8, for sure.
>
> The three concepts are: 16-bit code units, code points, and Unicode
> scalar values. JavaScript, DOM, etc. deal with 16-bit code units.
> utf-8 et al deal with Unicode scalar values. Nothing, apart from this
> API, does code points at the moment.
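
The three concepts can be seen side by side on one string. This is an illustrative sketch only (not from the thread); it assumes the ES2015 string iterator, which Array.from uses here, and the helper name isScalarValue is made up for the example.

```javascript
// 'a', one supplementary character, and one lone lead surrogate.
const s = "a\u{1F600}\uD800";

// 1. 16-bit code units: what length, charCodeAt and indexOf see.
const codeUnitCount = s.length; // 4

// 2. Code points: surrogate pairs are combined, lone surrogates pass through.
const codePoints = Array.from(s, (ch) => ch.codePointAt(0));
// [0x61, 0x1F600, 0xD800]

// 3. Unicode scalar values: code points excluding U+D800..U+DFFF.
// (Hypothetical helper; input is assumed to already be in 0..0x10FFFF.)
const isScalarValue = (cp) => cp < 0xD800 || cp > 0xDFFF;
const allScalar = codePoints.every(isScalarValue); // false, because of U+D800
```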
>
>
>> Probably I just misunderstood what you meant, and you were simply
>> pointing out that lone surrogates arise only from legacy APIs?
>
> No, they arise from this API.
>
>
>> Here, from the latest ES6 draft, is 15.5.2.3 String.fromCodePoint (
>> ...codePoints):
>>
>> No exposed surrogates here!
>
> Mathias covered this.
>
>
>> Here's the spec for String.prototype.codePointAt:
>>
>> 8. Let first be the code unit value of the element at index position
>> in the String S.
>> 11. If second < 0xDC00 or second > 0xDFFF, then return first.
>>
>> I take it you are objecting to step 11?
>
> And step 8. The indexing is based on code units so you cannot actually
> do indexing easily. You'd need to use the iterator to iterate over a
> string getting only code points out.
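
A small sketch of the indexing point (not from the thread), assuming codePointAt as drafted and the ES2015 string iterator: stepping codePointAt by code unit index also visits trail surrogates, so the caller either advances by the width of each code point or uses the iterator.

```javascript
const s = "\u{1F600}!"; // one supplementary character plus '!'

// Naive loop: every code unit index is visited, including the trail surrogate.
const byCodeUnitIndex = [];
for (let i = 0; i < s.length; i++) {
  byCodeUnitIndex.push(s.codePointAt(i));
}
// [0x1F600, 0xDE00, 0x21]

// Manual loop: skip two code units after a supplementary code point.
const manual = [];
for (let i = 0; i < s.length; ) {
  const cp = s.codePointAt(i);
  manual.push(cp);
  i += cp > 0xFFFF ? 2 : 1;
}
// [0x1F600, 0x21]

// Iterator: yields one string per code point.
const viaIterator = Array.from(s, (ch) => ch.codePointAt(0));
// [0x1F600, 0x21]
```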
>
>
>>> The indexing of codePointAt() is also kind of sad as it just passes
>>> through to charCodeAt(),
>> I don't see that in the spec cited above.
>
> How do you read step 8?
>
>
> Brendan Eich <mailto:brendan at mozilla.com>
> September 4, 2013 9:34 AM
>> Anne van Kesteren <mailto:annevk at annevk.nl>
>> September 4, 2013 9:06 AM
>> On Wed, Sep 4, 2013 at 4:58 PM, Brendan Eich<brendan at mozilla.com> wrote:
>>> String.fromCodePoint, rather.
>>
>> Oops. Any reason this is not just String.from() btw? Give the better
>> method a nice short name?
>
> Because of String.fromCharCode precedent. Balanced names with noun 
> phrases that distinguish the "from" domains are better than 
> longAndPortly vs. tiny.
>
>
>>>> I'm not sure I'm a big fan of having all three concepts around.
>>> You can't avoid it: UTF-8 is a transfer format that can be observed via
>>> serialization.
>>
>> Yes, but it cannot encode lone surrogates. It can only deal in Unicode
>> scalar values.
>
> Sure, but you wanted to reduce "three concepts" and I don't see how to 
> do that. Most developers can ignore UTF-8, for sure.
>
> Probably I just misunderstood what you meant, and you were simply 
> pointing out that lone surrogates arise only from legacy APIs?
>>
>>
>>> String.prototype.charCodeAt and String.fromCharCode are required for
>>> backward compatibility. And ES6 wants to expose code points as well,
>>> so three.
>>
>> Unicode scalar values are code points sans surrogates, i.e. completely
>> compatible with what a utf-8 encoder/decoder pair can handle.
>>
>> Why do you want to expose surrogates?
>
> I'm not sure I do! Sounds scandalous. :-P
>
> Here, from the latest ES6 draft, is 15.5.2.3 String.fromCodePoint ( 
> ...codePoints):
>
> The String.fromCodePoint function may be called with a variable number
> of arguments which form the rest parameter codePoints. The following
> steps are taken:
> 1. Assert: codePoints is a well-formed rest parameter object.
> 2. Let length be the result of Get(codePoints, "length").
> 3. Let elements be a new List.
> 4. Let nextIndex be 0.
> 5. Repeat while nextIndex < length
>    a. Let next be the result of Get(codePoints, ToString(nextIndex)).
>    b. Let nextCP be ToNumber(next).
>    c. ReturnIfAbrupt(nextCP).
>    d. If SameValue(nextCP, ToInteger(nextCP)) is false, then throw a
>       RangeError exception.
>    e. If nextCP < 0 or nextCP > 0x10FFFF, then throw a RangeError exception.
>    f. Append the elements of the UTF-16 Encoding (clause 6) of nextCP to
>       the end of elements.
>    g. Let nextIndex be nextIndex + 1.
> 6. Return the String value whose elements are, in order, the elements
>    in the List elements. If length is 0, the empty string is returned.
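
The draft steps quoted above can be sketched in plain JavaScript roughly as follows. This is an illustration, not a polyfill and not from the thread: the function name is hypothetical, Number() only approximates ToNumber, and Number.isInteger stands in for the SameValue/ToInteger check.

```javascript
// Hypothetical name; follows the quoted draft steps, not any engine's code.
function fromCodePointSketch(...codePoints) {
  let result = "";
  for (const next of codePoints) { // step 5: each argument in order
    const nextCP = Number(next); // 5.b: approximates ToNumber
    // 5.d: reject non-integral values (NaN, 1.5, ...). Infinity is rejected
    // here rather than at 5.e, but the error type is the same.
    if (!Number.isInteger(nextCP)) {
      throw new RangeError("Invalid code point " + String(next));
    }
    if (nextCP < 0 || nextCP > 0x10FFFF) { // 5.e: outside the code point range
      throw new RangeError("Invalid code point " + String(next));
    }
    if (nextCP <= 0xFFFF) {
      // 5.f: one code unit; surrogate code points are passed through as-is.
      result += String.fromCharCode(nextCP);
    } else {
      // 5.f: supplementary code point, encoded as a surrogate pair.
      const offset = nextCP - 0x10000;
      result += String.fromCharCode(0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF));
    }
  }
  return result; // step 6: empty string when there are no arguments
}
```

Nothing in these steps rejects values in U+D800..U+DFFF, so fromCodePointSketch(0xDFFF) produces a one-unit string holding a lone surrogate.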
>
>
> No exposed surrogates here!
>
> Here's the spec for String.prototype.codePointAt:
>
> When the codePointAt method is called with one argument pos, the 
> following steps are taken:
> 1. Let O be CheckObjectCoercible(this value).
> 2. Let S be ToString(O).
> 3. ReturnIfAbrupt(S).
> 4. Let position be ToInteger(pos).
> 5. ReturnIfAbrupt(position).
> 6. Let size be the number of elements in S.
> 7. If position < 0 or position ≥ size, return undefined.
> 8. Let first be the code unit value of the element at index position 
> in the String S.
> 9. If first < 0xD800 or first > 0xDBFF or position+1 = size, then 
> return first.
> 10. Let second be the code unit value of the element at index 
> position+1 in the String S.
> 11. If second < 0xDC00 or second > 0xDFFF, then return first.
> 12. Return ((first – 0xD800) × 1024) + (second – 0xDC00) + 0x10000.
> NOTE The codePointAt function is intentionally generic; it does not
> require that its this value be a String object. Therefore it can be
> transferred to other kinds of objects for use as a method.
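
For comparison, the codePointAt steps quoted above translate fairly directly into JavaScript. This is a sketch with a hypothetical name, written as a standalone function rather than a method (not from the thread); Math.trunc(Number(pos)) approximates ToInteger.

```javascript
// Hypothetical name; a standalone rendering of the quoted draft steps.
function codePointAtSketch(value, pos) {
  if (value === null || value === undefined) { // step 1: CheckObjectCoercible
    throw new TypeError("value is not coercible to a string");
  }
  const S = String(value); // steps 2-3
  let position = Math.trunc(Number(pos)); // steps 4-5: approximates ToInteger
  if (Number.isNaN(position)) position = 0; // ToInteger maps NaN to +0
  const size = S.length; // step 6
  if (position < 0 || position >= size) return undefined; // step 7
  const first = S.charCodeAt(position); // step 8: a code unit index
  if (first < 0xD800 || first > 0xDBFF || position + 1 === size) {
    return first; // step 9: not a lead surrogate, or nothing follows it
  }
  const second = S.charCodeAt(position + 1); // step 10
  if (second < 0xDC00 || second > 0xDFFF) {
    return first; // step 11: unpaired lead surrogate is returned as-is
  }
  return (first - 0xD800) * 1024 + (second - 0xDC00) + 0x10000; // step 12
}
```

Steps 9 and 11 are where an unpaired surrogate comes back unchanged, and step 8 is where the index is interpreted in code units.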
>
>
> I take it you are objecting to step 11?
>
>
>>> Sorry, I missed this: how else (other than the charCodeAt/fromCharCode
>>> legacy) are lone surrogates exposed?
>>
>> "\udfff".codePointAt(0) == "\udfff"
>>
>> It seems better if that returns "\ufffd", as you'd get with utf-8
>> (assuming it accepts code points as input rather than just Unicode
>> scalar values, in which case it'd throw).
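
A sketch of the alternative being suggested here (not from the thread, and no such method was standardized under this or any name used in the thread). Note that codePointAt returns a number, so the comparison quoted above reads best as shorthand for "the lone surrogate comes back unchanged". The function name scalarValueAt is hypothetical.

```javascript
// Behaviour of the drafted API: the lone surrogate is returned as a number.
const asDrafted = "\uDFFF".codePointAt(0); // 0xDFFF

// Hypothetical variant: map any surrogate code point to U+FFFD, so callers
// only ever see Unicode scalar values.
function scalarValueAt(str, pos) {
  const cp = str.codePointAt(pos);
  if (cp === undefined) return undefined;
  return cp >= 0xD800 && cp <= 0xDFFF ? 0xFFFD : cp;
}

const replaced = scalarValueAt("\uDFFF", 0); // 0xFFFD
const intact = scalarValueAt("\uD83D\uDE00", 0); // 0x1F600

// Because pos is still a code unit index, pointing into the middle of a
// valid pair also yields U+FFFD in this sketch.
const midPair = scalarValueAt("\uD83D\uDE00", 1); // 0xFFFD
```

The last line shows one wrinkle such a design would need to settle: with code unit indexing, the second half of a well-formed pair is indistinguishable from a lone surrogate.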
>
> Maybe. Allen and Norbert should weigh in.
>>
>> The indexing of codePointAt() is also kind of sad as it just passes
>> through to charCodeAt(),
>
> I don't see that in the spec cited above.
>
> /be
>
>> which means for any serious usage you need to
>> use the iterator anyway. What's the reason codePointAt() exists?
>>
>>
>> Brendan Eich <mailto:brendan at mozilla.com>
>> September 4, 2013 8:58 AM
>>
>>
>>> Anne van Kesteren <mailto:annevk at annevk.nl>
>>> September 4, 2013 7:48 AM
>>> ES6 introduces String.prototype.codePointAt() and
>>> String.codePointFrom()
>>
>> String.fromCodePoint, rather.
>>
>>> as well as an iterator (not defined). It struck
>>> me this is the only place in the platform where we'd expose code point
>>> as a concept to developers.
>>>
>>> Nowadays strings are either 16-bit code units (JavaScript, DOM, etc.)
>>> or Unicode scalar values (anytime you hit the network and use utf-8).
>>>
>>> I'm not sure I'm a big fan of having all three concepts around.
>>
>> You can't avoid it: UTF-8 is a transfer format that can be observed 
>> via serialization. String.prototype.charCodeAt and 
>> String.fromCharCode are required for backward compatibility. And ES6 
>> wants to expose code points as well, so three.
>>
>>> We
>>> could have String.prototype.unicodeAt() and String.unicodeFrom()
>>> instead, and have them translate lone surrogates into U+FFFD. Lone
>>> surrogates are a bug and I don't see a reason to expose them in more
>>> places than just the 16-bit code units.
>>
>> Sorry, I missed this: how else (other than the 
>> charCodeAt/fromCharCode legacy) are lone surrogates exposed?
>>
>> /be
>>>
>>>
> Anne van Kesteren <mailto:annevk at annevk.nl>
> September 4, 2013 9:06 AM
> On Wed, Sep 4, 2013 at 4:58 PM, Brendan Eich<brendan at mozilla.com> wrote:
>> String.fromCodePoint, rather.
>
> Oops. Any reason this is not just String.from() btw? Give the better
> method a nice short name?
>
>
>>> I'm not sure I'm a big fan of having all three concepts around.
>> You can't avoid it: UTF-8 is a transfer format that can be observed via
>> serialization.
>
> Yes, but it cannot encode lone surrogates. It can only deal in Unicode
> scalar values.
>
>
>> String.prototype.charCodeAt and String.fromCharCode are required for
>> backward compatibility. And ES6 wants to expose code points as well,
>> so three.
>
> Unicode scalar values are code points sans surrogates, i.e. completely
> compatible with what a utf-8 encoder/decoder pair can handle.
>
> Why do you want to expose surrogates?
>
>
>> Sorry, I missed this: how else (other than the charCodeAt/fromCharCode
>> legacy) are lone surrogates exposed?
>
> "\udfff".codePointAt(0) == "\udfff"
>
> It seems better if that returns "\ufffd", as you'd get with utf-8
> (assuming it accepts code points as input rather than just Unicode
> scalar values, in which case it'd throw).
>
> The indexing of codePointAt() is also kind of sad as it just passes
> through to charCodeAt(), which means for any serious usage you need to
> use the iterator anyway. What's the reason codePointAt() exists?
>
>

