Code points vs Unicode scalar values

Brendan Eich brendan at mozilla.com
Wed Sep 11 11:51:10 PDT 2013


> Anne van Kesteren <mailto:annevk at annevk.nl>
> September 11, 2013 3:43 AM
>
> It's not clear the arguments were carefully considered though. Shawn
> Steele raised the same concerns I did. The unicode.org thread also
> suggests that the ideal value space for a string is Unicode scalar
> values (i.e. what utf-8 can do) and not code points. It did indeed
> indicate they have code points because of legacy, but JavaScript has
> 16-bit code units due to legacy. If we're going to offer a higher
> level of abstraction over the basic string type, we can very well make
> that a utf-8 safe layer.
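
For readers following along, the distinction drawn above can be sketched in JavaScript roughly as follows. The helper names (isCodePoint, isScalarValue, isScalarValueString) are hypothetical and appear in no ES6 draft; this only illustrates the two value spaces under discussion, assuming an engine that implements the ES6 string iterator.

```javascript
// Hypothetical helpers, for illustration only; not part of any ES6 draft.

// A code point is any integer in the range 0..0x10FFFF.
function isCodePoint(n) {
  return Number.isInteger(n) && n >= 0 && n <= 0x10FFFF;
}

// A Unicode scalar value is a code point outside the surrogate range
// 0xD800..0xDFFF. These are the values that UTF-8 can encode.
function isScalarValue(n) {
  return isCodePoint(n) && (n < 0xD800 || n > 0xDFFF);
}

// True if every code point in s is a scalar value, that is, if s
// contains no unpaired surrogate code units.
function isScalarValueString(s) {
  for (const ch of s) { // the string iterator yields one code point at a time
    if (!isScalarValue(ch.codePointAt(0))) return false;
  }
  return true;
}
```

A "utf-8 safe layer" in the sense above would presumably only ever hand out values for which isScalarValue holds, whereas the code point APIs can also return values in the surrogate range.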

You could be right, but this is a deep topic, not sorted out by 
programming language developers, in my view. It came up recently here:

http://www.haskell.org/pipermail/haskell-cafe/2013-September/108654.html

That thread continues. The point about C winning because it doesn't have 
an abstract String type, only char[], is persuasive in my view. Yes, it's 
low level and you have to cope with multiple encodings, but any attempt 
at a more abstract view would have made a badly leaky abstraction, which 
would have been more of a boat anchor.

/be
>
> If you need anything for tests, you can just ignore the higher level
> of abstraction and operate on 16-bit code units instead.
>
>
> Brendan Eich <mailto:brendan at mozilla.com>
> September 5, 2013 2:08 PM
> Thanks for the reminders -- we've been over this.
>
> /be
>
> _______________________________________________
> es-discuss mailing list
> es-discuss at mozilla.org
> https://mail.mozilla.org/listinfo/es-discuss
> Norbert Lindenberg <mailto:ecmascript at lindenbergsoftware.com>
> September 5, 2013 12:07 PM
>
> Previous discussion of allowing surrogate code points:
> https://mail.mozilla.org/pipermail/es-discuss/2012-December/thread.html#27057
> https://mail.mozilla.org/pipermail/es-discuss/2013-January/thread.html#28086
> http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/thread.html#29
>
> Essentially, ECMAScript strings are Unicode strings as defined in The 
> Unicode Standard section 2.7, and thus may contain unpaired surrogate 
> code units in their 16-bit form or surrogate code points when 
> interpreted as 32-bit sequences. String.fromCodePoint and 
> String.prototype.codePointAt just convert between 16-bit and 32-bit 
> forms; they're not meant to interpret the code points beyond that, and 
> some processing (such as test cases) may depend on them being 
> preserved. This is different from encoding for communication over 
> networks, where the use of valid UTF-8 or UTF-16 (which cannot contain 
> surrogate code points) is generally required.
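
The conversion behaviour described above can be observed with a short sketch like the following, assuming an engine that implements String.fromCodePoint and String.prototype.codePointAt. The variable names are arbitrary.

```javascript
// A surrogate code point survives the 32-bit -> 16-bit -> 32-bit round trip.
const lone = String.fromCodePoint(0xD800);
const loneLength = lone.length;        // number of 16-bit code units
const loneBack = lone.codePointAt(0);  // read back as a code point

// A supplementary code point is stored as a surrogate pair and read back whole.
const astral = String.fromCodePoint(0x1F4A9);
const astralUnits = [astral.charCodeAt(0), astral.charCodeAt(1)];
const astralBack = astral.codePointAt(0);

console.log(loneLength, loneBack.toString(16));
console.log(astralUnits.map((u) => u.toString(16)), astralBack.toString(16));
```

Nothing here validates or replaces the lone surrogate; the two methods only convert between the 16-bit and 32-bit views of the same string.
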
>
> The indexing issue was first discussed in the form "why can't we just 
> use UTF-32"? See
> http://norbertlindenberg.com/2012/05/ecmascript-supplementary-characters/index.html#UTF32
> for pointers to that. It would have been great to use UTF-8, but it's 
> unfortunately not compatible with the past and the DOM.
>
> Adding code point indexing to 16-bit code unit strings would add 
> significant performance overhead. In reality, whether an index is for 
> 16-bit or 32-bit units matters only for some relatively low-level 
> software that needs to process code point by code point. A lot of 
> software deals with complete strings without ever looking inside, or 
> is fine processing code unit by code unit (e.g., 
> String.prototype.indexOf).
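
To make the cost concrete, here is a minimal sketch of what code point indexing on top of a 16-bit code unit string might look like. The helper codePointAtCodePointIndex is hypothetical; the point is only that it has to walk the string from the start, while code unit indexing is a direct lookup.

```javascript
// Hypothetical helper: returns the code point at code point index n,
// or undefined if s has fewer than n + 1 code points. It walks the
// string from the beginning, so the cost grows with n.
function codePointAtCodePointIndex(s, n) {
  let i = 0;
  for (const ch of s) { // the string iterator yields one code point at a time
    if (i === n) return ch.codePointAt(0);
    i++;
  }
  return undefined;
}

const sample = "a\uD83D\uDCA9b"; // "a", U+1F4A9 as a surrogate pair, "b"

const byCodePointIndex = codePointAtCodePointIndex(sample, 2); // the "b"
const byCodeUnitIndex = sample.codePointAt(2); // trailing half of the pair
const unitIndexOfB = sample.indexOf("b");      // indexOf counts code units
```

Code that treats the string as a whole, or that searches with indexOf, never needs the walking version.
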
>
> Norbert
> Brendan Eich <mailto:brendan at mozilla.com>
> September 4, 2013 2:28 PM
>
>
> 8. Let first be the code unit value of the element at index position 
> in the String S.
>
> This does not "[pass] through to charCodeAt()" literally, which would 
> mean a call to S.charCodeAt(position). I thought that's what you meant.
>
> So you want a code point index, not a code unit index. That would not 
> be useful for the lower-level purposes Allen identified. Again it 
> seems you're trying to abstract away from all the details that 
> probably will matter for string hackers using these APIs. But I summon 
> Norbert at this point!
>
> /be
>
> Anne van Kesteren <mailto:annevk at annevk.nl>
> September 4, 2013 12:51 PM
> On Wed, Sep 4, 2013 at 5:34 PM, Brendan Eich <brendan at mozilla.com> wrote:
>> Because of String.fromCharCode precedent. Balanced names with noun phrases
>> that distinguish the "from" domains are better than longAndPortly vs. tiny.
>
> I kinda liked it as an analogue to what exists for Array and because
> developers should probably move away from fromCharCode so the
> precedent does not matter that much.
>
>
>> Sure, but you wanted to reduce "three concepts" and I don't see how to do
>> that. Most developers can ignore UTF-8, for sure.
>
> The three concepts are: 16-bit code units, code points, and Unicode
> scalar values. JavaScript, DOM, etc. deal with 16-bit code units.
> utf-8 et al deal with Unicode scalar values. Nothing, apart from this
> API, does code points at the moment.
>
>
>> Probably I just misunderstood what you meant, and you were simply pointing
>> out that lone surrogates arise only from legacy APIs?
>
> No, they arise from this API.
>
>
>> Here, from the latest ES6 draft, is 15.5.2.3 String.fromCodePoint (
>> ...codePoints):
>>
>> No exposed surrogates here!
>
> Mathias covered this.
>
>
>> Here's the spec for String.prototype.codePointAt:
>>
>> 8. Let first be the code unit value of the element at index position in the
>> String S.
>> 11. If second < 0xDC00 or second > 0xDFFF, then return first.
>>
>> I take it you are objecting to step 11?
>
> And step 8. The indexing is based on code units so you cannot actually
> do indexing easily. You'd need to use the iterator to iterate over a
> string getting only code points out.
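
The two steps quoted above can be placed in context with a rough sketch of the lookup. Only steps 8 and 11 come from the draft text cited in this thread; the remaining checks and the function name codePointAtSketch are assumptions filled in for illustration, not the draft's wording.

```javascript
// Rough sketch of a codePointAt-style lookup. Only the lines marked
// "step 8" and "step 11" mirror the draft text quoted in this thread;
// everything else is an assumption made for illustration.
function codePointAtSketch(S, position) {
  const size = S.length;
  if (position < 0 || position >= size) return undefined;

  // step 8: position is a code unit index, not a code point index
  const first = S.charCodeAt(position);

  // Not a leading surrogate, or nothing follows it: return the unit as is.
  if (first < 0xD800 || first > 0xDBFF || position + 1 === size) return first;

  const second = S.charCodeAt(position + 1);

  // step 11: no trailing surrogate follows, so the lone surrogate is returned
  if (second < 0xDC00 || second > 0xDFFF) return first;

  // Combine the surrogate pair into one supplementary code point.
  return (first - 0xD800) * 0x400 + (second - 0xDC00) + 0x10000;
}
```

Both objections are visible here: the index counts code units (step 8), and a lone surrogate comes back unchanged as a code point (step 11).
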
>
>
>>> The indexing of codePointAt() is also kind of sad as it just passes
>>> through to charCodeAt(),
>> I don't see that in the spec cited above.
>
> How do you read step 8?
>
>

