String.fromCodePoint and surrogate pairs?

Norbert Lindenberg ecmascript at norbertlindenberg.com
Mon Jan 14 16:34:54 PST 2013


I don't have a good scenario at hand either that would require support for surrogate code points, but in ECMAScript the question is often asked the other way around: Why reject it? And given that there are already several ways to construct strings that are ill-formed UTF-16 (e.g., "\uD800", String.fromCharCode(0xD800)), it's not clear why this particular path should be blocked.
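
As a minimal sketch of the point above, the snippet below builds the same lone lead surrogate three ways. It assumes an engine that implements String.fromCodePoint as it was later standardized (accepting surrogate code points); the variable names are only illustrative.

```javascript
// Three ways to build a string that holds a single lone lead surrogate.
const viaEscape = "\uD800";
const viaCharCode = String.fromCharCode(0xD800);
// Assumption: String.fromCodePoint is available and accepts surrogate values.
const viaCodePoint = String.fromCodePoint(0xD800);

// All three produce the same one-code-unit, ill-formed UTF-16 string.
const allSame = viaEscape === viaCharCode && viaCharCode === viaCodePoint;
const loneUnit = viaCodePoint.charCodeAt(0); // the code unit 0xD800
console.log(allSame, viaCodePoint.length, loneUnit.toString(16));
```

Blocking only the third path would leave the first two untouched, which is the asymmetry the paragraph above questions.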

(Sorry for letting this sit in my outbox for such a long time.)

Norbert


On Dec 12, 2012, at 15:17 , Shawn Steele wrote:

> I was looking at D75 of 3.8 "Surrogates"
> 
> My point is that there's no "legal" scenario for converting what is basically a UTF-32 input to an isolated surrogate.  No valid Unicode string could contain that.  So why support it?
> 
> -Shawn
> 
> -----Original Message-----
> From: Norbert Lindenberg [mailto:ecmascript at norbertlindenberg.com] 
> Sent: Wednesday, December 12, 2012 2:40 PM
> To: Shawn Steele
> Cc: Norbert Lindenberg; Erik Arvidsson; es-discuss at mozilla.org
> Subject: Re: String.fromCodePoint and surrogate pairs?
> 
> The Unicode standard defines "code point" as any value in the range of integers from 0 to 0x10FFFF - see definitions D9 and D10 of chapter 3 [1].
> 
> Once you exclude surrogate code points, you have Unicode scalar values (definition D76), so you're basically proposing a String.fromScalarValue function. But then, why not also exclude code points that Unicode has defined as noncharacters (section 16.7 [2])? It seems we're getting into policy-setting here, and so far ECMAScript has avoided setting policy for how you can use strings.
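
To make the distinction concrete, here is a hedged sketch of what a scalar-value-only function might look like. fromScalarValue is a hypothetical helper written for illustration; it is not part of ECMAScript or of any proposal in this thread. It assumes String.fromCodePoint is available for the actual conversion.

```javascript
// Hypothetical helper: accepts only Unicode scalar values, i.e. code points
// in 0..0x10FFFF excluding the surrogate range 0xD800..0xDFFF.
function fromScalarValue(...values) {
  for (const v of values) {
    if (!Number.isInteger(v) || v < 0 || v > 0x10FFFF) {
      throw new RangeError("Not a code point: " + v);
    }
    if (v >= 0xD800 && v <= 0xDFFF) {
      throw new RangeError("Surrogate code point: 0x" + v.toString(16));
    }
  }
  // Every value is a scalar value here, so the result is well-formed UTF-16.
  return String.fromCodePoint(...values);
}

const smiley = fromScalarValue(0x1F604); // the character U+1F604
let rejectedSurrogate = false;
try {
  fromScalarValue(0xD800);
} catch (e) {
  rejectedSurrogate = e instanceof RangeError;
}
```

Note that this sketch still accepts noncharacters such as 0xFFFE; excluding those as well would be a further policy decision of the kind the paragraph above describes.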
> 
> Norbert
> 
> [1] http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf
> [2] http://www.unicode.org/versions/Unicode6.2.0/ch16.pdf
> 
> 
> On Dec 12, 2012, at 13:55 , Shawn Steele wrote:
> 
>> IMO String.fromCodePoint should disallow U+D800-U+DFFF.
>> 
>> There's already fromCharCode that does that, and according to The Unicode Standard, isolated surrogates have no meaning on their own; it goes on to compare them to illegal UTF-8 sequences.  IMO a "CodePoint" is a 21-bit Unicode code point, and explicitly not UTF-16, so it shouldn't confuse things by allowing UTF-16 (or other encoding) forms.
>> 
>> -Shawn
>> 
>> -----Original Message-----
>> From: es-discuss-bounces at mozilla.org 
>> [mailto:es-discuss-bounces at mozilla.org] On Behalf Of Norbert 
>> Lindenberg
>> Sent: Wednesday, December 12, 2012 1:25 PM
>> To: Erik Arvidsson
>> Cc: es-discuss at mozilla.org
>> Subject: Re: String.fromCodePoint and surrogate pairs?
>> 
>> Do you know what the people who talked to you mean by "aware of UTF-16 code units"?
>> 
>> As specified, String.fromCodePoint accepts all UTF-16 code units because they use a subset of the integers allowed as code points (0 to 0xFFFF versus 0 to 0x10FFFF). For non-surrogate values, you get exactly what you expect. Surrogate values are interpreted as surrogate code points, which are valid code points in Unicode (their use makes a string ill-formed in Unicode terminology, but the proposed ECMAScript spec ignores issues of well-formedness for compatibility with ES5). Since in conversion to UTF-16 a surrogate code point just becomes the corresponding code unit, it can happen that two surrogate code points (an ill-formed sequence) become a well-formed surrogate pair:
>> String.fromCodePoint(0xD83D, 0xDE04) => "\uD83D\uDE04" = "😄".
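
The behavior described above can be checked with a short sketch. It assumes an engine with String.fromCodePoint and String.prototype.codePointAt as later standardized; the variable names are illustrative.

```javascript
// Two surrogate code points, passed separately, each become one code unit.
const fromSurrogates = String.fromCodePoint(0xD83D, 0xDE04);
// The supplementary code point U+1F604, passed directly.
const fromSupplementary = String.fromCodePoint(0x1F604);

// Both calls yield the same two-code-unit string, a well-formed surrogate pair.
const samePair = fromSurrogates === fromSupplementary;
const pairLength = fromSurrogates.length;           // counts code units
const pairCodePoint = fromSurrogates.codePointAt(0); // reads the pair as one code point
console.log(samePair, pairLength, pairCodePoint.toString(16));
```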
>> 
>> The story for UTF-8 is very different: Of course all UTF-8 code units would be accepted by String.fromCodePoint, but they would turn into a completely different character sequence. E.g., the UTF-8 byte sequence for 😄:
>> String.fromCodePoint(0xF0, 0x9F, 0x98, 0x84) => "\u00F0\u009F\u0098\u0084" = "ð\u009F\u0098\u0084" (the last three are control characters).
>> 
>> Handling UTF-8 would require a way to identify the character encoding to convert from, which indicates the beginning of an encoding conversion API, and the internationalization ad-hoc group decided not to work on one within ECMAScript. There is an API being defined as part of the Encoding Standard project at WHATWG.
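
For comparison, the sketch below contrasts passing UTF-8 bytes to String.fromCodePoint with decoding them explicitly. It assumes a runtime that provides TextDecoder, the interface from the WHATWG Encoding Standard as it eventually shipped in browsers and Node.js; that interface postdates this message, so treat the second half as an illustration rather than what was available at the time.

```javascript
// The UTF-8 byte sequence for U+1F604.
const utf8Bytes = [0xF0, 0x9F, 0x98, 0x84];

// Treating each byte as a code point yields four unrelated characters.
const misread = String.fromCodePoint(...utf8Bytes);

// Naming the encoding explicitly decodes the bytes to the intended character.
// Assumption: a global TextDecoder is available.
const decoded = new TextDecoder("utf-8").decode(new Uint8Array(utf8Bytes));

console.log(misread.length, decoded.length, decoded.codePointAt(0).toString(16));
```

The explicit "utf-8" label is the encoding identification that the paragraph above says such an API would need.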
>> 
>> Norbert
>> 
>> 
>> On Dec 12, 2012, at 7:46 , Erik Arvidsson wrote:
>> 
>>> It was suggested to me that we could probably extend String.fromCodePoint to be aware of UTF-16 code units too. It seems doable since the lead surrogate is not a valid code point.
>>> 
>>> The question is whether it is worth it. It seems like we are going down a slippery slope if we start to do things like this. Should we also handle UTF-8 code units? Maybe it is better not to do this, and instead try to get people to move away from UTF-16 code units and towards code points.
>>> 
>>> --
>>> erik
>> 