String.fromCodePoint and surrogate pairs?
Mark Davis ☕
mark at macchiato.com
Mon Jan 14 17:21:43 PST 2013
There was a long discussion of this on the Unicode list recently. A
surrogate code point is not illegal Unicode. It is illegal *in* a UTF
string, but is not illegal in a Unicode String.
I don't want to repeat that whole long discussion here.
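
To make the distinction concrete, here is a minimal JavaScript sketch. It assumes an engine that provides String.fromCodePoint as proposed and the TextEncoder interface from the WHATWG Encoding Standard; the helper name isWellFormedUtf16 is hypothetical and only serves the illustration. The string holding a lone surrogate exists and can be inspected, but it is not well-formed UTF-16, and converting it to UTF-8 cannot preserve the surrogate.

```javascript
// Hypothetical helper (not a built-in): true if every surrogate code unit in
// `s` belongs to a proper lead+trail pair, i.e. `s` is well-formed UTF-16.
function isWellFormedUtf16(s) {
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // Lead surrogate: must be followed by a trail surrogate.
      const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next < 0xDC00 || next > 0xDFFF) return false;
      i++; // skip the trail surrogate that was just validated
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      // Trail surrogate without a preceding lead surrogate.
      return false;
    }
  }
  return true;
}

// Assumption: String.fromCodePoint accepts surrogate code points.
const lone = String.fromCodePoint(0xD800);
const pair = String.fromCodePoint(0x1F604);

// UTF-8 cannot carry a surrogate; TextEncoder substitutes U+FFFD (EF BF BD).
const utf8OfLone = Array.from(new TextEncoder().encode(lone));

console.log(isWellFormedUtf16(lone), isWellFormedUtf16(pair), utf8OfLone);
```

The first result is false and the second is true; the encoded bytes are those of the replacement character rather than of the surrogate.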
*— Il meglio è l’inimico del bene —*
On Mon, Jan 14, 2013 at 4:52 PM, Shawn Steele <Shawn.Steele at microsoft.com> wrote:
> It doesn't make sense and is illegal Unicode, i.e., it's corrupt data. So
> the only reason to accept it is to allow corrupt data, perhaps as a way of
> faking other non-Unicode data in a Unicode context. That inevitably leads
> to problems, particularly on the web, where developers do whatever sneaky
> things they think will work.
> Assuming a use case for illegal unicode were ever found, it could be added
> -----Original Message-----
> From: Norbert Lindenberg [mailto:ecmascript at norbertlindenberg.com]
> Sent: Monday, January 14, 2013 4:35 PM
> To: Shawn Steele
> Cc: Norbert Lindenberg; Erik Arvidsson; es-discuss at mozilla.org
> Subject: Re: String.fromCodePoint and surrogate pairs?
> I don't have a good scenario at hand either that would require support for
> surrogate code points, but in ECMAScript the question is often asked the
> other way around: Why reject it? And given that there are several ways
> already to construct strings that are ill-formed UTF-16 (e.g., "\uD800",
> String.fromCharCode(0xD800)), it's not clear why this particular path
> should be blocked.
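
A short JavaScript sketch of the paths mentioned above. It assumes an engine where String.fromCodePoint accepts surrogate code points, as in the proposal under discussion; the variable names are illustrative only.

```javascript
// Three ways to build the same one-unit string holding a lone lead surrogate.
const fromEscape = "\uD800";
const fromCharCode = String.fromCharCode(0xD800);
// Assumption: the proposed String.fromCodePoint does not reject surrogates.
const fromCodePoint = String.fromCodePoint(0xD800);

// All three strings are identical: one code unit with the value 0xD800.
console.log(fromEscape === fromCharCode && fromCharCode === fromCodePoint);
console.log(fromCodePoint.length, fromCodePoint.charCodeAt(0).toString(16));
```

Rejecting the value in String.fromCodePoint alone would therefore not keep ill-formed strings out of a program.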
> (Sorry for letting this sit in my outbox for such a long time.)
> On Dec 12, 2012, at 15:17 , Shawn Steele wrote:
> > I was looking at D75 of 3.8 "Surrogates"
> > My point is that there's no "legal" scenario for converting what is basically a
> UTF-32 input to an isolated surrogate. No valid Unicode string could
> contain that. So why support it?
> > -Shawn
> > -----Original Message-----
> > From: Norbert Lindenberg [mailto:ecmascript at norbertlindenberg.com]
> > Sent: Wednesday, December 12, 2012 2:40 PM
> > To: Shawn Steele
> > Cc: Norbert Lindenberg; Erik Arvidsson; es-discuss at mozilla.org
> > Subject: Re: String.fromCodePoint and surrogate pairs?
> > The Unicode standard defines "code point" as any value in the range of
> integers from 0 to 0x10FFFF - see definitions D9 and D10 of chapter 3 [1].
> > Once you exclude surrogate code points, you have Unicode scalar values
> (definition D76), so you're basically proposing a String.fromScalarValue
> function. But then, why not also exclude code points that Unicode has
> defined as non-characters (chapter 16.7 [2])? It seems we're getting into
> policy-setting here, and so far ECMAScript has avoided setting policy for
> how you can use strings.
> > Norbert
> > [1] http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf
> > [2] http://www.unicode.org/versions/Unicode6.2.0/ch16.pdf
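
For illustration, the String.fromScalarValue idea mentioned above could be sketched as follows. This is a hypothetical function, not part of any ECMAScript draft; the name, the error type, and the messages are assumptions made for the example. It delegates to String.fromCodePoint after rejecting surrogate code points.

```javascript
// Hypothetical function, not part of any ECMAScript draft: behaves like
// String.fromCodePoint but only accepts Unicode scalar values (D76).
function fromScalarValue(...values) {
  for (const v of values) {
    if (!Number.isInteger(v) || v < 0 || v > 0x10FFFF) {
      throw new RangeError("Not a code point: " + v);
    }
    if (v >= 0xD800 && v <= 0xDFFF) {
      // Surrogate code points are code points but not scalar values.
      throw new RangeError("Surrogate code point: U+" + v.toString(16).toUpperCase());
    }
  }
  return String.fromCodePoint(...values);
}

const smiley = fromScalarValue(0x1F604); // accepted: a scalar value

let rejected = false;
try {
  fromScalarValue(0xD83D, 0xDE04); // rejected: two surrogate code points
} catch (e) {
  rejected = e instanceof RangeError;
}

// Noncharacters such as U+FFFF are still scalar values and pass this check.
const nonchar = fromScalarValue(0xFFFF);

console.log(smiley.length, rejected, nonchar.length);
```

Note that the sketch still lets noncharacters through, which is exactly the policy question raised above: once one class of code points is excluded, it is not obvious where to stop.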
> > On Dec 12, 2012, at 13:55 , Shawn Steele wrote:
> >> IMO String.fromCodePoint should disallow U+D800-U+DFFF.
> >> fromCharCode already accepts those values, and according to The
> Unicode Standard, isolated surrogates have no meaning on their own; it goes
> on to compare them to illegal UTF-8 sequences. IMO "CodePoint" is a 21-bit
> Unicode code point, and explicitly not UTF-16, so it shouldn't confuse
> things by allowing UTF-16 (or other encoding) forms.
> >> -Shawn
> >> -----Original Message-----
> >> From: es-discuss-bounces at mozilla.org
> >> [mailto:es-discuss-bounces at mozilla.org] On Behalf Of Norbert
> >> Lindenberg
> >> Sent: Wednesday, December 12, 2012 1:25 PM
> >> To: Erik Arvidsson
> >> Cc: es-discuss at mozilla.org
> >> Subject: Re: String.fromCodePoint and surrogate pairs?
> >> Do you know what the people who talked to you mean by "aware of UTF-16
> code units"?
> >> As specified, String.fromCodePoint accepts all UTF-16 code units
> because they use a subset of the integers allowed as code points (0 to
> 0xFFFF versus 0 to 0x10FFFF). For non-surrogate values, you get exactly
> what you expect. Surrogate values are interpreted as surrogate code points,
> which are valid code points in Unicode (their use makes a string ill-formed
> in Unicode terminology, but the proposed ECMAScript spec ignores issues of
> well-formedness for compatibility with ES5). Since in conversion to UTF-16
> a surrogate code point just becomes the corresponding code unit, it can
> happen that two surrogate code points (an ill-formed sequence) become a
> well-formed surrogate pair:
> >> String.fromCodePoint(0xD83D, 0xDE04) => "\uD83D\uDE04" =
> >> "😄".
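
The arithmetic behind that example can be checked with a small JavaScript sketch. The helper name toSurrogatePair is hypothetical, and the sketch assumes String.fromCodePoint passes surrogate code points through as the corresponding code units, as described above.

```javascript
// Hypothetical helper: split a supplementary code point (U+10000..U+10FFFF)
// into its UTF-16 lead and trail surrogates.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;      // 20-bit value
  const lead = 0xD800 + (offset >> 10);    // top 10 bits
  const trail = 0xDC00 + (offset & 0x3FF); // bottom 10 bits
  return [lead, trail];
}

const [lead, trail] = toSurrogatePair(0x1F604);

// Two surrogate code points in, one well-formed surrogate pair out.
const viaSurrogates = String.fromCodePoint(lead, trail);
const viaScalar = String.fromCodePoint(0x1F604);

console.log(lead.toString(16), trail.toString(16), viaSurrogates === viaScalar);
```

Both calls produce the same two-code-unit string, so the result cannot reveal whether the caller passed one supplementary code point or two surrogate code points.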
> >> The story for UTF-8 is very different: Of course all UTF-8 code units
> would be accepted by String.fromCodePoint, but they would turn into a
> completely different character sequence. E.g., the UTF-8 byte sequence for
> "😄" is F0 9F 98 84, but:
> >> String.fromCodePoint(0xF0, 0x9F, 0x98, 0x84) =>
> "\u00F0\u009F\u0098\u0084" = "ð\u009F\u0098\u0084" (the last three are
> control characters).
> >> Handling UTF-8 would require a way to identify the character encoding
> to convert from, which indicates the beginning of an encoding conversion
> API, and the internationalization ad-hoc group decided not to work on one within
> ECMAScript. There is an API being defined as part of the Encoding Standard
> project at WHATWG.
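
The difference can be sketched in JavaScript as follows. The example assumes that the TextDecoder interface from the WHATWG Encoding Standard is available as a global, which is an assumption about the runtime and not something ECMAScript itself provides; the variable names are illustrative.

```javascript
const utf8Bytes = [0xF0, 0x9F, 0x98, 0x84]; // UTF-8 encoding of U+1F604

// Treating each byte as a code point yields four separate characters:
// U+00F0 followed by three C1 control characters.
const misread = String.fromCodePoint(...utf8Bytes);

// Decoding with an explicitly named encoding yields the intended character.
// Assumption: TextDecoder is provided by the host environment.
const decoded = new TextDecoder("utf-8").decode(new Uint8Array(utf8Bytes));

console.log(misread.length, decoded.length, decoded.codePointAt(0).toString(16));
```

The decoder has to be told which encoding the bytes are in, which is the step String.fromCodePoint has no way to express.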
> >> Norbert
> >> On Dec 12, 2012, at 7:46 , Erik Arvidsson wrote:
> >>> It was suggested to me that we could probably extend
> String.fromCodePoint to be aware of UTF-16 code units too. It seems doable
> since the lead surrogate is not a valid code point.
> >>> The question is whether it is worth it. It seems like we are going down a
> slippery slope if we start to do things like this. Should we also handle
> UTF-8 code units? Maybe it is better not to do this and instead try to get people
> to move away from UTF-16 code units and towards code points.
> >>> --
> >>> erik
> >> _______________________________________________
> >> es-discuss mailing list
> >> es-discuss at mozilla.org
> >> https://mail.mozilla.org/listinfo/es-discuss