String.fromCodePoint and surrogate pairs?

Norbert Lindenberg ecmascript at norbertlindenberg.com
Wed Dec 12 13:25:16 PST 2012


Do you know what the people who talked to you mean by "aware of UTF-16 code units"?

As specified, String.fromCodePoint accepts all UTF-16 code units, because their values are a subset of the integers allowed as code points (0 to 0xFFFF versus 0 to 0x10FFFF). For non-surrogate values, you get exactly what you expect. Surrogate values are interpreted as surrogate code points, which are valid code points in Unicode (their use makes a string ill-formed in Unicode terminology, but the proposed ECMAScript spec ignores issues of well-formedness for compatibility with ES5). Since a surrogate code point simply becomes the corresponding code unit in conversion to UTF-16, two surrogate code points (an ill-formed sequence) can end up forming a well-formed surrogate pair:
String.fromCodePoint(0xD83D, 0xDE04) =>
"\uD83D\uDE04" =
"😄".
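
A minimal sketch of the same point, assuming an engine that implements String.fromCodePoint and String.prototype.codePointAt as proposed (the variable names are only illustrative). Passing the two surrogates separately should give the same string as passing the supplementary code point U+1F604 directly:

```javascript
// Two surrogate code points, passed as separate arguments.
const fromSurrogates = String.fromCodePoint(0xD83D, 0xDE04);

// The supplementary code point U+1F604, passed directly.
const fromScalar = String.fromCodePoint(0x1F604);

// Both produce the same two UTF-16 code units.
console.log(fromSurrogates === fromScalar);                  // true
console.log(fromSurrogates.length);                          // 2
console.log(fromSurrogates.codePointAt(0).toString(16));     // "1f604"
```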

The story for UTF-8 is very different. All UTF-8 code units would of course be accepted by String.fromCodePoint, but each one would be treated as a code point in its own right, so they would turn into a completely different character sequence. E.g., for the UTF-8 byte sequence for 😄:
String.fromCodePoint(0xF0, 0x9F, 0x98, 0x84) =>
"\u00F0\u009F\u0098\u0084" =
"ð\u009F\u0098\u0084" (the last three are control characters).
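
The same mismatch as a runnable sketch, again assuming an engine with String.fromCodePoint and codePointAt (utf8Bytes and misread are just illustrative names). Each byte comes back as its own code point in the range U+0000 to U+00FF, so nothing resembling 😄 appears:

```javascript
// The four UTF-8 bytes that encode U+1F604.
const utf8Bytes = [0xF0, 0x9F, 0x98, 0x84];

// Each byte is interpreted as a separate code point, not decoded as UTF-8.
const misread = String.fromCodePoint(...utf8Bytes);

console.log(misread.length);                       // 4
console.log(misread === "\u00F0\u009F\u0098\u0084"); // true
console.log(misread === "\uD83D\uDE04");           // false

// The code points of the result are simply the byte values.
console.log(Array.from(misread, ch => ch.codePointAt(0).toString(16)));
// [ 'f0', '9f', '98', '84' ]
```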

Handling UTF-8 would require a way to identify the character encoding to convert from, which would be the beginning of an encoding conversion API, and the internationalization ad-hoc group decided not to work on one within ECMAScript. Such an API is being defined as part of the Encoding Standard project at WHATWG.
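
As a rough sketch of what going through an explicit encoding conversion looks like, here is the TextDecoder interface from the WHATWG Encoding Standard. This assumes a runtime that exposes TextDecoder as a global (current browsers and Node.js do); it is not part of the String.fromCodePoint proposal. The encoding is named explicitly, which is exactly the piece String.fromCodePoint has no way to express:

```javascript
// The UTF-8 byte sequence for U+1F604.
const bytes = new Uint8Array([0xF0, 0x9F, 0x98, 0x84]);

// The decoder is told which encoding the bytes are in.
const decoder = new TextDecoder("utf-8");
const decoded = decoder.decode(bytes);

console.log(decoded === "\uD83D\uDE04");             // true
console.log(decoded.codePointAt(0).toString(16));    // "1f604"
```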

Norbert


On Dec 12, 2012, at 7:46 , Erik Arvidsson wrote:

> It was suggested to me that we could probably extend String.fromCodePoint to be aware of UTF-16 code units too. It seems doable since the lead surrogate is not a valid code point.
> 
> The question is whether it is worth it. It seems like we are going down a slippery slope if we start to do things like this. Should we also handle UTF-8 code units? Maybe it is better not to do this and try to get people to move away from UTF-16 code units and move them towards code points.
> 
> -- 
> erik


