Code points vs Unicode scalar values

Allen Wirfs-Brock allen at wirfs-brock.com
Wed Sep 4 10:22:17 PDT 2013


On Sep 4, 2013, at 9:46 AM, Brendan Eich wrote:

> Mathias Bynens wrote:
>> I think what Anne means to say is that `String.fromCodePoint(0xD800)` returns `'\uD800'` as per that algorithm, which is a lone surrogate (and not a scalar value).
> 
> Gotcha. Yes, the new APIs seem to let you write and read lone surrogates. But the legacy APIs won't go away, and IIRC the reasoning is that we're better off exposing the data than trying to abstract away from it in the new APIs. Allen?

First, a couple of meta points:
  1)  This stuff is mostly Norbert's design, so he may be able to provide better rationale for some of the decisions.
  2)  There are a number of open bugs on the current spec WRT Unicode handling.  We'll get around to those soon.

WRT the larger issue, these APIs are for people who need to deal with text at the encoding level. They might be writing their own encoders/decoders/translators. At that level, surrogates really are valid code points even though they are not valid Unicode scalar values. People programming at that level sometimes have to deal with malformed encodings. For example, they might be intentionally generating invalid UTF-16 encodings as part of a test driver.
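
To make the distinction concrete, here is a rough sketch. The helper name `isScalarValue` is made up for illustration and is not part of the spec draft; the rest uses only `String.fromCodePoint` and `String.prototype.codePointAt` as currently drafted.

```javascript
// Hypothetical helper (not a spec API): a Unicode scalar value is any
// code point in 0..0x10FFFF that is outside the surrogate range
// 0xD800..0xDFFF.
function isScalarValue(cp) {
  return Number.isInteger(cp) && cp >= 0 && cp <= 0x10FFFF &&
         !(cp >= 0xD800 && cp <= 0xDFFF);
}

// A lone surrogate is accepted as a code point and yields a
// one-code-unit string.
const lone = String.fromCodePoint(0xD800);

// Reading it back exposes the same lone surrogate rather than hiding it.
const loneCodePoint = lone.codePointAt(0);

// 0xD800 is a valid code point but not a scalar value.
const loneIsScalar = isScalarValue(loneCodePoint);
```

Code that needs well-formed text could apply a check like this itself; the APIs leave that decision to the caller.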

Note that the behavior of String.fromCodePoint parallels that of string literals:

String.fromCodePoint(0x1d11e)
String.fromCodePoint(0xd834,0xdd1e)
"\u{1d11e}"
"\ud834\udd1e"

all produce the same string value.
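
For anyone who wants to check the arithmetic, the surrogate pair for U+1D11E can be derived with the standard UTF-16 encoding formula. This is only a sketch; `toSurrogatePair` is a name invented here, and it assumes its argument is a supplementary code point (0x10000..0x10FFFF).

```javascript
// Hypothetical helper: split a supplementary code point into its
// UTF-16 high and low surrogates.
function toSurrogatePair(cp) {
  const offset = cp - 0x10000;            // 20-bit value
  const high = 0xD800 + (offset >> 10);   // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1D11E);

// Building the string from the code point or from its two surrogates
// gives the same two-code-unit string value.
const fromCodePoint = String.fromCodePoint(0x1D11E);
const fromSurrogates = String.fromCodePoint(high, low);
const sameValue = fromCodePoint === fromSurrogates;
```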

Allen


