Code points vs Unicode scalar values

Anne van Kesteren annevk at
Wed Sep 4 12:57:28 PDT 2013

On Wed, Sep 4, 2013 at 6:22 PM, Allen Wirfs-Brock <allen at> wrote:
> WRT the larger issue, these APIs are for people who need to deal with text at the encoding level.

At that level you want to deal with bytes. And we have an API for
this: I'd hope people would be
smart enough not to add more encoding cruft, but we can't stop them
and I don't think this API should be designed for them.

> For example, they might be intentionally generating invalid UTF-16 encodings as part of a test driver.

Generate what though? If you want to generate surrogates you can
always go back to using 16-bit code units. There's no need for this to
leak through to the higher level abstraction.
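
As a rough illustration of that point, the sketch below builds strings
straight from 16-bit code units with String.fromCharCode, which is enough to
produce a lone surrogate without any code point API. The variable names are
only illustrative.

```javascript
// String.fromCharCode takes 16-bit code units, so it can emit a lone
// surrogate as well as a well-formed surrogate pair.
const loneHigh = String.fromCharCode(0xD834);      // unpaired high surrogate
const pair = String.fromCharCode(0xD834, 0xDD1E);  // surrogate pair for U+1D11E

console.log(loneHigh.length);                  // 1
console.log(pair.length);                      // 2
console.log(pair.charCodeAt(0).toString(16));  // "d834"
```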

> Note that the behavior of String.fromCodePoint parallels that of string literals:
> String.fromCodePoint(0x1d11e)
> String.fromCodePoint(0xd834, 0xdd1e)
> "\u{1d11e}"
> "\ud834\udd1e"
> all produce the same string value.
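
The equivalence claimed in the quoted text can be checked with a short sketch.
It assumes an engine that implements String.fromCodePoint and the "\u{...}"
escape as they later shipped in ES2015; at the time of this thread both were
still proposals. The surrogate pair for U+1D11E is 0xD834 0xDD1E.

```javascript
// Four spellings of U+1D11E (MUSICAL SYMBOL G CLEF).
const forms = [
  String.fromCodePoint(0x1D11E),          // one supplementary code point
  String.fromCodePoint(0xD834, 0xDD1E),   // two surrogate code points
  "\u{1D11E}",                            // code point escape
  "\uD834\uDD1E",                         // two code unit escapes
];

// Every form should be the same two-code-unit string.
const allEqual = forms.every((s) => s === forms[0]);

console.log(allEqual);                                // true
console.log(forms[0].length);                         // 2
console.log(forms[0].codePointAt(0).toString(16));    // "1d11e"
```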

If "\u{...}" is new, it'd be great if that banned surrogates too.
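
For reference, a Unicode scalar value is any code point outside the surrogate
range U+D800 to U+DFFF. The sketch below shows what the stricter behavior
could look like in user code. isScalarValue and fromScalarValues are
hypothetical helpers made up for this example, not part of any standard or
proposal.

```javascript
// Hypothetical helper: true for code points that are Unicode scalar values,
// i.e. integers in 0..0x10FFFF excluding the surrogates 0xD800..0xDFFF.
function isScalarValue(cp) {
  return Number.isInteger(cp) &&
    cp >= 0 && cp <= 0x10FFFF &&
    !(cp >= 0xD800 && cp <= 0xDFFF);
}

// Hypothetical strict wrapper: like String.fromCodePoint, but rejects
// surrogates instead of passing them through as lone code units.
function fromScalarValues(...cps) {
  for (const cp of cps) {
    if (!isScalarValue(cp)) {
      throw new RangeError("Not a Unicode scalar value: " + String(cp));
    }
  }
  return String.fromCodePoint(...cps);
}

console.log(fromScalarValues(0x1D11E) === "\uD834\uDD1E"); // true
```

With this wrapper, fromScalarValues(0xD834) throws a RangeError, so callers
who really want surrogates have to drop down to code units.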

I learned from Simon today that Rust is doing the same thing for its char
type. (Rust has some other issues where you can assign arbitrary byte
values to a string even in safe mode, but it's still early days in
that language.)


More information about the es-discuss mailing list