Code points vs Unicode scalar values

Bjoern Hoehrmann derhoermi at
Sun Sep 22 13:25:54 PDT 2013

* Anne van Kesteren wrote:
>ES6 introduces String.prototype.codePointAt() and
>String.codePointFrom() as well as an iterator (not defined). It struck
>me this is the only place in the platform where we'd expose code point
>as a concept to developers.
>Nowadays strings are either 16-bit code units (JavaScript, DOM, etc.)
>or Unicode scalar values (anytime you hit the network and use utf-8).
>I'm not sure I'm a big fan of having all three concepts around. We
>could have String.prototype.unicodeAt() and String.unicodeFrom()
>instead, and have them translate lone surrogates into U+FFFD. Lone
>surrogates are a bug and I don't see a reason to expose them in more
>places than just the 16-bit code units.

I would regard that as silent data corruption, which has the odd habit
of causing hazardous anomalies in code and makes the code harder to
reason about.

This is akin to adding edge cases. There are many desirable properties a
function or its implementation can have, like purity and idempotence or
reflexivity. When functions and relations have such properties 99.99% of
the time, people tend to write code as if it had them without exception.

An example I came across today is this:

  var parsed = JSON.parse("-0");
  1 / parsed                             === -Infinity; // true
  1 / JSON.parse(JSON.stringify(parsed)) === -Infinity; // false

That is, JSON.parse preserves negative zero, but JSON.stringify does
not. At least, that is the case in Firefox and WebKit; in Opera 12.x
both comparisons are false. If JSON.stringify did not silently corrupt
negative zero into positive zero, we would probably have one fewer bug
to contend with.

If you look at `codePointAt` at the first position over the domain of
strings of .length 1, then it is injective; in fact, it is in effect the
identity function on the single code unit. And if you apply the
`fromCodePoint` method to the output of `codePointAt` in this case, the
data round-trips nicely. If instead the functions silently corrupted
data, so that `codePointAt` returned 0xFFFD both when the input was
0xFFFD and when hitting a lone surrogate, these properties would be
lost.
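To make the difference concrete, here is a minimal sketch. The function
`unicodeAt` below is hypothetical: it only imitates the replacement
behaviour proposed in the quoted message and is not part of any
specification. The sample inputs are my own choice.

```javascript
// Hypothetical variant that maps lone surrogates to U+FFFD, imitating
// the proposal quoted above. Not a real method.
function unicodeAt(str, pos) {
  var cp = str.codePointAt(pos);
  return (cp >= 0xD800 && cp <= 0xDFFF) ? 0xFFFD : cp;
}

// Three strings of .length 1: U+FFFD itself and two lone surrogates.
var inputs = ["\uFFFD", "\uD83D", "\uDE00"];

// codePointAt reports each code unit as-is, so the values stay distinct
// and round-trip through String.fromCodePoint.
var exact = inputs.map(function (s) { return s.codePointAt(0); });
var roundTrips = inputs.every(function (s) {
  return String.fromCodePoint(s.codePointAt(0)) === s;
});

// The replacing variant collapses all three inputs onto 0xFFFD, so the
// original string can no longer be recovered from the result.
var lossy = inputs.map(function (s) { return unicodeAt(s, 0); });

console.log(exact, roundTrips, lossy);
```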

Relatedly, if `codePointAt` threw an exception when hitting a lone
surrogate, you might very well end up with a bug that breaks your whole
application because someone accidentally put an emoji character at the
wrong position in a string in a database, and some unfortunate freak
combination of code-unit-oriented API calls, like .substring or a
regular expression, splits the emoji in the middle. Returning an error
code, like a negative number or undefined, might have the same effect,
depending on what happens if you pass those values to other
string-related functions.
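A small sketch of that scenario, under the assumption that the split
happens through a regular expression that works on code units. The
function `strictCodePointAt` is hypothetical and only imitates the
throwing behaviour discussed above; the sample string is made up.

```javascript
// Hypothetical strict variant that throws on a lone surrogate.
function strictCodePointAt(str, pos) {
  var cp = str.codePointAt(pos);
  if (cp >= 0xD800 && cp <= 0xDFFF) {
    throw new RangeError("lone surrogate at index " + pos);
  }
  return cp;
}

var stored = "ok \uD83D\uDE00";          // "ok " followed by U+1F600

// Without the `u` flag, `.` matches one 16-bit code unit, so taking
// four of them cuts the surrogate pair in half.
var clipped = /^.{4}/.exec(stored)[0];   // "ok \uD83D"

// Actual behaviour: the lone surrogate is reported as-is.
var actual = clipped.codePointAt(3);     // 0xD83D

// Hypothetical behaviour: the same data now raises an exception.
var threw = false;
try {
  strictCodePointAt(clipped, 3);
} catch (e) {
  threw = e instanceof RangeError;
}

console.log(actual.toString(16), threw);
```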

Note that emitting a replacement character when encountering character
encoding errors in bitstreams is a well-known form of hazardous silent
data corruption, and systems that require integrity forbid doing that.
As an example, the WebSocket protocol requires implementations to
consider a WebSocket connection fatally broken upon encountering a
malformed UTF-8 sequence in a text frame. That is the right thing to do
for two reasons. When the sender of those bytes sends the wrong bytes,
it may also send the wrong byte count, meaning payload data might be
misinterpreted as frame and message metadata (useful only to
attackers). And on the receiving end, emitting replacement characters
might change the byte length of a string while some code accidentally
keeps using the unmodified byte length in further processing, which
quickly leads to memory corruption bugs, which are very bad.
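The byte length change is easy to observe. The sketch below uses
`TextDecoder` and `TextEncoder` from the Encoding Standard, which are
my choice for illustration and are not mentioned above; it is not how
any WebSocket implementation is written. The three sample bytes are
made up.

```javascript
// 'a', the byte 0xFF (never valid in UTF-8), 'b'
var bytes = new Uint8Array([0x61, 0xFF, 0x62]);

// Lenient decoding substitutes U+FFFD for the bad byte.
var lenient = new TextDecoder("utf-8").decode(bytes);

// Re-encoding the result no longer yields the original byte count:
// U+FFFD takes three bytes in UTF-8 where the input had one.
var reencoded = new TextEncoder().encode(lenient);

// Fatal decoding refuses the input instead of altering it.
var rejected = false;
try {
  new TextDecoder("utf-8", { fatal: true }).decode(bytes);
} catch (e) {
  rejected = true;
}

console.log(bytes.length, reencoded.length, rejected);
```

Code that kept using `bytes.length` after lenient decoding would be off
by two bytes here.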

Unfortunately ECMAScript makes it very difficult to ensure you do not
generate strings with unpaired surrogate code points somewhere in them;
creating one is as easy as taking the first 157 .length units from a
string and perhaps appending "..." to abbreviate it. And it's a freak
accident if that actually happens in practice because non-BMP characters
are rare. We should be very reluctant to introduce hazards hoping to
improve our Unicode hygiene.
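
As an illustration of how such strings arise, here is a sketch with a
much smaller limit than 157. Both `abbreviate` and `hasLoneSurrogate`
are hypothetical helpers written for this example, and the sample text
is made up.

```javascript
// Naive abbreviation that counts 16-bit code units.
function abbreviate(str, limit) {
  return str.length > limit ? str.substring(0, limit) + "..." : str;
}

// Assumed helper: true if the string contains an unpaired surrogate.
function hasLoneSurrogate(str) {
  for (var i = 0; i < str.length; i++) {
    var unit = str.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      var next = str.charCodeAt(i + 1);   // NaN past the end
      if (next >= 0xDC00 && next <= 0xDFFF) { i++; continue; }
      return true;                        // lead without trail
    }
    if (unit >= 0xDC00 && unit <= 0xDFFF) return true; // stray trail
  }
  return false;
}

var text = "abc\uD83D\uDE00def";    // U+1F600 sits at indices 3 and 4

var fine = abbreviate(text, 3);     // cut falls before the pair
var broken = abbreviate(text, 4);   // cut falls inside the pair

console.log(hasLoneSurrogate(text), hasLoneSurrogate(fine),
            hasLoneSurrogate(broken));
```

Only the second cut produces a lone surrogate, which is why the problem
rarely shows up in testing.
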
Björn Höhrmann · mailto:bjoern at ·
Am Badedeich 7 · Telefon: +49(0)160/4415681 ·
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · 

More information about the es-discuss mailing list