Full Unicode strings strawman

Boris Zbarsky bzbarsky at MIT.EDU
Mon May 16 14:52:03 PDT 2011


On 5/16/11 5:23 PM, Shawn Steele wrote:
> I’m having some (ok, a great deal of) confusion between the DOM Encoding
> and the JavaScript encoding and whatever. I’d assumed that if I had a
> web page in some encoding, that it was converted to UTF-16 (well,
> UCS-2), and that’s what the JavaScript engine did its work on.

JS strings are currently defined as arrays of 16-bit unsigned integers.
I believe the intent at the time was that these could represent actual
Unicode strings encoded as UCS-2, but they can also represent arbitrary
arrays of 16-bit unsigned integers.

The DOM just uses JS strings for DOMString and defines DOMString to be 
UTF-16.  That's not quite compatible with UCS-2, but....

JS strings can contain integers that correspond to UTF-16 surrogates.
There are no constraints on what comes before/after them in JS strings.
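A quick sketch of what that means in practice (a hypothetical snippet, not
from the thread; `charCodeAt` reads the raw 16-bit code unit):

```javascript
// JS strings are sequences of 16-bit code units, so an unpaired
// surrogate is a perfectly legal string value.
var lone = "\uD800";              // high surrogate with nothing after it
console.log(lone.length);         // 1 -- one code unit
console.log(lone.charCodeAt(0));  // 55296 (0xD800)

// A proper surrogate pair is still just two code units to JS; the
// language itself imposes no pairing constraint.
var pair = "\uD800\uDC00";        // together these denote U+10000
console.log(pair.length);         // 2
```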

> In UTF-8, individually encoded surrogates are illegal (and a security
> risk). Eg: you shouldn’t be able to encode D800/DC00 as two 3 byte
> sequences, they should be a single 6 byte sequence

A single 4-byte sequence, actually, last I checked.
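To make the byte counts concrete, here is a sketch of the arithmetic (not
from the thread): the pair D800/DC00 decodes to U+10000, which well-formed
UTF-8 encodes in four bytes, while the forbidden form would spend three
bytes on each surrogate.

```javascript
// Decode the surrogate pair D800/DC00 to its code point, then encode
// that code point in UTF-8 by hand. Illustrative arithmetic only.
var hi = 0xD800, lo = 0xDC00;
var cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);  // U+10000

// Correct UTF-8: a single 4-byte sequence.
var utf8 = [
  0xF0 | (cp >> 18),
  0x80 | ((cp >> 12) & 0x3F),
  0x80 | ((cp >> 6) & 0x3F),
  0x80 | (cp & 0x3F)
];
// utf8 is [0xF0, 0x90, 0x80, 0x80]

// Encoding each surrogate separately as its own 3-byte sequence
// (the CESU-8 style form) takes six bytes and is ill-formed UTF-8;
// conforming decoders must reject it.
```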

> Having not played
> with the js encoding/decoding in quite some time, I’m not sure what they
> do in that case, but hopefully it isn’t illegal UTF-8.

I'm not sure which "they" and under what conditions we're considering here.

> (You also
> shouldn’t be able to have half a surrogate pair in UTF-16, but many
> things are pretty lax about that.)

Verily.

-Boris
