Full Unicode strings strawman
Boris Zbarsky
bzbarsky at MIT.EDU
Mon May 16 14:52:03 PDT 2011
On 5/16/11 5:23 PM, Shawn Steele wrote:
> I’m having some (ok, a great deal of) confusion between the DOM Encoding
> and the JavaScript encoding and whatever. I’d assumed that if I had a
> web page in some encoding, it was converted to UTF-16 (well,
> UCS-2), and that’s what the JavaScript engine did its work on.
JS strings are currently defined as arrays of 16-bit unsigned integers.
I believe the intent at the time was that these could represent actual
Unicode strings encoded as UCS-2, but they can also represent arbitrary
arrays of 16-bit unsigned integers.
The DOM just uses JS strings for DOMString and defines DOMString to be
UTF-16. That's not quite compatible with UCS-2, but....
JS strings can contain integers that correspond to UTF-16 surrogates.
There are no constraints on what comes before or after them in JS strings.
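A quick sketch of that point, using only long-standing string APIs:

```javascript
// JS strings are sequences of 16-bit code units, so a lone surrogate
// is a valid one-element string; nothing requires it to be paired.
var lone = String.fromCharCode(0xD800);
console.log(lone.length);                      // 1
console.log(lone.charCodeAt(0).toString(16));  // "d800"

// A full surrogate pair occupies two code units but, read as UTF-16,
// represents a single code point (here U+10000).
var pair = String.fromCharCode(0xD800, 0xDC00);
console.log(pair.length);                      // 2
```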
> In UTF-8, individually encoded surrogates are illegal (and a security
> risk). Eg: you shouldn’t be able to encode D800/DC00 as two 3 byte
> sequences, they should be a single 6 byte sequence
A single 4 byte sequence, actually, last I checked.
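To make the byte counts concrete, here is a sketch using TextEncoder (a WHATWG API that postdates this thread; it's used here only to show the resulting bytes):

```javascript
var enc = new TextEncoder();

// U+10000 (the pair D800/DC00 in UTF-16) is one 4-byte UTF-8
// sequence, not two 3-byte sequences.
var bytes = enc.encode("\uD800\uDC00");
console.log(Array.from(bytes, function (b) { return b.toString(16); }));
// ["f0", "90", "80", "80"]

// A lone surrogate has no legal UTF-8 form; the encoder substitutes
// U+FFFD (EF BF BD) rather than emit an illegal surrogate sequence.
var loneBytes = enc.encode("\uD800");
console.log(Array.from(loneBytes, function (b) { return b.toString(16); }));
// ["ef", "bf", "bd"]
```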
> Having not played
> with the js encoding/decoding in quite some time, I’m not sure what they
> do in that case, but hopefully it isn’t illegal UTF-8.
I'm not sure which "they" and under what conditions we're considering here.
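For one concrete data point on the built-in JS encoding functions: encodeURIComponent percent-encodes via UTF-8, and it refuses a lone surrogate outright rather than produce illegal output:

```javascript
// A well-formed surrogate pair encodes as a single 4-byte UTF-8
// sequence, percent-escaped.
console.log(encodeURIComponent("\uD800\uDC00")); // "%F0%90%80%80"

// A lone surrogate throws URIError instead of yielding illegal UTF-8.
try {
  encodeURIComponent("\uD800");
} catch (e) {
  console.log(e.name); // "URIError"
}
```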
> (You also
> shouldn’t be able to have half a surrogate pair in UTF-16, but many
> things are pretty lax about that.)
Verily.
-Boris