Full Unicode strings strawman

Allen Wirfs-Brock allen at wirfs-brock.com
Mon May 16 15:42:31 PDT 2011


On May 16, 2011, at 2:23 PM, Shawn Steele wrote:

> I’m having some (ok, a great deal of) confusion between the DOM Encoding and the JavaScript encoding and whatever.  I’d assumed that if I had a web page in some encoding, that it was converted to UTF-16 (well, UCS-2), and that’s what the JavaScript engine did its work on.  I confess to not having done much encoding stuff in JS in the last decade.
>  
> In UTF-8, individually encoded surrogates are illegal (and a security risk).  E.g., you shouldn’t be able to encode D800/DC00 as two 3-byte sequences; they should be a single 4-byte sequence.  Having not played with the JS encoding/decoding in quite some time, I’m not sure what they do in that case, but hopefully it isn’t illegal UTF-8.  (You also shouldn’t be able to have half a surrogate pair in UTF-16, but many things are pretty lax about that.)
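
On the question of what the JS encoding functions do in that case, here is a minimal sketch using the standard encodeURIComponent function, which percent-encodes the UTF-8 form of a string. The variable names are illustrative only. A well-formed surrogate pair comes out as one 4-byte UTF-8 sequence, and a lone surrogate is rejected with a URIError rather than being emitted as illegal UTF-8.

```javascript
// U+10000 written as its UTF-16 surrogate pair D800/DC00.
const pair = "\uD800\uDC00";

// encodeURIComponent percent-encodes the UTF-8 bytes of the string.
// The pair becomes one 4-byte sequence, not two 3-byte sequences.
const encodedPair = encodeURIComponent(pair); // "%F0%90%80%80"

// Half of a surrogate pair has no legal UTF-8 form, so the call throws.
let loneSurrogateError = null;
try {
  encodeURIComponent("\uD800");
} catch (e) {
  loneSurrogateError = e;
}
const rejected = loneSurrogateError instanceof URIError; // true
```

Other APIs may treat unpaired surrogates differently, so this shows only what this one function does.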

I don't know as much about DOM behavior as I probably should, but the implication of the DOMString spec. is that if HTML source code contains a text element with supplemental characters (using any encoding recognized by a browser), then when the text of that element is accessed from JavaScript, the script sees each supplemental character as two JavaScript characters that, taken together, are the UTF-16 encoding of the original supplemental character.  I'm not proposing any changes in that regard.
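
As a small illustration of that two-character view (a sketch only; the variable names are made up for the example), here is what a script sees for U+10000, the first supplemental character, and how the two code units combine back into the original code point using the standard UTF-16 decoding arithmetic:

```javascript
// U+10000 as a script sees it: two 16-bit JavaScript characters.
const s = "\uD800\uDC00";

const length = s.length;      // 2: length counts UTF-16 code units
const high = s.charCodeAt(0); // 0xD800, the high (lead) surrogate
const low = s.charCodeAt(1);  // 0xDC00, the low (trail) surrogate

// Standard UTF-16 decoding: each surrogate carries 10 bits of the
// code point's offset from 0x10000.
const codePoint = (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
// codePoint is 0x10000
```

The same string value would arrive from a DOM text node whatever encoding the page was served in, since the conversion to UTF-16 happens before script access.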

There is a chicken-and-egg issue here.  The DOM will never evolve to directly support supplemental characters that are not UTF-16 encoded unless ECMAScript first provides such support.  It may take 20 years to get there, but that clock won't even start until ECMAScript provides the necessary support.

Allen


