Full Unicode strings strawman

Mike Samuel mikesamuel at gmail.com
Mon May 16 14:10:53 PDT 2011


2011/5/16 Wes Garland <wes at page.ca>:
> Mike Samuel, can you explain why you are en/decoding UTF-16 when
> round-tripping through the DOM?

I was UTF-16 encoding it because browsers will have host objects that
assume a UTF-16 encoding, so internal representations based on UTF-16
can end up holding orphaned surrogates.

I was wondering how those strings round trip across host object
boundaries.  When a programmer assigns a string to a property, they
expect a string with the same length to come out.  When it doesn't,
hilarity ensues.
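
A minimal sketch of that length hazard, assuming a hypothetical
"full Unicode" string modelled as an array of code points in which lone
surrogates are allowed as individual elements.  The helper names
encodeUtf16 and decodeUtf16 are illustrative only and do not come from
any proposal or spec:

```javascript
// Hypothetical model: a full-Unicode string is an array of code points.
// Encode it to UTF-16 code units, as a UTF-16 based host object might.
function encodeUtf16(codePoints) {
  const units = [];
  for (const cp of codePoints) {
    if (cp > 0xFFFF) {
      // Supplementary code point: split into a surrogate pair.
      const v = cp - 0x10000;
      units.push(0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF));
    } else {
      // BMP code point, including any lone surrogate, passes through.
      units.push(cp);
    }
  }
  return units;
}

// Decode UTF-16 code units back to code points, pairing surrogates.
function decodeUtf16(units) {
  const codePoints = [];
  for (let i = 0; i < units.length; i++) {
    const hi = units[i];
    const lo = units[i + 1];
    if (hi >= 0xD800 && hi <= 0xDBFF && lo >= 0xDC00 && lo <= 0xDFFF) {
      codePoints.push(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00));
      i++; // consumed the low surrogate as well
    } else {
      codePoints.push(hi);
    }
  }
  return codePoints;
}

// Two orphaned surrogates that happen to sit next to each other.
const original = [0xD834, 0xDD1E];
const roundTripped = decodeUtf16(encodeUtf16(original));
```

Under these assumptions, original has two elements but roundTripped has
one: the two orphans fuse into U+1D11E on the way back, so the string
that comes out is shorter than the string that went in.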

> Does the DOM specify UTF-16 encoding?

Yes.
http://www.w3.org/TR/DOM-Level-2-Core/core.html#ID-C74D1578 says

   Type Definition DOMString
   A DOMString is a sequence of 16-bit units.
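
   As an illustration of what "a sequence of 16-bit units" means in
   practice, here is a small sketch using present-day ECMAScript
   strings, which expose the same 16-bit view.  The variable names are
   illustrative, and codePointAt and Array.from are later additions to
   the language, used here only to show the contrast:

```javascript
// U+1D11E MUSICAL SYMBOL G CLEF, written as its UTF-16 surrogate pair.
const clef = "\uD834\uDD1E";

// View the string as 16-bit units, the way DOMString is defined.
const units = [];
for (let i = 0; i < clef.length; i++) {
  units.push(clef.charCodeAt(i));
}

// View the same string as code points (iteration pairs up surrogates).
const codePoints = Array.from(clef, (ch) => ch.codePointAt(0));
```

   Here units holds two 16-bit values while codePoints holds a single
   entry, which is the abstraction leak under discussion.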

> If it does, that's silly.

Yes, it is.  It is also a published standard assumed by a lot of existing code.


> Both ES and DOM should specify "Unicode" and let the
> data interchange format be an implementation detail.

> It is an unfortunate
> accident of history that UTF-16 surrogate pairs leak their abstraction into
> ES Strings, and I believe it is high time we fixed that.

I agree, but there's a coordination problem.  TC39 can't redefine
DOMString on their own.
The DOMString definition is not maintained by ECMA, and changes to it
affect bindings for Java and C#, which use UTF-16, and for languages
like Python, which is technically agnostic but is normally compiled to
treat a unicode string as a sequence of UTF-16 code units.

