Full Unicode strings strawman
Mike Samuel
mikesamuel at gmail.com
Mon May 16 14:10:53 PDT 2011
2011/5/16 Wes Garland <wes at page.ca>:
> Mike Samuel, can you explain why you are en/decoding UTF-16 when
> round-tripping through the DOM?
I was UTF-16 encoding it because there will be host objects in
browsers that assume a UTF-16 encoding, so internal representations
based on UTF-16 can end up holding orphaned surrogates.
I was wondering how those strings round trip across host object
boundaries. When a programmer assigns a string to a property, they
expect a string with the same length to come out. When it doesn't,
hilarity ensues.
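
A minimal sketch of the failure mode I have in mind. makeHost is a
hypothetical stand-in for a host object, not a real DOM API, and I am
assuming it stores only well-formed Unicode and silently drops lone
surrogates on write. Real host objects may behave differently.

```javascript
// Hypothetical host object (an assumption for illustration only).
// It keeps well-formed code points and drops orphaned surrogates.
function makeHost() {
  let stored = [];
  return {
    set value(s) {
      // Array.from walks the string by code point; an orphaned
      // surrogate comes out as a one-unit string in D800..DFFF.
      stored = Array.from(s).filter(function (ch) {
        const cp = ch.codePointAt(0);
        return cp < 0xD800 || cp > 0xDFFF;
      });
    },
    get value() {
      return stored.join("");
    }
  };
}

const host = makeHost();

const paired = "a\uD83D\uDE00b";   // well-formed surrogate pair
host.value = paired;
console.log(paired.length, host.value.length);     // 4 4

const orphaned = "a\uD83Db";       // high surrogate with no partner
host.value = orphaned;
console.log(orphaned.length, host.value.length);   // 3 2
```

The second case is the one that worries me: the string that comes back
is shorter than the one that went in.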
> Does the DOM specify UTF-16 encoding?
Yes.
http://www.w3.org/TR/DOM-Level-2-Core/core.html#ID-C74D1578 says
Type Definition DOMString
A DOMString is a sequence of 16-bit units.
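
To illustrate what "a sequence of 16-bit units" means for a character
outside the BMP, here is a rough sketch. toUtf16Units is a hypothetical
helper name of my own, not part of the DOM or ES, and it assumes its
argument is a valid code point.

```javascript
// Split a code point into the 16-bit units a DOMString would hold.
// Hypothetical helper; assumes 0 <= codePoint <= 0x10FFFF.
function toUtf16Units(codePoint) {
  if (codePoint < 0x10000) {
    return [codePoint];            // fits in one unit
  }
  const offset = codePoint - 0x10000;
  return [
    0xD800 + (offset >> 10),       // high (lead) surrogate
    0xDC00 + (offset & 0x3FF)      // low (trail) surrogate
  ];
}

const units = toUtf16Units(0x1F600);
const smiley = String.fromCharCode(units[0], units[1]);

console.log(units.map(function (u) { return u.toString(16); }).join(" "));  // d83d de00
console.log(smiley.length);                        // 2
console.log(smiley.codePointAt(0).toString(16));   // 1f600
```

One character, two units, so length reports 2.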
> If it does, that's silly.
Yes, it is. It is also a published standard assumed by a lot of existing code.
> Both ES and DOM should specify "Unicode" and let the
> data interchange format be an implementation detail.
> It is an unfortunate
> accident of history that UTF-16 surrogate pairs leak their abstraction into
> ES Strings, and I believe it is high time we fixed that.
I agree, but there's a coordination problem. TC39 can't redefine
DOMString on their own.
The DOMString definition is not maintained by ECMA, and changes to it
affect bindings for Java and C#, which use UTF-16, and for languages
like Python, which is technically agnostic but is normally compiled to
treat a Unicode string as a sequence of UTF-16 code units.