Full Unicode strings strawman
allen at wirfs-brock.com
Mon May 16 16:05:41 PDT 2011
On May 16, 2011, at 2:42 PM, Boris Zbarsky wrote:
> On 5/16/11 4:38 PM, Wes Garland wrote:
>> Two great things about strings composed of Unicode code points:
>> Even though this is a breaking change from ES-5, I support it
>> whole-heartedly.... but I expect breakage to be very limited. Provided
>> that the implementation does not restrict the storage of reserved code
>> points (D800-DF00)
> Those aren't code points at all. They're just not Unicode.
> If you allow storage of such, then you're allowing mixing Unicode strings and "something else" (whatever the something else is), with most likely bad results.
> Most simply, assigning a DOMString containing surrogates to a JS string should collapse the surrogate pairs into the corresponding codepoint if JS strings really contain codepoints...
No, that would be a breaking change to the web!
> The only way to make this work is if either DOMString is redefined or DOMString and full Unicode strings are different kinds of objects.
Not really. You need to make the distinction between what a String can contain and what String contents are valid in specific application domains.
DOMString seems to be quite clearly defined to consist of 16-bit valued elements interpreted as a UTF-16 encoded Unicode string.
All such DOMStrings are valid ES strings according to my proposal, but it isn't the case that all ES Strings are valid DOMStrings. To the best of my understanding, this is already the case today with 16-bit ES characters. You can create an ES string that does not conform to the UTF-16 encoding rules.
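To illustrate the last point, here is a minimal sketch of an ES string that is not valid UTF-16. The helper name isWellFormedUTF16 is hypothetical (it is not part of ES or the DOM) and is only there to make the pairing rule explicit:

```javascript
// U+1D11E (MUSICAL SYMBOL G CLEF) stored as a correctly paired high + low surrogate.
const wellFormed = "\uD834\uDD1E";
// A high surrogate with no following low surrogate: a legal ES string,
// but not a valid UTF-16 encoding of any Unicode text.
const lone = "\uD834";

// Hypothetical helper, for illustration only: returns true when every
// surrogate element in s is part of a correctly ordered high/low pair.
function isWellFormedUTF16(s) {
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // High surrogate: the next element must be a low surrogate.
      const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next < 0xDC00 || next > 0xDFFF) return false;
      i++; // skip the low surrogate we just validated
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      // Low surrogate that was not preceded by a high surrogate.
      return false;
    }
  }
  return true;
}
```

Both strings are accepted by the language; only the first would pass the check, which is the sense in which not every ES string is a valid DOMString.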
>> Users doing surrogate pair decomposition will probably find that their code "just works"
> How, exactly?
Because the string will continue to contain surrogate pairs.
>> Users creating Strings with surrogate pairs will need to
> Such users would include the DOM, right?
No. That would be a breaking change in the context of the browser. Programs that create surrogate pairs and want to be updated to not use them are the only ones that need to retool. More likely we are talking about new code that can be written without having to worry about surrogate pairs. If somebody wants to grab a bunch of text from the DOM and manipulate it without encountering surrogate pairs, they will need to explicitly perform a decodeUTF16 transformation.
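As a rough sketch of what such a transformation does, the function below combines surrogate pairs into single code point values. It is an assumption-laden illustration, not the strawman's definition: the name decodeUTF16Sketch is hypothetical, it returns an array of numbers because today's strings cannot hold 21-bit elements, and passing unpaired surrogates through unchanged is one possible policy among several.

```javascript
// Hypothetical sketch: convert a string of 16-bit elements into an array
// of code point values, combining each high/low surrogate pair into one.
function decodeUTF16Sketch(s) {
  const codePoints = [];
  for (let i = 0; i < s.length; i++) {
    const hi = s.charCodeAt(i);
    if (hi >= 0xD800 && hi <= 0xDBFF && i + 1 < s.length) {
      const lo = s.charCodeAt(i + 1);
      if (lo >= 0xDC00 && lo <= 0xDFFF) {
        // Standard UTF-16 pair arithmetic: 10 bits from each surrogate,
        // offset by 0x10000 to land in the supplementary planes.
        codePoints.push(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00));
        i++; // consume the low surrogate
        continue;
      }
    }
    // BMP element or unpaired surrogate: passed through unchanged (assumption).
    codePoints.push(hi);
  }
  return codePoints;
}

// Text as it might arrive from the DOM: "a", U+1D11E as a surrogate pair, "b".
const fromDOM = "a\uD834\uDD1Eb";
const decoded = decodeUTF16Sketch(fromDOM);
```

Here fromDOM has four 16-bit elements while decoded has three entries, so code working on the decoded form never sees a surrogate pair.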
>> but this is a small burden and these users will be at the upper
>> strata of Unicode-foodom.
> You're talking every single web developer here. Or at least every single web developer who wants to work with Devanagari text.
No, they will probably always have a choice for their own internal processing: deal with logical 16-bit characters that use UTF-16, or deal with logical 21-bit characters. Only when communicating with an external agent (for example the DOM) do you have to adapt to that agent's requirements.
>> I suspect that 99.99% of users will find that
>> this change will fix bugs in their code when dealing with non-BMP
> Not unless DOMString is changed or the interaction between the two very carefully defined in failure-proof ways.
>> Why do we care about the UTF-16 representation of particular
> Because of DOMString's use of UTF-16, at least (forced on it by the fact that that's what ES used to do, but here we are).
>> Mike Samuel, can you explain why you are en/decoding UTF-16 when
>> round-tripping through the DOM? Does the DOM specify UTF-16 encoding?
>> If it does, that's silly.
> It needed to specify _something_, and UTF-16 was the thing that was compatible with how scripts work in ES. Not to mention the Java legacy of the DOM...
>> Both ES and DOM should specify "Unicode" and let the data interchange format be an implementation detail.
> That's fine if _both_ are changed. Changing just one without the other would just cause problems.
Somebody has to go first. I'm saying that it has to be ES that goes first. ES can do this without breaking any existing web code.