Full Unicode strings strawman

Boris Zbarsky bzbarsky at MIT.EDU
Mon May 16 18:56:55 PDT 2011


On 5/16/11 7:05 PM, Allen Wirfs-Brock wrote:
>> If you allow storage of such, then you're allowing mixing Unicode strings and "something else" (whatever the something else is), with most likely bad results.
>>
>> Most simply, assigning a DOMString containing surrogates to a JS string should collapse the surrogate pairs into the corresponding codepoint if JS strings really contain codepoints...
>
> No, that would be a breaking change to the web!

OK, we agree there.

>> The only way to make this work is if either DOMString is redefined or DOMString and full Unicode strings are different kinds of objects.
>
> Not really, you need to make the distinction between what a String can contain and what String contents are valid in specific application domains.
>
> DOMString seems to be quite clearly defined to consist of 16-bit valued elements interpreted as a UTF-16 encoded Unicode string.

Right, because that's all it can be given what ES string is right now.

> All such DOMStrings are valid ES strings according to my proposal, but it isn't the case that all ES Strings are valid DOMStrings.

OK, what happens when such an ES string is passed to an interface that 
takes a DOMString?

What happens when such an ES string is concatenated with a DOMString 
containing surrogate pairs?  I don't mean on the implementation level 
for the concatenation case; that part is trivial.  I mean what can the 
programmer sanely do with the result?
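
To make the concatenation question concrete, here is a minimal sketch. It is only a model: arrays of numbers stand in for string elements, since no engine implements the strawman, and the names domUnits, fullUnicode and countBySurrogatePairs are made up for illustration.

```javascript
// Sketch only: arrays of numbers stand in for string elements.
// U+1D11E as a DOMString would hold it: a UTF-16 surrogate pair.
const domUnits = [0xD834, 0xDD1E];
// The same character as a hypothetical full Unicode string: one element.
const fullUnicode = [0x1D11E];

// The concatenation itself is the trivial part.
const mixed = domUnits.concat(fullUnicode);

// Counts characters the way surrogate-aware code does today:
// a high surrogate followed by a low surrogate is one character.
function countBySurrogatePairs(elements) {
  let count = 0;
  for (let i = 0; i < elements.length; i++) {
    const isHigh = elements[i] >= 0xD800 && elements[i] <= 0xDBFF;
    const nextIsLow = i + 1 < elements.length &&
      elements[i + 1] >= 0xDC00 && elements[i + 1] <= 0xDFFF;
    if (isHigh && nextIsLow) i++; // skip the low half of the pair
    count++;
  }
  return count;
}
```

In this model mixed has three elements and countBySurrogatePairs(mixed) reports two characters, but those two characters are the same codepoint in two different representations, and no element-wise comparison will say so.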

> To the depth of my understanding, this is already the case today with 16-bit ES characters.  You can create an ES string which does not conform to the UTF-16 encoding rules.

In practice browsers just let you put non-UTF16 stuff in the DOM and 
then fake things, as far as I can tell.  There are a few bugs around on 
not allowing that, but the cost is too high (e.g. there are popular 
browser implementations in which the DOM and JS share the same reference 
counted string buffer when you pass a string across the boundary).
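
For illustration, this is the kind of string in question: a perfectly legal ES string that is not well-formed UTF-16. The checker below is a hypothetical helper written for this example (the name isWellFormedUTF16 is an assumption, not an existing API).

```javascript
// A lone high surrogate: a legal ES string, but not well-formed UTF-16.
const lone = "\uD800";

// Returns true when every surrogate in s is part of a proper pair.
function isWellFormedUTF16(s) {
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next < 0xDC00 || next > 0xDFFF) return false; // unpaired high
      i++; // skip the matching low surrogate
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      return false; // low surrogate with no preceding high
    }
  }
  return true;
}
```

Here isWellFormedUTF16(lone) is false, yet nothing in ES stops a script from creating that string and handing it to the DOM.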

>>> Users doing surrogate pair decomposition will probably find that their code "just works"
>>
>> How, exactly?
>
> Because the string will continue to contain surrogate pairs.

Until someone somewhere else in the workflow (say an ad on the page, in 
the browser context) adds in a string containing non-BMP codepoints, right?

>>> Users creating Strings with surrogate pairs will need to
>>> re-tool
>>
>> Such users would include the DOM, right?
>
> No. That would be a breaking change in the context of the browser.

OK...

> Programs creating surrogate pairs that want to be updated to not use surrogate pairs are the only ones that need to retool.

I think this is compartmentalizing programs in ways that don't map to 
reality.

> More likely we are talking about new code that can be written without having to worry about surrogate pairs.

And again here.  A lot of JS on the web takes strings from all sorts of 
sources not necessarily under the control of the JS itself (user input, 
XMLHttpRequest of XML and JSON, etc), then mashes them together in 
various ways.

> If somebody wants to grab a bunch of text from the DOM and manipulate it without encountering surrogate pairs, they will need to explicitly perform a decodeUTF16 transformation.

What if they don't want to encounter non-BMP characters except in 
surrogate pair form (i.e. have the environment they have now)?
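
A rough sketch of what the two directions might look like follows. The name decodeUTF16 comes from the proposal text quoted above, but its signature here is my assumption, and encodeUTF16 is a hypothetical inverse. Arrays of numbers stand in for full Unicode strings, since today's strings cannot hold 21-bit elements.

```javascript
// Hypothetical decodeUTF16: collapses surrogate pairs into codepoints.
// Returns an array of numbers as a stand-in for a full Unicode string.
function decodeUTF16(s) {
  const out = [];
  for (let i = 0; i < s.length; i++) {
    const hi = s.charCodeAt(i);
    const lo = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
    if (hi >= 0xD800 && hi <= 0xDBFF && lo >= 0xDC00 && lo <= 0xDFFF) {
      out.push(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00));
      i++; // the low surrogate has been consumed
    } else {
      out.push(hi); // BMP unit or unpaired surrogate, passed through as-is
    }
  }
  return out;
}

// Hypothetical inverse: turns codepoints back into surrogate-pair form.
function encodeUTF16(codePoints) {
  let s = "";
  for (const cp of codePoints) {
    if (cp > 0xFFFF) {
      const v = cp - 0x10000;
      s += String.fromCharCode(0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF));
    } else {
      s += String.fromCharCode(cp);
    }
  }
  return s;
}
```

Code that wants to keep the environment it has now would have to run something like encodeUTF16 on every string it receives from a source it does not control.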

>> You're talking every single web developer here.  Or at least every single web developer who wants to work with Devanagari text.
>
> No, they will probably always have a choice for their own internal processing.  Deal with logical 16-bit characters that use UTF-16.  Or deal with logical 21-bit characters.  Only when communicating with an external agent (for example the DOM) do you have to adapt to that agent's requirements.

Web JS is _all_ about communicating with external agents.  That's its 
purpose in life, for the most part.

> Somebody has to go first.  I'm saying that it has to be ES that goes first. ES can do this without breaking any existing web code.

I disagree on "somebody has to go first"; it should be possible to 
coordinate such a change.

I agree that if we impose an ordering then clearly ES has to go first.

I think that if we made this change to ES only today and the new 
capabilities were completely unused, no existing web code would break.

I also think that if we made this change to ES only today and then part 
but not all of the web got changed to use the new capabilities, we would 
break some web code.

I will posit as an axiom that any changes to the web in terms of 
adopting the new feature will be incremental (please let me know if 
there is reason to think this is not the case).  A corollary that I 
believe to be true is that we therefore have to assume that "old 
strings" and "new strings" will coexist in the set of strings scripts 
have to handle as things stand.  This may be true no matter what the DOM 
does, but is _definitely_ true if the DOM remains as it is.

-Boris


More information about the es-discuss mailing list