Full Unicode strings strawman

Wes Garland wes at page.ca
Tue May 17 07:40:57 PDT 2011

On 16 May 2011 17:42, Boris Zbarsky <bzbarsky at mit.edu> wrote:

> On 5/16/11 4:38 PM, Wes Garland wrote:
>> Two great things about strings composed of Unicode code points:
> ...
>  Even though this is a breaking change from ES-5, I support it
>> whole-heartedly.... but I expect breakage to be very limited. Provided
>> that the implementation does not restrict the storage of reserved code
>> points (D800-DF00)
> Those aren't code points at all.  They're just not Unicode.

Not quite: code points D800-DFFF are reserved code points which are not
representable with UTF-16. Definition D71, Unicode 6.0.

> If you allow storage of such, then you're allowing mixing Unicode strings
> and "something else" (whatever the something else is), with most likely
> bad results.

I don't believe this is true. We are merely allowing storage of Unicode
strings which cannot be converted into UTF-16.   That allows us to maintain
most of the existing String behaviour  (arbitrary array of uint16), although
overflowing like this would break:

a = String.fromCharCode(str.charCodeAt(0) + 1)

when str[0] is U+FFFF.
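
A minimal sketch of that overflow under ES5 semantics, where String.fromCharCode converts each argument with ToUint16. The variable names are illustrative only, and the closing comment about the strawman is an assumption about the proposed behaviour rather than something any engine implements.

```javascript
// ES5: String.fromCharCode converts each argument with ToUint16,
// so 0xFFFF + 1 = 0x10000 wraps around to 0x0000.
var str = "\uFFFF";
var a = String.fromCharCode(str.charCodeAt(0) + 1);

var wrappedCodeUnit = a.charCodeAt(0); // 0 under ES5 semantics
var wrappedLength = a.length;          // 1

// Under the strawman, the same expression would presumably yield the
// single code point U+10000 instead (assumption; not current behaviour).
```

Code that relies on this wrap-around is the kind that would change behaviour.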

> Most simply, assigning a DOMString containing surrogates to a JS string
> should collapse the surrogate pairs into the corresponding codepoint if JS
> strings really contain codepoints...
> The only way to make this work is if either DOMString is redefined or
> DOMString and full Unicode strings are different kinds of objects.
>  Users doing surrogate pair decomposition will probably find that their
>> code "just works"
> How, exactly?

/** Untested and not rigorous */
function unicode_strlen(validUnicodeString) {
  var length = 0;
  for (var i = 0; i < validUnicodeString.length; i++)  {
    var unit = validUnicodeString.charCodeAt(i);
    /* A high surrogate (D800-DBFF) starts a pair: skip its low surrogate */
    if (unit >= 0xd800 && unit <= 0xdbff)
      i++;
    length++;
  }
  return length;
}

Code like this ^^^^ which looks for surrogate pairs in valid Unicode strings
will simply not find them, instead only finding code points which seem to
be the same size as the code unit.
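
To make the "just works" claim concrete, here is a small sketch that models a string as an array of numbers, so the same surrogate-aware loop can be run over today's UTF-16 code units and over whole code points. The function name and the array model are hypothetical stand-ins, since no engine exposes full Unicode strings this way.

```javascript
// Hypothetical stand-in: a "string" is modelled as an array of numbers,
// either UTF-16 code units (today) or whole code points (the strawman).
function surrogateAwareLength(elements) {
  var length = 0;
  for (var i = 0; i < elements.length; i++) {
    // A high surrogate (D800-DBFF) starts a pair; skip its partner.
    if (elements[i] >= 0xd800 && elements[i] <= 0xdbff) {
      i++;
    }
    length++;
  }
  return length;
}

// U+1D11E (musical symbol G clef) in both representations.
var asUtf16Units = [0xd834, 0xdd1e]; // surrogate pair, as stored today
var asCodePoints = [0x1d11e];        // single element under the strawman

var lengthToday = surrogateAwareLength(asUtf16Units);    // 1
var lengthStrawman = surrogateAwareLength(asCodePoints); // 1
```

In the second call the surrogate branch never fires, yet the answer is the same, which is the sense in which such code keeps working.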

>  Users creating Strings with surrogate pairs will need to
>> re-tool
> Such users would include the DOM, right?

I am hopeful that most web browsers have one or few interfaces between DOM
strings and JS strings.  I do not know if my hopes reflect reality.

>  but this is a small burden and these users will be at the upper
>> strata of Unicode-foodom.
> You're talking every single web developer here.  Or at least every single
> web developer who wants to work with Devanagari text.

I don't think so.  I bet if we could survey web developers across the
industry (rather than just top-tier people who tend to participate in
discussions like this one), we would find that the vast majority of them
never bother handling non-BMP cases, and do not test non-BMP cases.

Heck, I don't even know if a non-BMP character can be data-entered into an
<input type="text" maxlength="1"> or not. (Do you? What happens?)

>  I suspect that 99.99% of users will find that
>> this change will fix bugs in their code when dealing with non-BMP
>> characters.
> Not unless DOMString is changed or the interaction between the two very
> carefully defined in failure-proof ways.

Yes, I was dismayed to find out that DOMString is defined as UTF-16.

We could get away with converting to and from UTF-16 at the DOMString <>
JSString transition point.  This might mean that JSString=>DOMString could
throw, as full Unicode Strings could contain code points which are not
representable in UTF-16.

If we don't throw on invalid-in-UTF-16 code points, then round-tripping is
lossy. If we do, that's silly.
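
A rough sketch of what conversion at that boundary could look like, again modelling a full Unicode string as an array of code points. Both function names are hypothetical, and throwing a RangeError is just one possible choice. Nothing here reflects how any browser actually bridges DOM and JS strings.

```javascript
// Hypothetical JS => DOM conversion: code points in, UTF-16 string out.
function codePointsToUtf16(codePoints) {
  var out = "";
  for (var i = 0; i < codePoints.length; i++) {
    var cp = codePoints[i];
    if (cp >= 0xd800 && cp <= 0xdfff) {
      // Surrogate code points have no UTF-16 encoding.
      throw new RangeError("not representable in UTF-16: U+" + cp.toString(16));
    }
    if (cp < 0x10000) {
      out += String.fromCharCode(cp);
    } else {
      // Split a supplementary code point into a surrogate pair.
      var offset = cp - 0x10000;
      out += String.fromCharCode(0xd800 + (offset >> 10), 0xdc00 + (offset & 0x3ff));
    }
  }
  return out;
}

// Hypothetical DOM => JS conversion: surrogate pairs collapse to code points;
// an unpaired surrogate is passed through unchanged.
function utf16ToCodePoints(utf16) {
  var codePoints = [];
  for (var i = 0; i < utf16.length; i++) {
    var unit = utf16.charCodeAt(i);
    var next = utf16.charCodeAt(i + 1); // NaN past the end of the string
    if (unit >= 0xd800 && unit <= 0xdbff && next >= 0xdc00 && next <= 0xdfff) {
      codePoints.push(0x10000 + ((unit - 0xd800) << 10) + (next - 0xdc00));
      i++;
    } else {
      codePoints.push(unit);
    }
  }
  return codePoints;
}

var domString = codePointsToUtf16([0x61, 0x1d11e]); // "a" plus the pair D834 DD1E
var roundTripped = utf16ToCodePoints(domString);    // [0x61, 0x1d11e]
```

Well-formed text round-trips cleanly. The awkward case is a string holding a code point such as D800, where the first function has to either throw, as here, or substitute something and lose information.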

> It needed to specify _something_, and UTF-16 was the thing that was
> compatible with how scripts work in ES.  Not to mention the Java legacy of
> the DOM...

By this comment, I am inferring then that DOM and JS Strings share their
backing store.  From an API-cleanliness point of view, that's too bad. From
an implementation POV, it makes sense.  Actually, it makes even more sense
when I recall the discussion we had last week when you explained how
external strings etc work in SpiderMonkey/Gecko.

Do all the browsers share JS/DOM String backing stores?

>> It is an unfortunate accident of history that UTF-16 surrogate pairs leak
>> their abstraction into ES Strings, and I believe it is high time we fixed
>> that.
> If you can do that without breaking web pages, great.  If not, then we need
> to talk.  ;)

There is no question in my mind that this proposal would break Unicode-aware
JS.  It is my belief that that doesn't matter if it accompanies other major,
opt-in changes.

Resolving DOM String <> JS String interchange is a little trickier, but I
think it can be managed if we can allow JS=>DOM to throw when high surrogate
code points are encountered in the JS String.  It might mean extra copying,
or it might not if the DOM implementation already uses UTF-8 internally.


Wesley W. Garland
Director, Product Development
PageMail, Inc.
+1 613 542 2787 x 102