Full Unicode strings strawman

Boris Zbarsky bzbarsky at MIT.EDU
Tue May 17 09:36:58 PDT 2011


On 5/17/11 10:40 AM, Wes Garland wrote:
> On 16 May 2011 17:42, Boris Zbarsky <bzbarsky at mit.edu> wrote:
>     Those aren't code points at all.  They're just not Unicode.
>
> Not quite: code points D800-DFFF are reserved code points which are not
> representable with UTF-16.

Nor with any other Unicode encoding, really.  They don't represent, on 
their own, Unicode characters.
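
Even today's ES operations that need a real Unicode encoding reject them.
For instance:

  // A lone surrogate has no UTF-8 encoding, so ES5's URI functions throw.
  encodeURIComponent("\uD800");        // throws URIError
  encodeURIComponent("\uD834\uDD1E");  // fine: a well-formed surrogate pair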

>     If you allow storage of such, then you're allowing mixing Unicode
>     strings and "something else" (whatever the something else is), with
>     most likely bad results.
>
> I don't believe this is true. We are merely allowing storage of Unicode
> strings which cannot be converted into UTF-16.

No, you're allowing storage of some sort of number arrays that don't 
represent Unicode strings at all.

>         Users doing surrogate pair decomposition will probably find that
>         their code "just works"
>
>     How, exactly?
>
> /** Untested and not rigorous */
> function unicode_strlen(validUnicodeString) {
>   var length = 0;
>   for (var i = 0; i < validUnicodeString.length; i++) {
>     // Skip trailing (low) surrogates so each surrogate pair counts once.
>     var c = validUnicodeString.charCodeAt(i);
>     if (c >= 0xdc00 && c <= 0xdfff)
>       continue;
>     length++;
>   }
>   return length;
> }
>
> Code like this ^^^^ which looks for surrogate pairs in valid Unicode
> strings will simply not find them, instead only finding code points
> which seem to be the same size as the code unit.

Right, so if it's looking for non-BMP characters in the string, say, 
instead of computing the length, it won't find them.  How the heck is 
that "just works"?

>         Users creating Strings with surrogate pairs will need to
>         re-tool
>
>     Such users would include the DOM, right?
>
> I am hopeful that most web browsers have one or a few interfaces between
> DOM strings and JS strings.

A number of web browsers have an interface between DOM and JS strings 
that consists of either "memcpy" or "addref the buffer".

> I do not know if my hopes reflect reality.

They probably do, so you're really only talking about at least 10 
different places across at least 5 different codebases that have to be 
fixed in a coordinated way...


>     You're talking every single web developer here.  Or at least every
>     single web developer who wants to work with Devanagari text.
>
> I don't think so.  I bet if we could survey web developers across the
> industry (rather than just top-tier people who tend to participate in
> discussions like this one), we would find that the vast majority of them
> never bother handling non-BMP cases, and do not test non-BMP cases.

And how many of them use libraries that handle that for them?

And how many implicitly rely on DOM-to-JS roundtripping without 
explicitly doing anything with non-BMP chars or surrogate pairs?

> Heck, I don't even know if a non-BMP character can be data-entered into
> an <input type="text" maxlength="1"> or not. (Do you? What happens?)

It cannot in Gecko, as I recall; there, maxlength is interpreted as the 
number of UTF-16 code units.

In WebKit, maxlength is interpreted as the number of grapheme clusters 
based on my look at their code just now.

I don't know offhand about Presto and Trident, for obvious reasons.
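
For concreteness, here is why the two interpretations differ (illustrative
snippet; U+1D11E chosen arbitrarily):

  // One non-BMP character is one code point and one grapheme cluster,
  // but two UTF-16 code units.
  var clef = "\uD834\uDD1E";  // U+1D11E MUSICAL SYMBOL G CLEF
  clef.length;                // 2, so it overflows maxlength="1" under
                              // the code-unit interpretation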

> We could get away with converting to/from UTF-16 at the DOMString <>
> JSString transition point.

What would that even mean?  DOMString is defined to be an ES string in 
the ES binding right now.  Is the proposal to have some other kind of 
object for DOMString (so that, for example, String.prototype would no 
longer affect the behavior of DOMString the way it does now)?

> This might mean that JSString=>DOMString could throw, as full Unicode
> strings could contain code points which are not representable in UTF-16.

How is that different from sticking non-UTF-16 into an ES string right now?
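
For instance, nothing stops a page from doing this today:

  // An ES string containing an unpaired surrogate -- legal right now.
  var s = "\uD800";         // length 1; not a valid Unicode scalar value
  var t = "a" + s + "b";    // flows through JS (and into the DOM) as-is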

> If it doesn't throw on invalid-in-UTF-16 code points, then round-tripping
> is lossy. If it does, that's silly.

So both options suck, yes?  ;)
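
To make the trade-off concrete, a hypothetical sketch (the function and its
flag are invented here, not anyone's actual API) of the two policies:

  // "codePoints" stands in for the proposal's full-Unicode string contents.
  function toUTF16(codePoints, shouldThrow) {
    var out = "";
    for (var i = 0; i < codePoints.length; i++) {
      var cp = codePoints[i];
      if (cp >= 0xd800 && cp <= 0xdfff) {  // unencodable in UTF-16
        if (shouldThrow) throw new RangeError("lone surrogate code point");
        out += "\uFFFD";                   // lossy: round-trip loses data
      } else if (cp > 0xffff) {            // split into a surrogate pair
        var v = cp - 0x10000;
        out += String.fromCharCode(0xd800 + (v >> 10), 0xdc00 + (v & 0x3ff));
      } else {
        out += String.fromCharCode(cp);
      }
    }
    return out;
  }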

>     It needed to specify _something_, and UTF-16 was the thing that was
>     compatible with how scripts work in ES.  Not to mention the Java
>     legacy of the DOM...
>
> By this comment, I am inferring then that DOM and JS Strings share their
> backing store.

That's not what the comment was about, actually.  The comment was about API.

But yes, in many cases they do share backing store.

> Do all the browsers share JS/DOM String backing stores?

Gecko does in some cases.

WebKit+JSC does in all cases, I believe (or at least a large majority of 
cases).

I don't know about others.

> There is no question in my mind that this proposal would break
> Unicode-aware JS.

As far as I can tell it would also break Unicode-unaware JS.

> It is my belief that that doesn't matter if it
> accompanies other major, opt-in changes.

If it's opt-in, perhaps.

> Resolving DOM String <> JS String interchange is a little trickier, but
> I think it can be managed if we can allow JS=>DOM to throw when high
> surrogate code points are encountered in the JS String.

I'm 99% sure this would break sites.

> It might mean extra copying, or it might not if the DOM implementation already uses
> UTF-8 internally.

Uh... what does UTF-8 have to do with this?

(As a note, Gecko and WebKit both use UTF-16 internally; I would be 
_really_ surprised if Trident does not.  No idea about Presto.)

-Boris

