Full Unicode strings strawman

Boris Zbarsky bzbarsky at MIT.EDU
Tue May 17 11:39:46 PDT 2011


On 5/17/11 2:12 PM, Wes Garland wrote:
> That said, you can encode these code points with utf-8; for example,
> 0xdc08 becomes 0xed 0xb0 0x88.

By the same argument, you can encode them in UTF-16.  The byte sequence 
above is not valid UTF-8.  See "How do I convert an unpaired UTF-16 
surrogate to UTF-8?" at http://unicode.org/faq/utf_bom.html which says:

   A different issue arises if an unpaired surrogate is encountered
   when converting ill-formed UTF-16 data. By representing such an
   unpaired surrogate on its own as a 3-byte sequence, the resulting
   UTF-8 data stream would become ill-formed. While it faithfully
   reflects the nature of the input, Unicode conformance requires that
   encoding form conversion always results in a valid data stream.
   Therefore a converter must treat this as an error.

(fwiw, this is the third hit on Google for "utf-8 surrogates" right 
after the Wikipedia articles on UTF-8 and UTF-16, so it's not like it's 
hard to find this information).
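
To make it concrete, here is where those bytes come from (a sketch in 
JS, not a conformant encoder): apply the generic 3-byte UTF-8 pattern 
1110xxxx 10xxxxxx 10xxxxxx to the unpaired surrogate 0xDC08 and you 
mechanically get the bytes quoted above, which is precisely the 
sequence a conformant converter has to reject rather than produce:

  var cp = 0xDC08;              // unpaired surrogate code point
  var bytes = [
    0xE0 | (cp >> 12),          // 0xED
    0x80 | ((cp >> 6) & 0x3F),  // 0xB0
    0x80 | (cp & 0x3F)          // 0x88
  ];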

>     No, you're allowing storage of some sort of number arrays that don't
>     represent Unicode strings at all.
>
> No, if I understand Allen's proposal correctly, we're allowing storage
> of some sort of number arrays that may contain reserved code points,
> some of which cannot be represented in UTF-16.

See above.  You're allowing number arrays that may or may not be 
interpretable as Unicode strings, period.

> This isn't that different from the status quo; it is possible right now
> to generate JS Strings which are not valid UTF-16 by creating invalid
> surrogate pairs.

True.  However, right now no one is pretending that strings are anything 
other than arrays of 16-bit units.
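
For example (purely illustrative):

  var s = "a\uD800b";   // lone high surrogate in the middle
  s.length;             // 3: three 16-bit units, but not valid UTF-16

That's a perfectly fine ES string today, as long as you think of it as 
a sequence of 16-bit units rather than as UTF-16 text.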

> Keep in mind, also, that even a sequence of random bytes is a valid
> Unicode string. The standard does not require that they be well-formed.
> (D80)

Uh...  A sequence of _bytes_ is not anything related to Unicode unless 
you know how it's encoded.

Not sure what "(D80)" is supposed to mean.

>     Right, so if it's looking for non-BMP characters in the string, say,
>     instead of computing the length, it won't find them.  How the heck
>     is that "just works"?
>
> My untested hypothesis is that the vast majority of JS code looking for
> non-BMP characters is looking for them in order to call them out for
> special processing, because the code unit and code point size are
> different.  When they don't need special processing, they don't need to
> be found.

This hypothesis is worth testing before being blindly inflicted on the web.
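
(For concreteness, the kind of code at issue looks something like the 
following; a sketch only, with str as some input string and handleNonBMP 
standing in for whatever special processing the caller wants:)

  for (var i = 0; i < str.length; i++) {
    var c = str.charCodeAt(i);
    if (c >= 0xD800 && c <= 0xDBFF && i + 1 < str.length) {
      var d = str.charCodeAt(i + 1);
      if (d >= 0xDC00 && d <= 0xDFFF) {
        handleNonBMP(str, i);   // hypothetical special processing
        i++;                    // skip the trailing surrogate
      }
    }
  }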

>     What would that even mean?  DOMString is defined to be an ES string
>     in the ES binding right now.  Is the proposal to have some other
>     kind of object for DOMString (so that, for example, String.prototype
>     would no longer affect the behavior of DOMString the way it does now)?
>
> Wait, are DOMStrings formally UTF-16, or are they ES Strings?

DOMStrings are formally UTF-16 in the DOM spec.

They are defined to be ES strings in the ES binding for the DOM.

Please be careful to not confuse the DOM and its language bindings.

One could change the ES binding to use a non-ES-string object to 
preserve the DOM's requirement that strings be sequences of UTF-16 code 
units.  I'd expect this would break the web unless one is really careful 
doing it...

>     How is that different from sticking non-UTF-16 into an ES string
>     right now?
>
> Currently, JS Strings are effectively arrays of 16-bit code units, which
> are indistinguishable from 16-bit Unicode strings

Yes.

> (D82)

?

> This means that a JS application can use JS Strings as arrays of uint16, and expect
> to be able to round-trip all strings, even those which are not
> well-formed, through a UTF-16 DOM.

Yep.  And they do.
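
The pattern is roughly this (a sketch; assume node is some DOM text 
node or attribute the page already has):

  var data = String.fromCharCode(0x0041, 0xDC08, 0xD800);  // not valid UTF-16
  node.textContent = data;
  node.textContent === data;   // expected: true, same 16-bit units come back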

> If we redefine JS Strings to be arrays of Unicode code points, then the
> JS application can use JS Strings as arrays of uint21 -- but round-tripping
> the high-surrogate code points through a UTF-16 layer would not work.

OK, that seems like a breaking change.

>         It might mean extra copying, or it might not if the DOM
>         implementation already uses
>         UTF-8 internally.
>
>     Uh... what does UTF-8 have to do with this?
>
> If you're already storing UTF-8 strings internally, then you are already
> doing something "expensive" (like copying) to get their code units into
> and out of JS

Maybe, and maybe not.  We (Mozilla) have had some proposals to actually 
use UTF-8 throughout, including in the JS engine; it's quite possible to 
implement an API that looks like a 16-bit array on top of UTF-8, as long 
as you allow the (technically invalid) UTF-8 sequences needed to 
represent lone surrogates and the like.
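
A rough sketch of the representation idea (in JS for readability; this 
is not how any engine actually does it, and random access and error 
handling are omitted): store strings as UTF-8-ish bytes, deliberately 
allowing the 3-byte encodings of lone surrogates, and decode back out 
to 16-bit units on demand.

  function codeUnitsFromBytes(bytes) {
    var units = [];
    for (var i = 0; i < bytes.length; ) {
      var b = bytes[i], cp;
      if (b < 0x80) {
        cp = b; i += 1;
      } else if (b < 0xE0) {
        cp = ((b & 0x1F) << 6) | (bytes[i + 1] & 0x3F); i += 2;
      } else if (b < 0xF0) {
        // 3-byte sequences: BMP characters, and (on purpose) lone surrogates
        cp = ((b & 0x0F) << 12) | ((bytes[i + 1] & 0x3F) << 6) |
             (bytes[i + 2] & 0x3F);
        i += 3;
      } else {
        cp = ((b & 0x07) << 18) | ((bytes[i + 1] & 0x3F) << 12) |
             ((bytes[i + 2] & 0x3F) << 6) | (bytes[i + 3] & 0x3F);
        i += 4;
      }
      if (cp > 0xFFFF) {
        // supplementary code point: expose it as a surrogate pair
        cp -= 0x10000;
        units.push(0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF));
      } else {
        units.push(cp);
      }
    }
    return units;
  }

  codeUnitsFromBytes([0xED, 0xB0, 0x88]);   // [0xDC08], the lone surrogate again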

>     (As a note, Gecko and WebKit both use UTF-16 internally; I would be
>     _really_ surprised if Trident does not.  No idea about Presto.)
>
> FWIW - last I time I scanned the v8 sources, it appeared to use a
> three-representation class, which could store either ASCII, UCS2, or
> UTF-8.  Presumably ASCII could also be ISO-Latin-1, as both are exact,
> naive, byte-sized UCS2/UTF-16 subsets.

There's a difference between internal representation and what things 
look like.  For example, Gecko stores DOM text nodes as either ASCII or 
UTF-16 in practice, but always makes them look like UTF-16 to 
non-internal consumers....

There's also a possible difference, as you just noted, between what the 
ES implementation uses and what the DOM uses; certainly in the WebKit+V8 
case, but also in the Gecko+Spidermonkey case when text nodes are 
involved, etc.

I was talking about what the DOM implementations do, not the ES 
implementations.

-Boris

