Full Unicode strings strawman
wes at page.ca
Tue May 17 11:12:55 PDT 2011
On 17 May 2011 12:36, Boris Zbarsky <bzbarsky at mit.edu> wrote:
>> Not quite: code points D800-DFFF are reserved code points which are not
>> representable with UTF-16.
> Nor with any other Unicode encoding, really. They don't represent, on
> their own, Unicode characters.
Right - but they are still legitimate code points, and they fill out the
space required to let us treat String as uint16 when defining the backing
store as "something that maps to the set of all Unicode code points".
That said, you can encode these code points with UTF-8; for example, 0xDC08
becomes 0xED 0xB0 0x88.
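To make that arithmetic concrete, here is a small sketch of the generalized
UTF-8 bit-packing for a single code point (strictly speaking, well-formed
UTF-8 excludes the surrogate range, so this is CESU-8/WTF-8 territory; the
function name is mine, not anything from the thread):

```javascript
// Generalized UTF-8 encoding of one code point, including surrogates.
// Well-formed UTF-8 forbids D800-DFFF; this sketch encodes them anyway
// to show where bytes like 0xED 0xB0 0x88 come from.
function encodeCodePoint(cp) {
  if (cp < 0x80) return [cp];
  if (cp < 0x800) return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)];
  if (cp < 0x10000)
    return [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)];
  return [0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
          0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)];
}

encodeCodePoint(0xDC08); // [0xED, 0xB0, 0x88]
```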
> No, you're allowing storage of some sort of number arrays that don't
> represent Unicode strings at all.
No, if I understand Allen's proposal correctly, we're allowing storage of
some sort of number arrays that may contain reserved code points, some of
which cannot be represented in UTF-16.
This isn't that different from the status quo; it is possible right now to
generate JS Strings which are not valid UTF-16 by creating invalid surrogate
pairs.
Keep in mind, also, that even a sequence of random bytes is a valid Unicode
string. The standard does not require that they be well-formed. (D80)
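That status quo is easy to demonstrate; a minimal sketch:

```javascript
// A lone (unpaired) high surrogate: a perfectly legal JS string of one
// 16-bit code unit, but not well-formed UTF-16.
var lone = String.fromCharCode(0xD800);
lone.length;        // 1
lone.charCodeAt(0); // 0xD800 -- an unpaired surrogate
```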
> Right, so if it's looking for non-BMP characters in the string, say,
> instead of computing the length, it won't find them. How the heck is that
> "just works"?
My untested hypothesis is that the vast majority of JS code looking for
non-BMP characters is looking for them in order to call them out for special
processing, because the code unit and code point size are different. When
they don't need special processing, they don't need to be found. Since the
high-surrogate code points do not appear in well-formed Unicode strings,
they will not be found, and the unneeded special processing will not
happen. This train of clauses forms the basis for my opinion that, for the
majority of folks, things will "just work".
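As an illustration of the kind of code I have in mind (my own sketch, not
anything quoted in this thread): counting code points requires noticing
surrogate pairs, and in a well-formed string lone surrogates never occur, so
the special-case branch only fires when it is actually needed.

```javascript
// Count code points in a JS string by collapsing surrogate pairs.
// Typical of code that hunts for non-BMP characters today.
function countCodePoints(s) {
  var n = 0;
  for (var i = 0; i < s.length; i++) {
    var c = s.charCodeAt(i);
    if (c >= 0xD800 && c <= 0xDBFF && i + 1 < s.length) {
      var d = s.charCodeAt(i + 1);
      if (d >= 0xDC00 && d <= 0xDFFF) i++; // skip the low half of the pair
    }
    n++;
  }
  return n;
}

countCodePoints("a\uD835\uDC9C"); // 2: "a" plus U+1D49C (one pair)
```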
> What would that even mean? DOMString is defined to be an ES string in the
> ES binding right now. Is the proposal to have some other kind of object for
> DOMString (so that, for example, String.prototype would no longer affect the
> behavior of DOMString the way it does now)?
Wait, are DOMStrings formally UTF-16, or are they ES Strings?
>> This might mean that it is possible that
>> JSString=>DOMString would throw, as full Unicode Strings could contain
>> code points which are not representable in UTF-16.
> How is that different from sticking non-UTF-16 into an ES string right now?
Currently, JS Strings are effectively arrays of 16-bit code units, which are
indistinguishable from 16-bit Unicode strings (D82). This means that a JS
application can use JS Strings as arrays of uint16, and expect to be able to
round-trip all strings, even those which are not well-formed, through a
UTF-16 layer.
If we redefine JS Strings to be arrays of Unicode code points, then the JS
application can use JS Strings as arrays of uint21 -- but round-tripping the
high-surrogate code points through a UTF-16 layer would not work.
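An analogous failure can be demonstrated with today's TextEncoder/TextDecoder
APIs (which postdate this thread; the example assumes a strict encoder, which
is what the WHATWG APIs provide): a lone surrogate does not survive the trip.

```javascript
// A strict UTF-8 encode/decode layer does not round-trip a lone
// surrogate: TextEncoder replaces it with U+FFFD on the way out.
var lone = "\uD800";
var bytes = new TextEncoder().encode(lone); // [0xEF, 0xBF, 0xBD]
var out = new TextDecoder().decode(bytes);
out === "\uFFFD"; // true -- the surrogate was lost in transit
```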
>> It might mean extra copying, or it might not if the DOM implementation
>> already uses
>> UTF-8 internally.
> Uh... what does UTF-8 have to do with this?
If you're already storing UTF-8 strings internally, then you are already
doing something "expensive" (like copying) to get their code units into and
out of JS; so there is no incremental perf impact from not having a common
UTF-16 representation.
> (As a note, Gecko and WebKit both use UTF-16 internally; I would be
> _really_ surprised if Trident does not. No idea about Presto.)
FWIW - last time I scanned the v8 sources, it appeared to use a
three-representation class, which could store either ASCII, UCS2, or UTF-8.
Presumably ASCII could also be ISO-Latin-1, as both are exact, naive,
byte-sized UCS2/UTF-16 subsets.
Wesley W. Garland
Director, Product Development
+1 613 542 2787 x 102