Full Unicode strings strawman
Boris Zbarsky
bzbarsky at MIT.EDU
Mon May 16 14:07:24 PDT 2011
On 5/16/11 4:37 PM, Mike Samuel wrote:
> You might have. If you reject my assertion about option 2 above, then
> to clarify,
> The UTF-16 representation of codepoint U+10000 is the code-unit pair
> U+D8000 U+DC000.
No. The UTF-16 representation of codepoint U+10000 is the code-unit
pair 0xD800 0xDC00. These are 16-bit unsigned integers, NOT Unicode
characters (which is what the U+NNNNN notation means).
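
As a minimal sketch of the arithmetic behind that pair: the helper below
splits a supplementary code point into its two 16-bit code units. The
function name toSurrogatePair is hypothetical and is not part of any
spec or proposal discussed in this thread.

```javascript
// Hypothetical helper (not from the thread): split a supplementary
// code point (U+10000..U+10FFFF) into its UTF-16 code-unit pair.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("not a supplementary code point");
  }
  const offset = codePoint - 0x10000;     // 20-bit value
  const high = 0xD800 + (offset >> 10);   // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x10000);
console.log(high.toString(16), low.toString(16)); // d800 dc00
```

The results are plain integers in the range 0x0000..0xFFFF, which is why
they are written with 0x rather than U+.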
> The UTF-16 representation of codepoint U+D8000 is the single code-unit
> U+D8000 and similarly for U+DC00.
I'm assuming you meant U+D800 in the first two code-units there.
There is no Unicode character U+D800 or U+DC00. See
http://www.unicode.org/charts/PDF/UD800.pdf and
http://www.unicode.org/charts/PDF/UDC00.pdf which clearly say that there
are no Unicode characters with those codepoints.
> How can the codepoints U+D800 U+DC00 be distinguished in a DOMString
> implementation that uses UTF-16 under the hood from the codepoint
> U+10000?
They don't have to be; if 0xD800 0xDC00 are present (in that order) then
they encode U+10000. If either is present on its own, the string is not
valid UTF-16, hence not a valid DOMString, and some sort of
error-handling behavior (which presumably needs defining) has to take
place.
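
The rule above can be sketched as a strict decoder over an array of
16-bit code units. The name decodeUtf16Strict is hypothetical, and
throwing on a lone surrogate is only one possible policy, assumed here
for illustration; the actual error handling is exactly what remains
undefined.

```javascript
// Hypothetical strict decoder: maps 16-bit code units to code points.
// Assumption: a lone surrogate is treated as an error and throws.
function decodeUtf16Strict(units) {
  const codePoints = [];
  for (let i = 0; i < units.length; i++) {
    const unit = units[i];
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // High surrogate: must be followed by a low surrogate.
      const next = units[i + 1];
      if (next >= 0xDC00 && next <= 0xDFFF) {
        codePoints.push(0x10000 + ((unit - 0xD800) << 10) + (next - 0xDC00));
        i++; // consume the low surrogate as well
        continue;
      }
      throw new Error("lone high surrogate at index " + i);
    }
    if (unit >= 0xDC00 && unit <= 0xDFFF) {
      throw new Error("lone low surrogate at index " + i);
    }
    codePoints.push(unit); // ordinary BMP code unit
  }
  return codePoints;
}

console.log(decodeUtf16Strict([0xD800, 0xDC00])[0].toString(16)); // 10000
```

With this policy, decodeUtf16Strict([0xD800]) and
decodeUtf16Strict([0xDC00]) both throw instead of producing a code point.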
That said, defining JS strings and DOMString differently seems like a
recipe for serious author confusion (e.g. actually using JS strings as
the DOMString binding in ES might be lossy, assigning from JS strings to
DOMString might be lossy, etc). It's a minefield.
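
One way to see this kind of loss, as a sketch only: the snippet below
assumes a runtime that provides TextEncoder and TextDecoder (current
browsers, Node.js), which postdate this thread and are not the DOMString
binding itself. They stand in for any layer that accepts only
well-formed Unicode; TextEncoder replaces a lone surrogate with U+FFFD.

```javascript
// Assumption: TextEncoder/TextDecoder stand in for a layer that only
// accepts well-formed Unicode. They are not the DOMString binding.
function roundTrip(str) {
  return new TextDecoder().decode(new TextEncoder().encode(str));
}

const lone = "\uD800";        // a legal JS string, but not valid UTF-16
const pair = "\uD800\uDC00";  // the valid encoding of U+10000

console.log(roundTrip(lone) === lone);      // false: the value changed
console.log(roundTrip(lone) === "\uFFFD");  // true: replaced by U+FFFD
console.log(roundTrip(pair) === pair);      // true: valid pairs survive
```

The lone surrogate cannot be recovered after the round trip, which is
the sort of silent change the paragraph above warns about.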
-Boris