Full Unicode strings strawman

Boris Zbarsky bzbarsky at MIT.EDU
Mon May 16 14:07:24 PDT 2011


On 5/16/11 4:37 PM, Mike Samuel wrote:
> You might have.  If you reject my assertion about option 2 above, then
> to clarify,
> The UTF-16 representation of codepoint U+10000 is the code-unit pair
> U+D8000 U+DC000.

No.  The UTF-16 representation of codepoint U+10000 is the code-unit 
pair 0xD800 0xDC00.  These are 16-bit unsigned integers, NOT Unicode 
characters (which is what the U+NNNNN notation means).
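
To make the arithmetic concrete, here is a minimal sketch of how a
codepoint maps to UTF-16 code units. The helper name toUtf16CodeUnits
is hypothetical and used only for illustration; it is not part of any
spec or proposal discussed in this thread.

```javascript
// Hypothetical helper (illustration only): encode one Unicode codepoint
// as an array of 16-bit UTF-16 code units.
function toUtf16CodeUnits(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError("not a Unicode codepoint");
  }
  // 0xD800..0xDFFF is reserved for surrogates and cannot itself be encoded.
  if (codePoint >= 0xD800 && codePoint <= 0xDFFF) {
    throw new RangeError("surrogate range has no UTF-16 encoding");
  }
  // Codepoints below 0x10000 fit in a single code unit.
  if (codePoint < 0x10000) {
    return [codePoint];
  }
  // Everything else is split into two 10-bit halves.
  const offset = codePoint - 0x10000;      // 20-bit value
  const high = 0xD800 + (offset >> 10);    // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);   // bottom 10 bits
  return [high, low];
}
```

With this sketch, toUtf16CodeUnits(0x10000) yields the two integers
0xD800 and 0xDC00, matching the pair above.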

> The UTF-16 representation of codepoint U+D8000 is the single code-unit
> U+D8000 and similarly for U+DC00.

I'm assuming you meant U+D800 in the first two code-units there.

There is no Unicode character at codepoint U+D800 or U+DC00; both
values are reserved as surrogates.  See
http://www.unicode.org/charts/PDF/UD800.pdf and
http://www.unicode.org/charts/PDF/UDC00.pdf which clearly say that there
are no Unicode characters with those codepoints.

> How can the codepoints U+D800 U+DC00 be distinguished in a DOMString
> implementation that uses UTF-16 under the hood from the codepoint
> U+10000?

They don't have to be; if 0xD800 0xDC00 are present (in that order) then
they encode U+10000.  If either one is present on its own, the string is
not valid UTF-16, hence not a valid DOMString, and some sort of
error-handling behavior (which presumably needs defining) needs to take
place.
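
As a rough sketch of the validity check described above, assuming the
string is given as an array of 16-bit code units (the helper name
findLoneSurrogate is hypothetical, and what to do on failure is left
open, as it is above):

```javascript
// Hypothetical helper (illustration only): return the index of the first
// unpaired surrogate in an array of UTF-16 code units, or -1 if every
// surrogate is part of a well-formed high/low pair.
function findLoneSurrogate(units) {
  for (let i = 0; i < units.length; i++) {
    const u = units[i];
    if (u >= 0xD800 && u <= 0xDBFF) {
      // High surrogate: must be followed immediately by a low surrogate.
      const next = units[i + 1];
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // consume the low half of the pair
        continue;
      }
      return i;
    }
    if (u >= 0xDC00 && u <= 0xDFFF) {
      // Low surrogate with no preceding high surrogate.
      return i;
    }
  }
  return -1;
}
```

Under this sketch, [0xD800, 0xDC00] passes (it encodes U+10000), while
[0xD800] alone or [0xDC00, 0xD800] in the wrong order are both flagged
at index 0.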

That said, defining JS strings and DOMString differently seems like a 
recipe for serious author confusion (e.g. actually using JS strings as 
the DOMString binding in ES might be lossy, assigning from JS strings to 
DOMString might be lossy, etc).  It's a minefield.

-Boris


More information about the es-discuss mailing list