Full Unicode strings strawman

Mike Samuel mikesamuel at gmail.com
Mon May 16 14:16:10 PDT 2011


2011/5/16 Boris Zbarsky <bzbarsky at mit.edu>:
> On 5/16/11 4:37 PM, Mike Samuel wrote:
>>
>> You might have.  If you reject my assertion about option 2 above, then
>> to clarify,
>> The UTF-16 representation of codepoint U+10000 is the code-unit pair
>> U+D8000 U+DC000.
>
> No.  The UTF-16 representation of codepoint U+10000 is the code-unit pair
> 0xD800 0xDC00.  These are 16-bit unsigned integers, NOT Unicode characters
> (which is what the U+NNNNN notation means).

My apologies for abusing notation.
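
For reference, the arithmetic behind that pair can be sketched as
follows.  This is only an illustration: the helper name
toSurrogatePair is made up for this message and does not appear in
the strawman.

```javascript
// Hypothetical helper: split a supplementary code point
// (0x10000..0x10FFFF) into its UTF-16 high and low surrogate code units.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("not a supplementary code point");
  }
  var offset = codePoint - 0x10000;     // 20-bit value
  var high = 0xD800 + (offset >> 10);   // top 10 bits
  var low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return [high, low];
}

var pair = toSurrogatePair(0x10000);
console.log(pair[0].toString(16), pair[1].toString(16)); // d800 dc00
```

The results are plain 16-bit integers, which is why 0xD800 0xDC00 is
the right way to write them.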

>> The UTF-16 representation of codepoint U+D8000 is the single code-unit
>> U+D8000 and similarly for U+DC00.
>
> I'm assuming you meant U+D800 in the first two code-units there.

Yes.

> There is no Unicode codepoint U+D800 or U+DC00.  See
> http://www.unicode.org/charts/PDF/UD800.pdf and
> http://www.unicode.org/charts/PDF/UDC00.pdf which clearly say that there are
> no Unicode characters with those codepoints.

Correct.
The strawman says:

"The String type is the set of all finite ordered sequences of zero or
more 21-bit unsigned integer values (“elements”)."

There is no exclusion for invalid code points, so when Allen talked
about an encodeUTF16 function, I assumed he was deliberately stretching
the term "codepoint" to cover the entire 21-bit range, and that
encodeUTF16(oneSupplemental).charCodeAt(0) === 0xd800.
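
To make that assumption concrete, here is a rough sketch of what I had
in mind.  The strawman does not define encodeUTF16 this way; the
array-of-integers input, the pass-through of surrogate values, and the
range check are all my own assumptions.

```javascript
// Assumed behavior, not taken from the strawman: `elements` is an array
// of unsigned integers standing in for a 21-bit-element string, and
// surrogate values (0xD800..0xDFFF) are passed through unchanged.
function encodeUTF16(elements) {
  var units = [];
  for (var i = 0; i < elements.length; i++) {
    var e = elements[i];
    if (e > 0x10FFFF) {
      throw new RangeError("element outside the Unicode range");
    }
    if (e >= 0x10000) {
      var offset = e - 0x10000;
      units.push(0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF));
    } else {
      units.push(e); // includes 0xD800..0xDFFF, with no exclusion
    }
  }
  return String.fromCharCode.apply(null, units);
}

var oneSupplemental = [0x10000];
console.log(encodeUTF16(oneSupplemental).charCodeAt(0) === 0xd800); // true
console.log(encodeUTF16([0xD800]).length); // 1
```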


>> How can the codepoints U+D800 U+DC00 be distinguished in a DOMString
>> implementation that uses UTF-16 under the hood from the codepoint
>> U+10000?
>
> They don't have to be; if 0xD800 0xDC00 are present (in that order) then
> they encode U+10000.  If they're present on their own, it's not a valid
> UTF-16 string, hence not a valid DOMString and some sort of error-handling
> behavior (which presumably needs defining) needs to take place.
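
A check for the ill-formed case described above might look like the
sketch below.  The function name is hypothetical, and what should
happen once an unpaired surrogate is found is exactly the
error-handling behavior that still needs defining.

```javascript
// Hypothetical check: return the index of the first unpaired surrogate
// code unit in a JS string, or -1 if the string is well-formed UTF-16.
function findLoneSurrogate(s) {
  for (var i = 0; i < s.length; i++) {
    var unit = s.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // A high surrogate must be followed by a low surrogate.
      var next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // skip the low half of a valid pair
        continue;
      }
      return i;
    }
    if (unit >= 0xDC00 && unit <= 0xDFFF) {
      return i; // low surrogate with no preceding high surrogate
    }
  }
  return -1;
}

console.log(findLoneSurrogate("\uD800\uDC00")); // -1
console.log(findLoneSurrogate("a\uD800b"));     // 1
console.log(findLoneSurrogate("\uDC00\uD800")); // 0
```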

> That said, defining JS strings and DOMString differently seems like a recipe
> for serious author confusion (e.g. actually using JS strings as the
> DOMString binding in ES might be lossy, assigning from JS strings to
> DOMString might be lossy, etc).  It's a minefield.

Agreed.  It is a minefield, and one that would benefit from explicit
treatment in the strawman.
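
One concrete instance of that lossiness, sketched under the assumption
that surrogate values are permitted as string elements.  Both function
names are hypothetical, and neither is meant as the strawman's
definition of the conversion.

```javascript
// Hypothetical conversion from 21-bit elements to UTF-16 code units.
// Surrogate values are passed through unchanged (an assumption).
function toUTF16Units(elements) {
  var units = [];
  for (var i = 0; i < elements.length; i++) {
    var e = elements[i];
    if (e >= 0x10000) {
      var offset = e - 0x10000;
      units.push(0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF));
    } else {
      units.push(e);
    }
  }
  return units;
}

// Hypothetical inverse: combine each high/low surrogate pair into one
// element and pass every other code unit through.
function fromUTF16Units(units) {
  var elements = [];
  for (var i = 0; i < units.length; i++) {
    var hi = units[i];
    var lo = units[i + 1]; // undefined at the end, so the test below fails
    if (hi >= 0xD800 && hi <= 0xDBFF && lo >= 0xDC00 && lo <= 0xDFFF) {
      elements.push(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00));
      i++;
    } else {
      elements.push(hi);
    }
  }
  return elements;
}

var original = [0xD800, 0xDC00]; // two elements
var roundTripped = fromUTF16Units(toUTF16Units(original));
console.log(roundTripped.length);          // 1
console.log(roundTripped[0].toString(16)); // 10000
```

Two elements go in and one comes out, so the two-element string and
the one-element string U+10000 are indistinguishable once they have
passed through a UTF-16 DOMString.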

> -Boris

