Full Unicode strings strawman
Mike Samuel
mikesamuel at gmail.com
Mon May 16 14:16:10 PDT 2011
2011/5/16 Boris Zbarsky <bzbarsky at mit.edu>:
> On 5/16/11 4:37 PM, Mike Samuel wrote:
>>
>> You might have. If you reject my assertion about option 2 above, then
>> to clarify,
>> The UTF-16 representation of codepoint U+10000 is the code-unit pair
>> U+D8000 U+DC000.
>
> No. The UTF-16 representation of codepoint U+10000 is the code-unit pair
> 0xD800 0xDC00. These are 16-bit unsigned integers, NOT Unicode characters
> (which is what the U+NNNNN notation means).
My apologies for abusing notation.
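For concreteness, here is a minimal sketch of the arithmetic behind that pair. The helper name toSurrogatePair is made up for illustration and does not come from the strawman; it only handles supplementary code points.

```javascript
// Hypothetical helper (not from the strawman): split a supplementary
// code point into its two UTF-16 code units, returned as plain integers.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("not a supplementary code point");
  }
  const offset = codePoint - 0x10000;     // 20 bits of payload
  const high = 0xD800 + (offset >> 10);   // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x10000);
console.log(high.toString(16), low.toString(16)); // d800 dc00
```

The results are 16-bit unsigned integers, which is why they are written 0xD800 and 0xDC00 rather than in U+ notation.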
>> The UTF-16 representation of codepoint U+D8000 is the single code-unit
>> U+D8000 and similarly for U+DC00.
>
> I'm assuming you meant U+D800 in the first two code-units there.
Yes.
> There is no Unicode codepoint U+D800 or U+DC00. See
> http://www.unicode.org/charts/PDF/UD800.pdf and
> http://www.unicode.org/charts/PDF/UDC00.pdf which clearly say that there are
> no Unicode characters with those codepoints.
Correct.
The strawman says
"The String type is the set of all finite ordered sequences of zero or
more 21-bit unsigned integer values (“elements”)."
There is no exclusion for invalid code points, so I assumed that when
Allen talked about an encodeUTF16 function he was purposely
stretching the term "codepoint" to cover the entire range, and that
encodeUTF16(oneSupplemental).charCodeAt(0) === 0xd800.
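A rough sketch of that reading follows. The strawman does not give this code. The array-of-integers input and the pass-through of surrogate values are my assumptions about what such an encodeUTF16 would do, and the upper bound of 0x10FFFF is an assumption too, since UTF-16 cannot represent anything larger.

```javascript
// Sketch only: an assumed encodeUTF16 that takes an array of integer
// "elements" and returns a string of 16-bit code units.
function encodeUTF16(elements) {
  let out = "";
  for (const el of elements) {
    if (!Number.isInteger(el) || el < 0 || el > 0x10FFFF) {
      throw new RangeError("element not representable in UTF-16");
    }
    if (el < 0x10000) {
      // Assumption: surrogate values 0xD800..0xDFFF are not excluded,
      // so they pass through as a single code unit.
      out += String.fromCharCode(el);
    } else {
      const offset = el - 0x10000;
      out += String.fromCharCode(0xD800 + (offset >> 10),
                                 0xDC00 + (offset & 0x3FF));
    }
  }
  return out;
}

const oneSupplemental = [0x10000];
console.log(encodeUTF16(oneSupplemental).charCodeAt(0) === 0xd800); // true
// Two distinct element sequences produce the same code units:
console.log(encodeUTF16([0xD800, 0xDC00]) === encodeUTF16([0x10000])); // true
```

Under these assumptions the encoding is not reversible for sequences that contain an adjacent high and low surrogate.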
>> How can the codepoints U+D800 U+DC00 be distinguished in a DOMString
>> implementation that uses UTF-16 under the hood from the codepoint
>> U+10000?
>
> They don't have to be; if 0xD800 0xDC00 are present (in that order) then
> they encode U+10000. If they're present on their own, it's not a valid
> UTF-16 string, hence not a valid DOMString and some sort of error-handling
> behavior (which presumably needs defining) needs to take place.
> That said, defining JS strings and DOMString differently seems like a recipe
> for serious author confusion (e.g. actually using JS strings as the
> DOMString binding in ES might be lossy, assigning from JS strings to
> DOMString might be lossy, etc). It's a minefield.
Agreed. It is a minefield and one that could benefit from treatment
in the strawman.
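As an illustration of where the lossiness would show up, here is a minimal sketch of a check for unpaired surrogates in a JS string. The name isWellFormedUTF16 is hypothetical, and what a DOMString binding should actually do when the check fails is exactly the error handling that still needs defining.

```javascript
// Hypothetical check (not from any spec text): returns true only if
// every surrogate code unit in str is part of a high+low pair.
function isWellFormedUTF16(str) {
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // High surrogate: must be followed by a low surrogate.
      if (i + 1 >= str.length) return false;
      const next = str.charCodeAt(i + 1);
      if (next < 0xDC00 || next > 0xDFFF) return false;
      i++; // skip the low half of the pair
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      // Low surrogate with no preceding high surrogate.
      return false;
    }
  }
  return true;
}

console.log(isWellFormedUTF16("\uD800\uDC00")); // true
console.log(isWellFormedUTF16("\uD800"));       // false
```

Any string for which this returns false is one a strict UTF-16 DOMString could not hold without some conversion rule.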
> -Boris