Full Unicode strings strawman

Allen Wirfs-Brock allen at wirfs-brock.com
Mon May 16 15:07:00 PDT 2011


On May 16, 2011, at 2:16 PM, Mike Samuel wrote:

> 2011/5/16 Boris Zbarsky <bzbarsky at mit.edu>:
>> On 5/16/11 4:37 PM, Mike Samuel wrote:
>>> 
>>> 
> 
>> There is no Unicode codepoint U+D800 or U+DC00.  See
>> http://www.unicode.org/charts/PDF/UD800.pdf and
>> http://www.unicode.org/charts/PDF/UDC00.pdf which clearly say that there are
>> no Unicode characters with those codepoints.
> 
> Correct.
> The strawman says
> 
> "The String type is the set of all finite ordered sequences of zero or
> more 21-bit unsigned integer values (“elements”)."
> 
> There is no exclusion for invalid code-points, so I was assuming when
> Allen talked about an encodeUTF16 function that he was purposely
> fuzzing the term "codepoint" to include the entire range, and that
> encodeUTF16(oneSupplemental).charCodeAt(0) === 0xd800.

Correct: in my proposal, ES string elements are 21-bit values. All possible values are usable even though some are not valid Unicode code points. We may not yet have a clear common language for referring to such element values. In current ES we call them "character codes", but we need to be careful about carrying that terminology forward because it occurs in APIs that depend upon character codes being 16-bit values.

encodeUTF16 is a Unicode-domain-specific function. It would need to define what it does when it encounters a "character code" that is not a valid code point.
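
As a rough illustration only, and not part of the strawman, one possible shape for such a function is sketched below. Since today's ES strings have 16-bit elements, the sketch models a string of 21-bit elements as a plain array of integers and returns an array of 16-bit code units. The `onInvalid` parameter and its "throw" and "replace" policies are assumptions invented for this sketch; the proposal does not say which behavior encodeUTF16 would choose.

```javascript
// Hypothetical sketch, not the strawman's definition of encodeUTF16.
// A string of 21-bit elements is modeled as an array of integers.
// The onInvalid policy ("throw" or "replace") is an assumed parameter.
function encodeUTF16(elements, onInvalid = "throw") {
  const units = [];
  for (const el of elements) {
    // Every element must at least be a 21-bit unsigned integer.
    if (!Number.isInteger(el) || el < 0 || el > 0x1FFFFF) {
      throw new RangeError("not a 21-bit element value: " + String(el));
    }
    // Surrogate values and values above U+10FFFF cannot be encoded as UTF-16.
    const isSurrogate = el >= 0xD800 && el <= 0xDFFF;
    if (isSurrogate || el > 0x10FFFF) {
      if (onInvalid === "replace") {
        units.push(0xFFFD); // U+FFFD REPLACEMENT CHARACTER
        continue;
      }
      throw new RangeError("cannot encode element 0x" + el.toString(16));
    }
    if (el < 0x10000) {
      // BMP values map to a single 16-bit code unit.
      units.push(el);
    } else {
      // Supplementary values map to a high/low surrogate pair.
      const offset = el - 0x10000;
      units.push(0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF));
    }
  }
  return units;
}
```

Under these assumptions, encodeUTF16([0x10000])[0] is 0xD800, which matches the expectation Mike describes above, while encodeUTF16([0xD800]) throws unless the caller asks for replacement.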

Allen