Full Unicode strings strawman

Shawn Steele Shawn.Steele at microsoft.com
Tue May 17 12:04:25 PDT 2011


> Right - but they are still legitimate code points, and they fill out the space required to let us treat String as uint16[] when defining the backing store as "something that maps to the set of all Unicode code points".

> That said, you can encode these code points with utf-8; for example, 0xdc08 becomes 0xed 0xb0 0x88.

No, you're allowing storage of some sort of number arrays that don't represent Unicode strings at all.

Code points != encoding.  High and low surrogates are legal code points, but in UTF-16 they are only legitimate when they occur as a proper pair (a high surrogate immediately followed by a low surrogate).  If they aren’t in a proper pair, they’re illegal.  They are always illegal in UTF-32 and UTF-8.  There are other code points that shouldn’t be used for interchange in Unicode either: U+xxFFFE and U+xxFFFF, for example.  It’s orthogonal to the other question, but the documentation should clearly tell users not to pretend binary data is character data when it isn’t.  That leads to all sorts of crazy stuff, like illegal lone surrogates being illegally encoded in UTF-8.
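
To make the distinction concrete, here is a small Python sketch (not part of the proposal under discussion). The helper names `is_surrogate`, `is_noncharacter`, `utf8_or_none`, and `generalized_utf8` are hypothetical and exist only for illustration. It assumes Python 3, whose strict UTF-8 codec rejects surrogates and whose `surrogatepass` error handler produces the byte sequence quoted above.

```python
def is_surrogate(cp: int) -> bool:
    """True for code points in the surrogate range U+D800..U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF


def is_noncharacter(cp: int) -> bool:
    """True for U+xxFFFE and U+xxFFFF, the last two code points of each plane.

    Only the noncharacters mentioned in this message are checked here.
    """
    return (cp & 0xFFFE) == 0xFFFE


def utf8_or_none(cp: int):
    """Encode one code point as strict UTF-8, or return None if that is illegal."""
    try:
        return chr(cp).encode("utf-8")
    except UnicodeEncodeError:
        return None


def generalized_utf8(cp: int) -> bytes:
    """Apply the UTF-8 bit layout even to surrogates (not valid UTF-8)."""
    return chr(cp).encode("utf-8", "surrogatepass")


if __name__ == "__main__":
    lone = 0xDC08
    print(is_surrogate(lone))            # a legal code point, but a surrogate
    print(utf8_or_none(lone))            # strict UTF-8 refuses it
    print(generalized_utf8(lone).hex())  # the bytes from the quoted example
    print(is_noncharacter(0x1FFFE))      # not meant for interchange
```

The strict encoder refusing U+DC08 is the point: the bytes 0xed 0xb0 0x88 can be produced mechanically, but a conforming UTF-8 decoder will reject them.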

-Shawn