Full Unicode strings strawman

Shawn Steele Shawn.Steele at microsoft.com
Tue May 17 14:30:53 PDT 2011


> The difference is that in UTF-8, 0xed 0xb0 0x88 means "The Unicode code point 0xdc08",

In UTF-8, 0xed 0xb0 0x88 means “Garbage, please replace me with U+FFFD”.  CESU-8 allows this sequence, but it is illegal in UTF-8.  The Windows SDK and .NET both reject ill-formed UTF-8 sequences for security reasons.  I’m sure you can still find other libraries that accept them, but this sequence is ill-formed and considered a security threat.  D92 of Unicode 5.0 makes this clear.
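
To make the UTF-8 point concrete, here is a minimal sketch that uses Python's built-in codec as a stand-in for a strict decoder. It is not the Windows SDK or .NET API, and the names SURROGATE_BYTES, strict_utf8 and lenient_utf8 are made up for illustration:

```python
from typing import Optional

# The three bytes from the quoted message. They would encode U+DC08
# if UTF-8 permitted surrogate code points, which it does not.
SURROGATE_BYTES = bytes([0xED, 0xB0, 0x88])


def strict_utf8(data: bytes) -> Optional[str]:
    """Return the decoded text, or None if data is not well-formed UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


def lenient_utf8(data: bytes) -> str:
    """Decode, substituting U+FFFD for anything that is ill-formed."""
    return data.decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(strict_utf8(SURROGATE_BYTES))         # rejected by the strict path
    print(ascii(lenient_utf8(SURROGATE_BYTES)))  # only replacement characters
```

Either way, no surrogate code point comes out of the decoder: the input is rejected or turned into U+FFFD.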
> and in UTF-16 0xdc08 means "Part of some non-BMP code point".

Only if a 0xd800-0xdbff code unit comes immediately before it.  Otherwise it is also ill-formed.
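
The pairing rule can be sketched as a check over 16-bit code units. This is only an illustration under the usual UTF-16 definitions, with hypothetical helper names (is_well_formed_utf16, combine), not code from any particular library:

```python
from typing import Sequence


def is_high(unit: int) -> bool:
    return 0xD800 <= unit <= 0xDBFF


def is_low(unit: int) -> bool:
    return 0xDC00 <= unit <= 0xDFFF


def is_well_formed_utf16(units: Sequence[int]) -> bool:
    """True if every surrogate is part of a high+low pair, in that order."""
    i = 0
    while i < len(units):
        if is_high(units[i]):
            # A high surrogate must be followed directly by a low one.
            if i + 1 < len(units) and is_low(units[i + 1]):
                i += 2
                continue
            return False
        if is_low(units[i]):
            # A low surrogate with no high surrogate before it.
            return False
        i += 1
    return True


def combine(high: int, low: int) -> int:
    """Code point encoded by a valid surrogate pair."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


if __name__ == "__main__":
    print(is_well_formed_utf16([0xDC08]))          # lone low surrogate
    print(is_well_formed_utf16([0xD800, 0xDC08]))  # proper pair
    print(hex(combine(0xD800, 0xDC08)))
```

So 0xdc08 on its own fails the check, while 0xd800 0xdc08 together is a single non-BMP code point.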
-Shawn