Full Unicode strings strawman
Shawn.Steele at microsoft.com
Tue May 17 14:30:53 PDT 2011
> The difference is that in UTF-8, 0xed 0xb0 0x88 means "The Unicode code point 0xdc08",
In UTF-8, 0xed 0xb0 0x88 means “Garbage, please replace me with U+FFFD”. CESU-8 allows this, but that sequence is illegal in UTF-8. The Windows SDK and .NET both reject ill-formed UTF-8 sequences for security reasons. I’m sure you can still find other libraries that accept them, but this sequence is ill-formed and is considered a security threat. D92 of Unicode 5.0 makes this clear.
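To make that concrete, here is a minimal sketch in Python (chosen only for illustration; it is not the Windows SDK or .NET code path). The helper names are hypothetical. It shows a strict UTF-8 decoder rejecting the sequence, a lenient one substituting U+FFFD, and Python's opt-in "surrogatepass" handler, which is the only mode that yields 0xdc08.

```python
# Hypothetical helpers for illustration; SEQ is the byte sequence under discussion.
SEQ = bytes([0xED, 0xB0, 0x88])


def strict_decode_fails(data: bytes) -> bool:
    """Return True if a strict UTF-8 decoder rejects the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


def lenient_decode(data: bytes) -> str:
    """Decode, replacing each ill-formed subsequence with U+FFFD."""
    return data.decode("utf-8", errors="replace")


def decode_allowing_surrogates(data: bytes) -> str:
    """Decode with surrogates explicitly permitted (non-conformant UTF-8)."""
    return data.decode("utf-8", errors="surrogatepass")


if __name__ == "__main__":
    print(strict_decode_fails(SEQ))                    # strict decoding rejects it
    print(set(lenient_decode(SEQ)) == {"\ufffd"})      # only replacement characters come out
    print(hex(ord(decode_allowing_surrogates(SEQ))))   # the lone surrogate, if you opt in
```

The point of the sketch is that getting 0xdc08 out of those bytes requires deliberately switching off conformance checking; the default behaviour treats them as garbage.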
> and in UTF-16 0xdc08 means "Part of some non-BMP code point".
Only if there was a 0xd800-0xdbff code unit immediately before it. Otherwise it is also ill-formed.
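The pairing rule can be sketched as a small check over a list of 16-bit code units. This is an illustrative Python function with a hypothetical name, not an API from any particular library.

```python
def is_well_formed_utf16(units):
    """Return True if every surrogate in `units` is part of a valid pair.

    `units` is a sequence of integers in the range 0x0000-0xFFFF.
    """
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: must be followed directly by a low surrogate.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return False
        if 0xDC00 <= u <= 0xDFFF:
            # Low surrogate with no high surrogate before it.
            return False
        i += 1
    return True


if __name__ == "__main__":
    print(is_well_formed_utf16([0xDC08]))          # lone low surrogate
    print(is_well_formed_utf16([0xD800, 0xDC08]))  # properly paired
```

Under this check, 0xdc08 on its own is rejected, and it is accepted only when a code unit in 0xd800-0xdbff sits directly in front of it.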