Full Unicode strings strawman
Boris Zbarsky
bzbarsky at MIT.EDU
Tue May 17 17:09:23 PDT 2011
On 5/17/11 5:24 PM, Wes Garland wrote:
> UTF-8 and UTF-32. I think UTF-7 can, too, but it is not a standard so
> it's not really worth discussing. UTF-16 is the odd one out.
That's not what the spec says.
> Okay, I think we have to agree to disagree here. I believe my reading of
> the spec is correct.
Sorry, but no. How much clearer can the spec get?
>> There are no such valid UTF-8 strings; see spec quotes above. The
>> proposal would have involved having invalid pseudo-UTF-ish strings.
>
> Yes, you can encode code points d800 - dfff in UTF-8 Strings. These are
> not /well-formed/ strings, but they are Unicode 8-bit Strings (D81)
> nonetheless.
The spec seems to pretty clearly define UTF-8 strings as things that do
NOT contain the encoding of those code points. If you think otherwise,
please cite the text that says so.
> Further, you can't encode code points d800 - dfff in UTF-16 Strings,
Where does the spec say this? And why does that part of the spec not
apply to UTF-8?
> # printf '\xed\xb0\x88' | iconv -f UTF-8 -t UCS-4BE | od -x
> 0000000 0000 dc08
> 0000004
> # printf '\000\000\xdc\x08' | iconv -f UCS-4BE -t UTF-8 | od -x
> 0000000 edb0 8800
> 0000003
As far as I can tell, that second conversion is just an implementation
bug per the spec. See the part I quoted which explicitly says that an
encoder in that situation must stop and return an error.
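For comparison, here is a minimal sketch of an encoder that does stop with an error in that situation. It assumes Python 3's built-in "utf-8" codec, which is my choice for illustration and has nothing to do with iconv or the spec text itself; the variable names are made up. The opt-in "surrogatepass" error handler reproduces the lenient bytes iconv emitted above.

```python
# Assumption: Python 3 standard codecs, used only as an example of a
# strict encoder; this is not the spec's wording nor iconv's behaviour.
lone_surrogate = "\udc08"  # the code point from the iconv example

try:
    lone_surrogate.encode("utf-8")  # strict mode is the default
    strict_encode_failed = False
except UnicodeEncodeError:
    # The strict encoder refuses to produce bytes for a surrogate.
    strict_encode_failed = True

# Opting in to lenient behaviour yields the same three bytes iconv produced.
lenient_bytes = lone_surrogate.encode("utf-8", errors="surrogatepass")

print(strict_encode_failed, lenient_bytes.hex())
```

So the lenient output is available, but only when the caller explicitly asks for something other than strict UTF-8.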
> The difference is that in UTF-8, 0xed 0xb0 0x88 means "The Unicode code
> point 0xdc08"
According to the spec you were citing, that code unit sequence means a
UTF-8 decoder should error, no?
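To make that concrete, here is a small sketch, again assuming Python 3 (my choice, not something from the thread). It unpacks the three-byte pattern by hand to show which code point the bits would name, then asks the strict built-in decoder what it thinks of the same bytes. The names raw, code_point and so on are illustrative only.

```python
# Assumption: Python 3's built-in strict "utf-8" decoder stands in for
# "a UTF-8 decoder"; other implementations (such as iconv) may differ.
raw = bytes([0xED, 0xB0, 0x88])

# Three-byte UTF-8 layout: 1110xxxx 10yyyyyy 10zzzzzz
code_point = ((raw[0] & 0x0F) << 12) | ((raw[1] & 0x3F) << 6) | (raw[2] & 0x3F)

# The bits spell out a value in the surrogate range D800..DFFF.
is_surrogate = 0xD800 <= code_point <= 0xDFFF

try:
    raw.decode("utf-8")
    strict_decode_failed = False
except UnicodeDecodeError:
    # The strict decoder rejects the sequence instead of returning U+DC08.
    strict_decode_failed = True

print(hex(code_point), is_surrogate, strict_decode_failed)
```

The bit pattern does arithmetically yield 0xdc08, but a strict decoder treats the sequence as an error, not as that code point.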
-Boris
More information about the es-discuss mailing list