Full Unicode strings strawman

Boris Zbarsky bzbarsky at MIT.EDU
Tue May 17 17:09:23 PDT 2011


On 5/17/11 5:24 PM, Wes Garland wrote:
> UTF-8 and UTF-32.  I think UTF-7 can, too, but it is not a standard so
> it's not really worth discussing.  UTF-16 is the odd one out.

That's not what the spec says.

> Okay, I think we have to agree to disagree here. I believe my reading of
> the spec is correct.

Sorry, but no...  how much clearer can the spec get?

>     There are no such valid UTF-8 strings; see spec quotes above.  The
>     proposal would have involved having invalid pseudo-UTF-ish strings.
>
>
> Yes, you can encode code points d800 - dfff in UTF-8 Strings.  These are
> not /well-formed/ strings, but they are Unicode 8-bit Strings (D81)
> nonetheless.

The spec seems to pretty clearly define UTF-8 strings as code unit
sequences that do NOT contain the encoding of those code points.  If you
think otherwise, please cite the relevant text.

> Further, you can't encode code points d800 - dfff in UTF-16 Strings,

Where does the spec say this?  And why does that part of the spec not 
apply to UTF-8?

> # printf '\xed\xb0\x88' | iconv -f UTF-8 -t UCS-4BE | od -x
> 0000000 0000 dc08
> 0000004
> # printf '\000\000\xdc\x08' | iconv -f UCS-4BE -t UTF-8 | od -x
> 0000000 edb0 8800
> 0000003

As far as I can tell, that second conversion is just an implementation 
bug per the spec.  See the part I quoted which explicitly says that an 
encoder in that situation must stop and return an error.
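
To make the encoder side concrete, here is a minimal sketch in Python.
It is not taken from the spec or from iconv; the helper names
encode_scalar_utf8 and builtin_rejects are made up for illustration, and
the only assumption is Python 3's built-in strict "utf-8" codec.

```python
def encode_scalar_utf8(code_point: int) -> bytes:
    """Encode a single code point as UTF-8, rejecting surrogates.

    Hypothetical helper for illustration only.
    """
    if 0xD800 <= code_point <= 0xDFFF:
        # Surrogate code points are not Unicode scalar values, so there
        # is no well-formed UTF-8 output to produce for them.
        raise ValueError(f"U+{code_point:04X} is a surrogate code point")
    return chr(code_point).encode("utf-8")


def builtin_rejects(code_point: int) -> bool:
    """Return True if Python's own strict UTF-8 encoder raises an error."""
    try:
        chr(code_point).encode("utf-8")
    except UnicodeEncodeError:
        return True
    return False
```

With these definitions, builtin_rejects(0xDC08) is True: Python's strict
encoder stops with an error rather than emitting ed b0 88.  Python can be
told to emit those bytes anyway with the "surrogatepass" error handler,
but that is an explicit opt-out of strict UTF-8.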

> The difference is that in UTF-8, 0xed 0xb0 0x88 means "The Unicode code
> point 0xdc08"

According to the spec you were citing, that code unit sequence means a 
UTF-8 decoder should error, no?
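
A small Python sketch of the decoder side, under the same caveats as any
illustration: the function names are hypothetical, and the strictness
shown is that of Python 3's built-in "utf-8" codec, not of iconv.  The
first function applies the 3-byte bit layout blindly, which is where the
dc08 reading comes from; the second asks a strict decoder whether the
sequence is acceptable at all.

```python
def raw_three_byte_value(data: bytes) -> int:
    """Apply the 3-byte UTF-8 bit layout with no validity checks.

    Hypothetical helper: it shows the arithmetic only and is not a
    conformant decoder.
    """
    b0, b1, b2 = data
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

Here raw_three_byte_value(b"\xed\xb0\x88") gives 0xDC08, but
is_well_formed_utf8(b"\xed\xb0\x88") is False: the strict decoder raises
an error instead of producing a surrogate code point.  The neighbouring
sequences ed 9f bf (U+D7FF) and ee 80 80 (U+E000) are accepted.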

-Boris

