Full Unicode strings strawman

Wes Garland wes at page.ca
Tue May 17 14:24:25 PDT 2011


On 17 May 2011 16:03, Boris Zbarsky <bzbarsky at mit.edu> wrote:

> On 5/17/11 3:29 PM, Wes Garland wrote:
>
>> The problem is that UTF-16 cannot represent
>> all possible code points.
>>
>
> My point is that neither can UTF-8.  Can you name an encoding that _can_
> represent the surrogate-range codepoints?
>

UTF-8 and UTF-32.  I think UTF-7 can, too, but it is not a standard so it's
not really worth discussing.  UTF-16 is the odd one out.

> Therefore I stand by my statement: if you allow what to me looks like arrays of
> "UTF-32 code units and also values that fall into the surrogate ranges" then
> you don't get Unicode strings.  You get a set of arrays that contains
> Unicode strings as a proper subset.
>

Okay, I think we have to agree to disagree here. I believe my reading of the
spec is correct.


> There are no such valid UTF-8 strings; see spec quotes above.  The proposal
> would have involved having invalid pseudo-UTF-ish strings.
>

Yes, you can encode code points d800 - dfff in UTF-8 Strings.  These are not
*well-formed* strings, but they are Unicode 8-bit Strings (D81) nonetheless.
What you can't do is encode 16-bit code units in UTF-8 Strings. This is
because you can only convert from one encoding to another via code points.
Code units have no cross-encoding meaning.
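
For concreteness, a minimal sketch (the helper name is mine, not anything in
the spec or any library) of how the generic three-byte UTF-8 pattern applies
mechanically to such a code point; the result is not well-formed UTF-8, but
the byte sequence is completely determined:

    // Apply the generic UTF-8 three-byte pattern 1110xxxx 10xxxxxx 10xxxxxx
    // to a code point in the range 0800-ffff.  For d800-dfff the output is
    // ill-formed UTF-8, but the mapping itself is purely mechanical.
    function codePointToUtf8Bytes(cp) {
      return [
        0xE0 | (cp >> 12),          // leading byte: top 4 bits
        0x80 | ((cp >> 6) & 0x3F),  // continuation byte: middle 6 bits
        0x80 | (cp & 0x3F)          // continuation byte: low 6 bits
      ];
    }

    codePointToUtf8Bytes(0xDC08);   // [0xED, 0xB0, 0x88]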

Further, you can't encode code points d800 - dfff in UTF-16 Strings at all,
which leaves you no way to store those values in JS Strings (i.e. when using
them as uint16[]) other than by generating ill-formed UTF-16. I believe it
would be far better to treat those values as Unicode code points, not 16-bit
code units, and to allow JS String elements to be able to express the whole
21-bit code point range afforded by Unicode.

In other words, the current mis-use of JS Strings, which can store "characters"
0-ffff as ill-formed UTF-16, would become use of JS Strings to store code
points 0-1FFFFF, possibly including the reserved surrogate code points
d800-dfff. Those cannot be represented in UTF-16, but they CAN be represented,
without loss, in UTF-8, UTF-32, and the proposed new JS Strings.
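
To make the difference concrete, a small sketch of today's behaviour next to
what I am describing (the second half is hypothetical, not current
ECMAScript):

    // Today: a non-BMP character occupies two UTF-16 code units.
    var smiley = "\uD83D\uDE00";        // U+1F600 as a surrogate pair
    smiley.length;                      // 2
    smiley.charCodeAt(0).toString(16);  // "d83d" -- half a character

    // Under the proposal as I read it (hypothetical, not current behaviour),
    // each String element would hold a full code point: the character above
    // would be a single element with value 0x1F600, and a lone 0xDC08
    // element would simply be the code point dc08 -- round-trippable through
    // UTF-8 and UTF-32, though not through well-formed UTF-16.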


>>  If JS Strings were arrays of Unicode code points, this conversion would
>> be a non-issue; UTF-8 sequence 0xed 0xb0 0x88 becomes Unicode code point
>> 0xdc08, with no incorrect conversion taking place.
>>
>
> Sorry, no.  See above.
>

# printf '\xed\xb0\x88' | iconv -f UTF-8 -t UCS-4BE | od -x
0000000 0000 dc08
0000004
# printf '\000\000\xdc\x08' | iconv -f UCS-4BE -t UTF-8 | od -x
0000000 edb0 8800
0000003
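
The same arithmetic spelled out in JS, using nothing but the generic UTF-8
bit layout (no well-formedness checks):

    // Decode ed b0 88 by the three-byte pattern 1110xxxx 10xxxxxx 10xxxxxx.
    var bytes = [0xED, 0xB0, 0x88];
    var cp = ((bytes[0] & 0x0F) << 12) |
             ((bytes[1] & 0x3F) << 6)  |
              (bytes[2] & 0x3F);
    cp.toString(16);                    // "dc08"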

> I just don't get it.  You can stick the invalid 16-bit value 0xdc08 into a
> "UTF-16" string just as easily as you can stick the invalid 24-bit sequence
> 0xed 0xb0 0x88 into a "UTF-8" string.  Can you please, please tell me what
> made you decide there's _any_ difference between the two cases?  They're
> equally invalid in _exactly_ the same way.
>
>
The difference is that in UTF-8, 0xed 0xb0 0x88 means "The Unicode code
point 0xdc08", and in UTF-16 0xdc08 means "Part of some non-BMP code point".

Said another way, 0xed in UTF-8 has nearly the same meaning as 0xdc08 in
UTF-16.  Both are ill-formed code unit subsequences which do not represent a
code point (D84a).
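
That is, 0xdc08 in UTF-16 only acquires meaning when paired with a preceding
high surrogate; the standard surrogate-pair arithmetic, sketched in JS:

    // Combine a high/low surrogate pair into the non-BMP code point it
    // encodes.  A lone low surrogate has no such interpretation.
    function combineSurrogatePair(high, low) {
      return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
    }

    combineSurrogatePair(0xD800, 0xDC08).toString(16);  // "10008"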

Wes

-- 
Wesley W. Garland
Director, Product Development
PageMail, Inc.
+1 613 542 2787 x 102