Full Unicode strings strawman

Boris Zbarsky bzbarsky at MIT.EDU
Tue May 17 13:03:19 PDT 2011


On 5/17/11 3:29 PM, Wes Garland wrote:
> But the point remains, the FAQ entry you quote talks about encoding a
> lone surrogate, i.e. a code unit, which is not a complete code point.
> You can only convert complete code points from one encoding to another.
> Just like you can't represent part of a UTF-8 code sub-sequence in any
> other encoding. The fact that code point X is not representable in
> UTF-16 has no bearing on its status as a code point, nor its
> convertability to UTF-8.  The problem is that UTF-16 cannot represent
> all possible code points.

My point is that neither can UTF-8.  Can you name an encoding that _can_ 
represent the surrogate-range codepoints?
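
For illustration, here's a quick sketch in JS (assuming an engine whose 
encodeURIComponent follows the spec'd Encode operation, which has to 
produce UTF-8 bytes):

   // A lone surrogate has no UTF-8 representation, so Encode has
   // nothing it can emit:
   encodeURIComponent("\uD800");       // throws URIError
   // A surrogate *pair* is fine, because it denotes the scalar
   // value U+10000:
   encodeURIComponent("\uD800\uDC00"); // "%F0%90%80%80"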

>  From page 90 of the Unicode 6.0 specification, in the Conformance chapter:
>
>     /D80 Unicode string:/ A code unit sequence containing code units
>     of a particular Unicode encoding form.
>     • In the rawest form, Unicode strings may be implemented simply as
>     arrays of the appropriate integral data type, consisting of a
>     sequence of code units lined up one immediately after the other.
>     • A single Unicode string must contain only code units from a
>     single Unicode encoding form. It is not permissible to mix forms
>     within a string.
>
>
>
>     Not sure what "(D80)" is supposed to mean.
>
>
> Sorry, "(D80)" means "per definition D80 of The Unicode Standard,
> Version 6.0"

Ah, ok.  So the problem there is that this definition only makes sense 
when a particular Unicode encoding form has been chosen.  Which Unicode 
encoding form have we chosen here?

But note also that D76 in that same document says:

   Unicode scalar value: Any Unicode code point except high-surrogate
                         and low-surrogate code points.

and D79 says:

   A Unicode encoding form assigns each Unicode scalar value to a unique
   code unit sequence.

and

   To ensure that the mapping for a Unicode encoding form is
   one-to-one, all Unicode scalar values, including those
   corresponding to noncharacter code points and unassigned code
   points, must be mapped to unique code unit sequences. Note that
   this requirement does not extend to high-surrogate and
   low-surrogate code points, which are excluded by definition from
   the set of Unicode scalar values.

In particular, this makes it clear (to me, at least) that whatever 
Unicode encoding form you choose, a "Unicode string" can only consist of 
code units encoding Unicode scalar values, which does NOT include high 
and low surrogates.

Therefore I stand by my statement: if you allow what to me looks like 
arrays of "UTF-32 code units and also values that fall into the 
surrogate ranges", then you don't get Unicode strings.  You get a set 
of arrays that contains Unicode strings as a proper subset.
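
As a sketch of what D76 amounts to (the function name here is just for 
illustration):

   // Per D76: every code point 0..0x10FFFF is a Unicode scalar value
   // EXCEPT the surrogate range U+D800..U+DFFF.
   function isUnicodeScalarValue(cp) {
     return cp >= 0 && cp <= 0x10FFFF &&
            !(cp >= 0xD800 && cp <= 0xDFFF);
   }
   isUnicodeScalarValue(0x0041);  // true  (LATIN CAPITAL LETTER A)
   isUnicodeScalarValue(0xDC08);  // false (low surrogate)
   isUnicodeScalarValue(0x10000); // true  (supplementary-plane scalar)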

>     OK, that seems like a breaking change.
>
> Yes, I believe it would be, certainly if done naively, but I am hopeful
> somebody can figure out how to overcome this.

As long as we worry about that _before_ enshrining the result in a spec, 
I'm all for being hopeful.

>     Maybe, and maybe not.  We (Mozilla) have had some proposals to
>     actually use UTF-8 throughout, including in the JS engine; it's
>     quite possible to implement an API that looks like a 16-bit array on
>     top of UTF-8 as long as you allow invalid UTF-8 that's needed to
>     represent surrogates and the like.
>
>
> I understand by this that in the Moz proposals, you mean that the
> "invalid" UTF-8 sequences are actually valid UTF-8 Strings which encode
> code points in the range 0xd800-0xdfff

There are no such valid UTF-8 strings; see spec quotes above.  The 
proposal would have involved having invalid pseudo-UTF-ish strings.

> and that these code points were
> translated directly (and purposefully incorrectly) as UTF-16 code units
> when viewed as 16-bit arrays.

Yep.
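
To make that concrete, here's a sketch of the decoding step being 
described; it's exactly the step that isn't valid UTF-8, since the 
result lands in the surrogate range:

   // Treat 0xED 0xB0 0x88 as an ordinary 3-byte sequence:
   var b  = [0xED, 0xB0, 0x88];
   var cp = ((b[0] & 0x0F) << 12) | ((b[1] & 0x3F) << 6) | (b[2] & 0x3F);
   cp.toString(16); // "dc08", exposed directly as the 16-bit unit 0xDC08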

> If JS Strings were arrays of Unicode code points, this conversion would
> be a non-issue; UTF-8 sequence 0xed 0xb0 0x88 becomes Unicode code point
> 0xdc08, with no incorrect conversion taking place.

Sorry, no.  See above.

> The only problem is
> if there is an intermediate component somewhere that insists on using
> UTF-16... at that point we just can't represent code point 0xdc08 at all.

I just don't get it.  You can stick the invalid 16-bit value 0xdc08 into 
a "UTF-16" string just as easily as you can stick the invalid 24-bit 
sequence 0xed 0xb0 0x88 into a "UTF-8" string.  Can you please, please 
tell me what made you decide there's _any_ difference between the two 
cases?  They're equally invalid in _exactly_ the same way.
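
Put side by side (a sketch, variable names just for illustration):

   // Not well-formed UTF-16: a lone low surrogate code unit.
   var utf16ish = "\uDC08";
   // Not well-formed UTF-8: the 3-byte sequence that would decode
   // to that same surrogate code point.
   var utf8ish  = [0xED, 0xB0, 0x88];
   // Neither sequence encodes a Unicode scalar value (D76), so
   // neither one is a Unicode string in the D80 sense.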

-Boris

