Full Unicode strings strawman
bzbarsky at MIT.EDU
Tue May 17 13:03:19 PDT 2011
On 5/17/11 3:29 PM, Wes Garland wrote:
> But the point remains, the FAQ entry you quote talks about encoding a
> lone surrogate, i.e. a code unit, which is not a complete code point.
> You can only convert complete code points from one encoding to another.
> Just like you can't represent part of a UTF-8 code sub-sequence in any
> other encoding. The fact that code point X is not representable in
> UTF-16 has no bearing on its status as a code point, nor its
> convertibility to UTF-8. The problem is that UTF-16 cannot represent
> all possible code points.
My point is that neither can UTF-8. Can you name an encoding that _can_
represent the surrogate-range codepoints?
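To make the question concrete, here is a minimal sketch. It assumes Python 3 (3.4 or later) and treats its strict codecs as a stand-in for a conforming encoder; the helper name can_encode is made up for illustration. It simply tries to encode the lone code point U+DC08 in each of the three encoding forms:

```python
# Illustrative only: Python's strict codecs stand in for "a conforming encoder".
LONE_SURROGATE = "\udc08"  # a str holding the single code point U+DC08


def can_encode(text: str, encoding: str) -> bool:
    """Return True if the strict codec for `encoding` accepts `text`."""
    try:
        text.encode(encoding)  # errors="strict" is the default
        return True
    except UnicodeEncodeError:
        return False


results = {
    enc: can_encode(LONE_SURROGATE, enc)
    for enc in ("utf-8", "utf-16", "utf-32")
}
print(results)
```

On a current CPython every entry should come out False: none of the three encoding forms has a code unit sequence for a surrogate code point.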
> From page 90 of the Unicode 6.0 specification, in the Conformance chapter:
> /D80 Unicode string:/ A code unit sequence containing code units of a
> particular Unicode encoding form.
> • In the rawest form, Unicode strings may be implemented simply as
>   arrays of the appropriate integral data type, consisting of a
>   sequence of code units lined up one immediately after the other.
> • A single Unicode string must contain only code units from a single
>   encoding form. It is not permissible to mix forms within a string.
>> Not sure what "(D80)" is supposed to mean.
> Sorry, "(D80)" means "per definition D80 of The Unicode Standard,
> Version 6.0"
Ah, ok. So the problem there is that this definition only makes
sense when a particular Unicode encoding form has been chosen. Which
Unicode encoding form have we chosen here?
But note also that D76 in that same document says:
    Unicode scalar value: Any Unicode code point except high-surrogate
    and low-surrogate code points.
and D79 says:
    A Unicode encoding form assigns each Unicode scalar value to a unique
    code unit sequence.
    To ensure that the mapping for a Unicode encoding form is
    one-to-one, all Unicode scalar values, including those
    corresponding to noncharacter code points and unassigned code
    points, must be mapped to unique code unit sequences. Note that
    this requirement does not extend to high-surrogate and
    low-surrogate code points, which are excluded by definition from
    the set of Unicode scalar values.
In particular, this makes it clear (to me, at least) that whatever
Unicode encoding form you choose, a "Unicode string" can only consist of
code units encoding Unicode scalar values, which does NOT include high
and low surrogates.
Therefore I stand by my statement: if you allow what to me looks like
arrays of "UTF-32 code units and also values that fall into the surrogate
ranges" then you don't get Unicode strings. You get a set of arrays
that contains Unicode strings as a proper subset.
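The subset relationship can be sketched in a few lines. This assumes Python 3 and models a string as a plain sequence of integers; the function names are hypothetical, and the range check is just D76 written out:

```python
from typing import Sequence


def is_scalar_value(cp: int) -> bool:
    """D76: any code point except the surrogate range D800..DFFF."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def is_well_formed_utf32(units: Sequence[int]) -> bool:
    """A UTF-32 code unit sequence may only hold scalar values."""
    return all(is_scalar_value(u) for u in units)


# An array of scalar values: a Unicode string in the UTF-32 encoding form.
plain = [0x48, 0x69, 0x1F600]
# The same kind of array with a surrogate-range value mixed in.
with_surrogate = [0x48, 0x69, 0xDC08]

print(is_well_formed_utf32(plain), is_well_formed_utf32(with_surrogate))
```

Every array accepted by is_well_formed_utf32 is also an "array of 21-bit values", but not the other way around, which is the proper-subset claim.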
>> OK, that seems like a breaking change.
> Yes, I believe it would be, certainly if done naively, but I am hopeful
> somebody can figure out how to overcome this.
As long as we worry about that _before_ enshrining the result in a spec,
I'm all for being hopeful.
>> Maybe, and maybe not. We (Mozilla) have had some proposals to
>> actually use UTF-8 throughout, including in the JS engine; it's
>> quite possible to implement an API that looks like a 16-bit array on
>> top of UTF-8 as long as you allow invalid UTF-8 that's needed to
>> represent surrogates and the like.
> I understand by this that in the Moz proposals, you mean that the
> "invalid" UTF-8 sequences are actually valid UTF-8 Strings which encode
> code points in the range 0xd800-0xdfff
There are no such valid UTF-8 strings; see spec quotes above. The
proposal would have involved having invalid pseudo-UTF-ish strings.
> and that these code points were
> translated directly (and purposefully incorrectly) as UTF-16 code units
> when viewed as 16-bit arrays.
> If JS Strings were arrays of Unicode code points, this conversion would
> be a non-issue; UTF-8 sequence 0xed 0xb0 0x88 becomes Unicode code point
> 0xdc08, with no incorrect conversion taking place.
Sorry, no. See above.
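For reference, the arithmetic behind the quoted byte sequence can be written out. This is a sketch assuming Python 3; decode_three_byte is a hypothetical helper that applies the three-byte UTF-8 bit layout mechanically and deliberately performs no validity checks:

```python
def decode_three_byte(b0: int, b1: int, b2: int) -> int:
    """Apply the 1110xxxx 10xxxxxx 10xxxxxx layout with no validation."""
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


value = decode_three_byte(0xED, 0xB0, 0x88)
in_surrogate_range = 0xD800 <= value <= 0xDFFF

print(hex(value), in_surrogate_range)
```

The bit-shuffling does yield 0xdc08, but that value falls in the surrogate range, so it is not a Unicode scalar value and the byte sequence is not the UTF-8 encoding of anything.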
> The only problem is
> if there is an intermediate component somewhere that insists on using
> UTF-16... at that point we just can't represent code point 0xdc08 at all.
I just don't get it. You can stick the invalid 16-bit value 0xdc08 into
a "UTF-16" string just as easily as you can stick the invalid 24-bit
sequence 0xed 0xb0 0x88 into a "UTF-8" string. Can you please, please
tell me what made you decide there's _any_ difference between the two
cases? They're equally invalid in _exactly_ the same way.
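A small experiment shows the symmetry. It assumes Python 3 (3.4 or later), uses its strict codecs as a proxy for a conforming decoder, and uses the "surrogatepass" error handler as a proxy for the lenient pseudo-UTF behaviour; the helper name strict_decode_ok is made up:

```python
AS_UTF8 = bytes([0xED, 0xB0, 0x88])      # the 24-bit sequence from the thread
AS_UTF16 = (0xDC08).to_bytes(2, "big")   # the 16-bit value, big-endian


def strict_decode_ok(data: bytes, encoding: str) -> bool:
    """Return True if the strict decoder accepts `data`."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


utf8_ok = strict_decode_ok(AS_UTF8, "utf-8")
utf16_ok = strict_decode_ok(AS_UTF16, "utf-16-be")

# Lenient mode: both byte strings yield the same lone surrogate.
lenient_utf8 = AS_UTF8.decode("utf-8", "surrogatepass")
lenient_utf16 = AS_UTF16.decode("utf-16-be", "surrogatepass")

print(utf8_ok, utf16_ok, lenient_utf8 == lenient_utf16)
```

Both strict decodes should fail, and both lenient decodes should produce the same U+DC08, so neither container is any more or less able to hold the invalid value than the other.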
More information about the es-discuss