Full Unicode strings strawman

Mark Davis ☕ mark at macchiato.com
Tue May 17 13:15:55 PDT 2011


The wrong conclusion is being drawn. I can say definitively that, for the
string "a\uD800b":

   - It is a valid Unicode string, according to the Unicode Standard.
   - It cannot be encoded as well-formed text in any UTF-x; that is, it is
   not 'well-formed' in any UTF.
   - When it comes to conversion, the bad code unit \uD800 needs to be
   handled (e.g., converted to U+FFFD, escaped, etc.).
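
To make these three points concrete, here is a minimal JavaScript sketch.
It assumes a runtime that provides the WHATWG TextEncoder (current
browsers and Node.js); that API is used only for illustration and is not
something the thread itself relies on. The helper escapeLoneSurrogates is
a hypothetical name introduced for this example.

```javascript
// The string from the discussion: 'a', a lone high surrogate, 'b'.
const s = "a\uD800b";

// It is an ordinary sequence of three 16-bit code units.
const units = [];
for (let i = 0; i < s.length; i++) {
  units.push(s.charCodeAt(i).toString(16));
}
// units is ["61", "d800", "62"]

// Converting to UTF-8 forces a decision about the bad code unit.
// TextEncoder (an assumption about the runtime) substitutes U+FFFD,
// whose UTF-8 form is the three bytes EF BF BD.
const utf8 = Array.from(new TextEncoder().encode(s));
// utf8 is [0x61, 0xEF, 0xBF, 0xBD, 0x62]

// Escaping is another way to handle it (hypothetical helper).
function escapeLoneSurrogates(str) {
  // Match a valid pair first so its halves are left alone; any other
  // surrogate code unit that matches is unpaired.
  return str.replace(/[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDFFF]/g, (m) =>
    m.length === 2 ? m : "\\u" + m.charCodeAt(0).toString(16).toUpperCase()
  );
}
```

Calling escapeLoneSurrogates(s) yields the eight-character text a\uD800b
with a literal backslash, while a properly paired surrogate sequence
passes through unchanged.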

Any programming language using Unicode has the choice of either

   1. allowing strings to be general Unicode strings, or
   2. guaranteeing that they are always well-formed.

There are trade-offs either way, but both are feasible.
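
As a rough sketch of what the two choices could look like in JavaScript:
the function names below (isWellFormedUTF16, makeGeneralString,
makeWellFormedString) are hypothetical and exist only for this example.
They are not part of any proposal in this thread.

```javascript
// Returns true when every surrogate code unit in str belongs to a
// high+low pair, i.e. str is well-formed UTF-16.
function isWellFormedUTF16(str) {
  for (let i = 0; i < str.length; i++) {
    const c = str.charCodeAt(i);
    if (c >= 0xd800 && c <= 0xdbff) {
      // High surrogate: the next unit must be a low surrogate.
      const next = i + 1 < str.length ? str.charCodeAt(i + 1) : 0;
      if (next < 0xdc00 || next > 0xdfff) return false;
      i++; // skip the low half of the pair
    } else if (c >= 0xdc00 && c <= 0xdfff) {
      // Low surrogate with no preceding high surrogate.
      return false;
    }
  }
  return true;
}

// Choice 1: accept any code unit sequence; ill-formed input only
// matters later, at conversion time.
function makeGeneralString(str) {
  return str;
}

// Choice 2: refuse ill-formed input at construction time.
function makeWellFormedString(str) {
  if (!isWellFormedUTF16(str)) {
    throw new RangeError("string contains an unpaired surrogate");
  }
  return str;
}
```

With choice 1, "a\uD800b" is accepted as a three-unit string. With choice
2 it is rejected up front, and any operation that can split a surrogate
pair (slicing at a code unit index, for instance) would presumably need
the same check to keep the guarantee.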

Mark

*— Il meglio è l’inimico del bene —*


On Tue, May 17, 2011 at 13:03, Boris Zbarsky <bzbarsky at mit.edu> wrote:

> On 5/17/11 3:29 PM, Wes Garland wrote:
>
>> But the point remains, the FAQ entry you quote talks about encoding a
>> lone surrogate, i.e. a code unit, which is not a complete code point.
>> You can only convert complete code points from one encoding to another.
>> Just like you can't represent part of a UTF-8 code sub-sequence in any
>> other encoding. The fact that code point X is not representable in
>> UTF-16 has no bearing on its status as a code point, nor its
>> convertibility to UTF-8.  The problem is that UTF-16 cannot represent
>> all possible code points.
>>
>
> My point is that neither can UTF-8.  Can you name an encoding that _can_
> represent the surrogate-range codepoints?
>
>
>   From page 90 of the Unicode 6.0 specification, in the Conformance
>> chapter:
>>
>>    /D80 Unicode string:/ A code unit sequence containing code units of
>>    a particular Unicode encoding form.
>>    • In the rawest form, Unicode strings may be implemented simply as
>>    arrays of the appropriate integral data type, consisting of a
>>    sequence of code units lined up one immediately after the other.
>>    • A single Unicode string must contain only code units from a single
>>    Unicode encoding form. It is not permissible to mix forms within a
>>    string.
>>
>>
>>
>>    Not sure what "(D80)" is supposed to mean.
>>
>>
>> Sorry, "(D80)" means "per definition D80 of The Unicode Standard,
>> Version 6.0"
>>
>
> Ah, ok.  So the problem there is that this definition only makes sense
> when a particular Unicode encoding form has been chosen.  Which Unicode
> encoding form have we chosen here?
>
> But note also that D76 in that same document says:
>
>  Unicode scalar value: Any Unicode code point except high-surrogate
>                        and low-surrogate code points.
>
> and D79 says:
>
>  A Unicode encoding form assigns each Unicode scalar value to a unique
>  code unit sequence.
>
> and
>
>  To ensure that the mapping for a Unicode encoding form is
>  one-to-one, all Unicode scalar values, including those
>  corresponding to noncharacter code points and unassigned code
>  points, must be mapped to unique code unit sequences. Note that
>  this requirement does not extend to high-surrogate and
>  low-surrogate code points, which are excluded by definition from
>  the set of Unicode scalar values.
>
> In particular, this makes it clear (to me, at least) that whatever Unicode
> encoding form you choose, a "Unicode string" can only consist of code units
> encoding Unicode scalar values, which does NOT include high and low
> surrogates.
>
> Therefore I stand by my statement: if you allow what to me looks like
> arrays of "UTF-32 code units and also values that fall into the surrogate
> ranges" then you don't get Unicode strings.  You get a set of arrays that
> contains Unicode strings as a proper subset.
>
>
>     OK, that seems like a breaking change.
>>
>> Yes, I believe it would be, certainly if done naively, but I am hopeful
>> somebody can figure out how to overcome this.
>>
>
> As long as we worry about that _before_ enshrining the result in a spec,
> I'm all for being hopeful.
>
>
>     Maybe, and maybe not.  We (Mozilla) have had some proposals to
>>    actually use UTF-8 throughout, including in the JS engine; it's
>>    quite possible to implement an API that looks like a 16-bit array on
>>    top of UTF-8 as long as you allow invalid UTF-8 that's needed to
>>    represent surrogates and the like.
>>
>>
>> I understand by this that in the Moz proposals, you mean that the
>> "invalid" UTF-8 sequences are actually valid UTF-8 Strings which encode
>> code points in the range 0xd800-0xdfff
>>
>
> There are no such valid UTF-8 strings; see spec quotes above.  The proposal
> would have involved having invalid pseudo-UTF-ish strings.
>
>
>  and that these code points were
>> translated directly (and purposefully incorrectly) as UTF-16 code units
>> when viewed as 16-bit arrays.
>>
>
> Yep.
>
>
>  If JS Strings were arrays of Unicode code points, this conversion would
>> be a non-issue; UTF-8 sequence 0xed 0xb0 0x88 becomes Unicode code point
>> 0xdc08, with no incorrect conversion taking place.
>>
>
> Sorry, no.  See above.
>
>
>  The only problem is
>> if there is an intermediate component somewhere that insists on using
>> UTF-16... at that point we just can't represent code point 0xdc08 at all.
>>
>
> I just don't get it.  You can stick the invalid 16-bit value 0xdc08 into a
> "UTF-16" string just as easily as you can stick the invalid 24-bit sequence
> 0xed 0xb0 0x88 into a "UTF-8" string.  Can you please, please tell me what
> made you decide there's _any_ difference between the two cases?  They're
> equally invalid in _exactly_ the same way.
>
>
> -Boris