Full Unicode strings strawman

Mark Davis ☕ mark at macchiato.com
Tue May 17 18:55:21 PDT 2011


That is incorrect. See below.

Mark

*— Il meglio è l’inimico del bene —*


On Tue, May 17, 2011 at 18:33, Wes Garland <wes at page.ca> wrote:

> On 17 May 2011 20:09, Boris Zbarsky <bzbarsky at mit.edu> wrote:
>
>> On 5/17/11 5:24 PM, Wes Garland wrote:
>>
>>> Okay, I think we have to agree to disagree here. I believe my reading of
>>> the spec is correct.
>>>
>>
>> Sorry, but no...  how much more clear can the spec get?
>>
>>
> In the past, I have read it thus, pseudo BNF:
>
> UnicodeString => CodeUnitSequence // D80
> CodeUnitSequence => CodeUnit | CodeUnitSequence CodeUnit // D78
> CodeUnit => <anything in the current encoding form> // D77
>

So far, so good. In particular, d800 is a code unit for UTF-16, since it is
a code unit that can occur in some code unit sequence in UTF-16.


>
> Upon careful re-reading of this part of the specification, I see that D79
> is also important.  It says that "A Unicode encoding form assigns each
> Unicode scalar value to a unique code unit sequence.",
>

True.


> and further clarifies that "The mapping of the set of Unicode scalar values
> to the set of code unit sequences for a Unicode encoding form is
> one-to-one."
>

True.

This is all consistent with saying that UTF-16 can't contain an isolated
d800.
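
To make the D79 mapping concrete, here is a minimal JavaScript sketch of the
UTF-16 encoding form. The function name encodeUtf16 is my own and is not
taken from the standard or from any library. It assumes its input is a single
number and rejects anything that is not a Unicode scalar value. Every scalar
value maps to one code unit or to a high/low surrogate pair, so no input
yields a lone d800.

```javascript
// Hypothetical helper: encode one Unicode scalar value as UTF-16 code units.
function encodeUtf16(scalar) {
  const isSurrogate = scalar >= 0xD800 && scalar <= 0xDFFF;
  if (!Number.isInteger(scalar) || scalar < 0 || scalar > 0x10FFFF || isSurrogate) {
    // Surrogate code points are not scalar values, so they have no encoding.
    throw new RangeError("not a Unicode scalar value");
  }
  if (scalar < 0x10000) {
    return [scalar]; // BMP scalar value: exactly one code unit
  }
  // Supplementary scalar value: split the 20-bit offset into a surrogate pair.
  const offset = scalar - 0x10000;
  return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)];
}

encodeUtf16(0x61);    // [0x61]
encodeUtf16(0x10000); // [0xD800, 0xDC00]
```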

*However, that only shows that a Unicode 16-bit string (D82) is not the same
as a UTF-16 String (D89), which has been pointed out previously.*

Repeating the note under D89:


A Unicode string consisting of a well-formed UTF-16 code unit sequence is
said to be *in UTF-16*. Such a Unicode string is referred to as a *valid
UTF-16 string*, or a *UTF-16 string* for short.

That is, every UTF-16 string is a Unicode 16-bit string, but *not* vice
versa.

Examples:

   - "\u0061\ud800\udc00" is both a Unicode 16-bit string and a UTF-16
   string.
   - "\u0061\ud800" is a Unicode 16-bit string, but not a UTF-16
   string, because d800 is an isolated surrogate.
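
The distinction can be checked mechanically. The sketch below is one way to
test whether a JavaScript string, which is a sequence of 16-bit code units,
is also well-formed UTF-16. The name isWellFormedUtf16 is hypothetical; it is
an illustration and not something defined by ECMAScript or the Unicode
Standard.

```javascript
// Hypothetical helper: true if every surrogate in str is part of a
// high-then-low pair, i.e. the code unit sequence is well-formed UTF-16.
function isWellFormedUtf16(str) {
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // High surrogate: the next code unit must be a low surrogate.
      const next = i + 1 < str.length ? str.charCodeAt(i + 1) : 0;
      if (next < 0xDC00 || next > 0xDFFF) return false;
      i++; // skip the low surrogate just consumed
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      // Low surrogate with no preceding high surrogate.
      return false;
    }
  }
  return true;
}

isWellFormedUtf16("\u0061\ud800\udc00"); // true
isWellFormedUtf16("\u0061\ud800");       // false
```

Both arguments are perfectly good Unicode 16-bit strings; only the first is
a UTF-16 string.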



> This means that your original assertion -- that Unicode strings cannot
> contain the high surrogate code points, regardless of meaning -- is in fact
> correct.
>

That is incorrect.


>
> Which is unfortunate, as it means that we either
>
>    1. Allow non-Unicode strings in JS -- i.e. Strings composed of all
>    values in the set [0x0, 0x1FFFFF]
>    2. Keep making programmers pay the raw-UTF-16 representation tax
>    3. Break the String-as-uint16 pattern
>
> I still believe that #1 is the way forward, and that problem of
> round-tripping these values through the DOM is solvable.
>
> Wes
>
> --
> Wesley W. Garland
> Director, Product Development
> PageMail, Inc.
> +1 613 542 2787 x 102
>
> _______________________________________________
> es-discuss mailing list
> es-discuss at mozilla.org
> https://mail.mozilla.org/listinfo/es-discuss
>
>