Full Unicode strings strawman

Mark Davis ☕ mark at macchiato.com
Tue May 17 18:55:21 PDT 2011

That is incorrect. See below.


*— Il meglio è l’inimico del bene —*

On Tue, May 17, 2011 at 18:33, Wes Garland <wes at page.ca> wrote:

> On 17 May 2011 20:09, Boris Zbarsky <bzbarsky at mit.edu> wrote:
>> On 5/17/11 5:24 PM, Wes Garland wrote:
>>> Okay, I think we have to agree to disagree here. I believe my reading of
>>> the spec is correct.
>> Sorry, but no...  how much more clear can the spec get?
> In the past, I have read it thus, pseudo BNF:
> UnicodeString => CodeUnitSequence // D80
> CodeUnitSequence => CodeUnit | CodeUnitSequence CodeUnit // D78
> CodeUnit => <anything in the current encoding form> // D77

So far, so good. In particular, d800 is a code unit for UTF-16, since it is
a code unit that can occur in some code unit sequence in UTF-16.

> Upon careful re-reading of this part of the specification, I see that D79
> is also important.  It says that "A Unicode encoding form assigns each
> Unicode scalar value to a unique code unit sequence.",


> and further clarifies that "The mapping of the set of Unicode scalar values
> to the set of code unit sequences for a Unicode encoding form is
> one-to-one."


This is all consistent with saying that UTF-16 can't contain an isolated
surrogate.

*However, that only shows that a Unicode 16-bit string (D82) is not the same
as a UTF-16 String (D89), which has been pointed out previously.*
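
For readers who want the D79 mapping spelled out, here is a minimal Python sketch of the UTF-16 encoding form. The function name `encode_utf16` and the list-of-integers representation of code units are assumptions made for illustration; they are not taken from the standard or from this thread. The point it shows is that surrogate code points are not scalar values, so the encoding form never produces an isolated surrogate, even though D800 is still a perfectly good 16-bit code unit.

```python
def encode_utf16(scalar):
    """Map one Unicode scalar value to its UTF-16 code unit sequence (sketch of D79)."""
    # Scalar values are all code points except the surrogates D800..DFFF.
    if not (0 <= scalar <= 0x10FFFF) or 0xD800 <= scalar <= 0xDFFF:
        raise ValueError(f"U+{scalar:04X} is not a Unicode scalar value")
    if scalar < 0x10000:
        # BMP scalar values map to a single code unit.
        return [scalar]
    # Supplementary scalar values map to a high/low surrogate pair.
    offset = scalar - 0x10000
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]


print([hex(u) for u in encode_utf16(0x10000)])
```

Under these assumptions, `encode_utf16(0x10000)` yields the pair D800 DC00, while `encode_utf16(0xD800)` raises `ValueError`. That is the one-to-one mapping Wes quotes; it constrains UTF-16 strings, not Unicode 16-bit strings in general.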

Repeating the note under D89:

A Unicode string consisting of a well-formed UTF-16 code unit sequence is
said to be *in UTF-16*. Such a Unicode string is referred to as a *valid UTF-16
string*, or a *UTF-16 string* for short.

That is, every UTF-16 string is a Unicode 16-bit string, but *not* vice
versa.


   - "\u0061\ud800\udc00" is both a Unicode 16-bit string and a UTF-16
   string.
   - "\u0061\ud800" is a Unicode 16-bit string, but not a UTF-16
   string.
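
The D82/D89 distinction can be checked mechanically. The following Python sketch is only an illustration: the function names are hypothetical, code units are modeled as plain integers, and the unpaired sequence `[0x0061, 0xD800]` is an example chosen here rather than one quoted from the standard.

```python
def is_unicode_16bit_string(units):
    """D82 (sketch): any sequence of 16-bit code units qualifies."""
    return all(0 <= u <= 0xFFFF for u in units)


def is_utf16_string(units):
    """D89 (sketch): the code unit sequence must also be well-formed UTF-16."""
    if not is_unicode_16bit_string(units):
        return False
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A high surrogate must be followed immediately by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return False
        if 0xDC00 <= u <= 0xDFFF:
            # A low surrogate with no preceding high surrogate is ill-formed.
            return False
        i += 1
    return True


paired = [0x0061, 0xD800, 0xDC00]  # surrogate pair: well-formed
lone = [0x0061, 0xD800]            # isolated high surrogate: ill-formed
for units in (paired, lone):
    print(is_unicode_16bit_string(units), is_utf16_string(units))
```

Both sequences pass the first check, but only `paired` passes the second. Concatenating `lone` with `[0xDC00]` gives a sequence that passes both, which is why operations on Unicode 16-bit strings can be meaningful even when the operands are not in UTF-16.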

> This means that your original assertion -- that Unicode strings cannot
> contain the high surrogate code points, regardless of meaning -- is in fact
> correct.

That is incorrect.

> Which is unfortunate, as it means that we either
>    1. Allow non-Unicode strings in JS -- i.e. Strings composed of all
>    values in the set [0x0, 0x1FFFFF]
>    2. Keep making programmers pay the raw-UTF-16 representation tax
>    3. Break the String-as-uint16 pattern
> I still believe that #1 is the way forward, and that problem of
> round-tripping these values through the DOM is solvable.
> Wes
> --
> Wesley W. Garland
> Director, Product Development
> PageMail, Inc.
> +1 613 542 2787 x 102
> _______________________________________________
> es-discuss mailing list
> es-discuss at mozilla.org
> https://mail.mozilla.org/listinfo/es-discuss