Full Unicode strings strawman
Mark Davis ☕
mark at macchiato.com
Tue May 17 18:55:21 PDT 2011
That is incorrect. See below.
*— Il meglio è l’inimico del bene —*
On Tue, May 17, 2011 at 18:33, Wes Garland <wes at page.ca> wrote:
> On 17 May 2011 20:09, Boris Zbarsky <bzbarsky at mit.edu> wrote:
>> On 5/17/11 5:24 PM, Wes Garland wrote:
>>> Okay, I think we have to agree to disagree here. I believe my reading of
>>> the spec is correct.
>> Sorry, but no... how much more clear can the spec get?
> In the past, I have read it thus, pseudo BNF:
> UnicodeString => CodeUnitSequence // D80
> CodeUnitSequence => CodeUnit | CodeUnitSequence CodeUnit // D78
> CodeUnit => <anything in the current encoding form> // D77
So far, so good. In particular, d800 is a code unit for UTF-16, since it is
a code unit that can occur in some code unit sequence in UTF-16.
> Upon careful re-reading of this part of the specification, I see that D79
> is also important. It says that "A Unicode encoding form assigns each
> Unicode scalar value to a unique code unit sequence.",
> and further clarifies that "The mapping of the set of Unicode scalar values
> to the set of code unit sequences for a Unicode encoding form is
> one-to-one."
This is all consistent with saying that UTF-16 can't contain an isolated
surrogate.
*However, that only shows that a Unicode 16-bit string (D82) is not the same
as a UTF-16 String (D89), which has been pointed out previously.*
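
To make the D79 point concrete, here is a minimal sketch of the mapping from
a Unicode scalar value to its UTF-16 code unit sequence. The function name
encode_utf16 is hypothetical and used only for illustration. Surrogate code
points are rejected because they are not scalar values, which is why a
well-formed UTF-16 sequence never contains an isolated surrogate.

```python
def encode_utf16(scalar: int) -> list[int]:
    """Map one Unicode scalar value to its UTF-16 code unit sequence."""
    # Scalar values are U+0000..U+10FFFF excluding the surrogate range.
    if not (0 <= scalar <= 0x10FFFF) or 0xD800 <= scalar <= 0xDFFF:
        raise ValueError(f"U+{scalar:04X} is not a Unicode scalar value")
    if scalar < 0x10000:
        # BMP scalar values map to a single code unit.
        return [scalar]
    # Supplementary scalar values map to a high/low surrogate pair.
    offset = scalar - 0x10000
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]
```

Under this sketch, U+10000 maps to the pair d800 dc00, while asking for
U+D800 itself raises an error.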
Repeating the note under D89:
A Unicode string consisting of a well-formed UTF-16 code unit sequence is
said to be *in UTF-16*. Such a Unicode string is referred to as a *valid
UTF-16 string*, or a *UTF-16 string* for short.
That is, every UTF-16 string is a Unicode 16-bit string, but *not* vice
versa. For example:
- "\u0061\ud800\udc00" is both a Unicode 16-bit string and a UTF-16
string.
- "\u0061\ud800\u0062" is a Unicode 16-bit string, but not a UTF-16
string.
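
The difference between a Unicode 16-bit string (D82) and a UTF-16 string
(D89) can be sketched as a well-formedness check over 16-bit code units.
This is only an illustration: the function name is_utf16_string and the
choice to represent a string as a list of integers are assumptions, not
anything defined by the standard or by ECMAScript.

```python
def is_utf16_string(units: list[int]) -> bool:
    """Return True if the 16-bit code units form well-formed UTF-16."""
    i = 0
    while i < len(units):
        unit = units[i]
        if not 0 <= unit <= 0xFFFF:
            raise ValueError("not a 16-bit code unit")
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return False
        if 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate with no preceding high surrogate is isolated.
            return False
        i += 1
    return True


# Any list of 16-bit code units is a Unicode 16-bit string;
# only some of them are also UTF-16 strings.
print(is_utf16_string([0x0061, 0xD800, 0xDC00]))  # paired surrogates
print(is_utf16_string([0x0061, 0xD800, 0x0062]))  # isolated high surrogate
```

Every input accepted by the range check is a Unicode 16-bit string; the
function returns True only for the subset that is also a UTF-16 string.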
> This means that your original assertion -- that Unicode strings cannot
> contain the high surrogate code points, regardless of meaning -- is in fact
> correct.
That is incorrect.
> Which is unfortunate, as it means that we either
> 1. Allow non-Unicode strings in JS -- i.e. Strings composed of all
> values in the set [0x0, 0x1FFFFF]
> 2. Keep making programmers pay the raw-UTF-16 representation tax
> 3. Break the String-as-uint16 pattern
> I still believe that #1 is the way forward, and that problem of
> round-tripping these values through the DOM is solvable.
> Wesley W. Garland
> Director, Product Development
> PageMail, Inc.
> +1 613 542 2787 x 102