Full Unicode strings strawman
Mark Davis ☕
mark at macchiato.com
Tue May 17 18:55:21 PDT 2011
That is incorrect. See below.
Mark
*— Il meglio è l’inimico del bene —*
On Tue, May 17, 2011 at 18:33, Wes Garland <wes at page.ca> wrote:
> On 17 May 2011 20:09, Boris Zbarsky <bzbarsky at mit.edu> wrote:
>
>> On 5/17/11 5:24 PM, Wes Garland wrote:
>>
>>> Okay, I think we have to agree to disagree here. I believe my reading of
>>> the spec is correct.
>>>
>>
>> Sorry, but no... how much more clear can the spec get?
>>
>>
> In the past, I have read it thus, pseudo BNF:
>
> UnicodeString => CodeUnitSequence // D80
> CodeUnitSequence => CodeUnit | CodeUnitSequence CodeUnit // D78
> CodeUnit => <anything in the current encoding form> // D77
>
So far, so good. In particular, d800 is a code unit for UTF-16, since it is
a code unit that can occur in some code unit sequence in UTF-16.
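
As a small illustration of that point (a sketch only; `codeUnits` is a hypothetical helper name, not anything from the spec), a JS string can be read back as its raw 16-bit code units, and d800 appears as an ordinary code unit whether or not it is paired:

```javascript
// Hypothetical helper: list the 16-bit code units of a JS string.
function codeUnits(s) {
  const units = [];
  for (let i = 0; i < s.length; i++) {
    units.push(s.charCodeAt(i)); // charCodeAt returns the raw code unit
  }
  return units;
}

// d800 on its own is one code unit; paired with dc00 it is two.
const lone = codeUnits("\ud800");
const paired = codeUnits("\ud800\udc00");
```

Both calls succeed; whether the resulting sequence is well-formed UTF-16 is a separate question.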
>
> Upon careful re-reading of this part of the specification, I see that D79
> is also important. It says that "A Unicode encoding form assigns each
> Unicode scalar value to a unique code unit sequence.",
>
True.
> and further clarifies that "The mapping of the set of Unicode scalar values
> to the set of code unit sequences for a Unicode encoding form is
> one-to-one."
>
True.
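
To make the mapping concrete, here is a minimal sketch of the UTF-16 encoding form as a function from one scalar value to its code unit sequence. The name `encodeUTF16` is hypothetical, and the sketch assumes the input is a single integer. Surrogate code points are rejected because they are not scalar values, so the mapping never produces an isolated d800:

```javascript
// Hypothetical helper: map one Unicode scalar value to its UTF-16 code units.
function encodeUTF16(scalar) {
  if (!Number.isInteger(scalar) || scalar < 0 || scalar > 0x10FFFF) {
    throw new RangeError("not a code point: " + scalar);
  }
  if (scalar >= 0xD800 && scalar <= 0xDFFF) {
    // Surrogate code points are excluded from the scalar values.
    throw new RangeError("surrogate code points are not scalar values");
  }
  if (scalar < 0x10000) {
    return [scalar]; // BMP scalar value: a single code unit
  }
  const offset = scalar - 0x10000; // 20 bits, split 10/10 across the pair
  return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)];
}
```

Every output of this function is either one non-surrogate unit or a lead/trail pair, which is why the one-to-one mapping says nothing about what other 16-bit sequences may exist.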
This is all consistent with saying that UTF-16 can't contain an isolated
d800.
*However, that only shows that a Unicode 16-bit string (D82) is not the same
as a UTF-16 String (D89), which has been pointed out previously.*

Repeating the note under D89:

A Unicode string consisting of a well-formed UTF-16 code unit sequence is
said to be *in UTF-16*. Such a Unicode string is referred to as a *valid
UTF-16 string*, or a *UTF-16 string* for short.

That is, every UTF-16 string is a Unicode 16-bit string, but *not* vice
versa.

Examples:
- "\u0061\ud800\udc00" is both a Unicode 16-bit string and a UTF-16
  string.
- "\u0061\ud800" is a Unicode 16-bit string, but not a UTF-16 string,
  because the d800 is not followed by a trail surrogate.
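
The distinction can be checked mechanically. The following is a sketch, not a normative definition: `isWellFormedUTF16` is a hypothetical name, and it assumes the only requirement is that every lead surrogate (d800..dbff) is immediately followed by a trail surrogate (dc00..dfff) and that no trail surrogate appears on its own.

```javascript
// Hypothetical helper: true if the string's code units form well-formed UTF-16.
function isWellFormedUTF16(s) {
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // Lead surrogate: the next unit must be a trail surrogate.
      const next = i + 1 < s.length ? s.charCodeAt(i + 1) : -1;
      if (next < 0xDC00 || next > 0xDFFF) return false;
      i++; // skip the trail surrogate just consumed
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      // Trail surrogate with no preceding lead surrogate.
      return false;
    }
  }
  return true;
}

const wellFormed = isWellFormedUTF16("\u0061\ud800\udc00"); // paired surrogates
const illFormed = isWellFormedUTF16("\u0061\ud800");        // unpaired lead surrogate
```

Any JS string is accepted as input, since any JS string is a Unicode 16-bit string; the function only reports whether it is also a UTF-16 string.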
> This means that your original assertion -- that Unicode strings cannot
> contain the high surrogate code points, regardless of meaning -- is in fact
> correct.
>
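That is incorrect. A Unicode string (D80), and so a Unicode 16-bit string
(D82), can contain an isolated surrogate such as d800; only a UTF-16
string (D89) cannot.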
>
> Which is unfortunate, as it means that we either
>
> 1. Allow non-Unicode strings in JS -- i.e. Strings composed of all
> values in the set [0x0, 0x1FFFFF]
> 2. Keep making programmers pay the raw-UTF-16 representation tax
> 3. Break the String-as-uint16 pattern
>
> I still believe that #1 is the way forward, and that problem of
> round-tripping these values through the DOM is solvable.
>
> Wes
>
> --
> Wesley W. Garland
> Director, Product Development
> PageMail, Inc.
> +1 613 542 2787 x 102
>