Full Unicode strings strawman

Wes Garland wes at page.ca
Tue May 17 18:33:42 PDT 2011


On 17 May 2011 20:09, Boris Zbarsky <bzbarsky at mit.edu> wrote:

> On 5/17/11 5:24 PM, Wes Garland wrote:
>
>> Okay, I think we have to agree to disagree here. I believe my reading of
>> the spec is correct.
>>
>
> Sorry, but no...  how much more clear can the spec get?
>
>
In the past, I have read it thus, in pseudo-BNF:

UnicodeString => CodeUnitSequence // D80
CodeUnitSequence => CodeUnit | CodeUnitSequence CodeUnit // D78
CodeUnit => <anything in the current encoding form> // D77

Upon careful re-reading of this part of the specification, I see that D79 is
also important.  It says that "A Unicode encoding form assigns each Unicode
scalar value to a unique code unit sequence", and further clarifies that
"The mapping of the set of Unicode scalar values to the set of code unit
sequences for a Unicode encoding form is one-to-one."

This means that your original assertion -- that Unicode strings cannot
contain the high surrogate code points, regardless of meaning -- is in fact
correct.
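
To make the distinction concrete, here is a minimal sketch of the two checks
involved. The helper names isScalarValue and isWellFormedUTF16 are hypothetical,
invented for illustration; they are not part of any spec or library. The sketch
assumes that "Unicode scalar value" means any code point outside the surrogate
range 0xD800-0xDFFF, and that a well-formed UTF-16 sequence contains surrogates
only as high+low pairs.

```javascript
// Hypothetical helpers for illustration only.

// A Unicode scalar value is any code point except the surrogates.
function isScalarValue(cp) {
  return Number.isInteger(cp) && cp >= 0 && cp <= 0x10FFFF &&
         !(cp >= 0xD800 && cp <= 0xDFFF);
}

// True when every surrogate code unit in `s` belongs to a high+low pair,
// i.e. the string is a well-formed UTF-16 code unit sequence.
function isWellFormedUTF16(s) {
  for (let i = 0; i < s.length; i++) {
    const u = s.charCodeAt(i);
    if (u >= 0xD800 && u <= 0xDBFF) {
      const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next < 0xDC00 || next > 0xDFFF) return false; // lone high surrogate
      i++; // skip the low half of the pair
    } else if (u >= 0xDC00 && u <= 0xDFFF) {
      return false; // lone low surrogate
    }
  }
  return true;
}
```

Under these assumptions, isScalarValue(0xD800) is false, so no encoding form
assigns it a code unit sequence, and a JS string such as "\uD834" on its own
fails isWellFormedUTF16 even though the language happily stores it.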

Which is unfortunate, as it means that we must either

   1. Allow non-Unicode strings in JS -- i.e. strings composed of any values
   in the range [0x0, 0x10FFFF], surrogate code points included
   2. Keep making programmers pay the raw-UTF-16 representation tax
   3. Break the String-as-uint16 pattern
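
As a rough illustration of the tax in option 2, the sketch below shows what a
programmer has to do today to see code points rather than UTF-16 code units.
toCodePoints is a hypothetical helper written for this message, not an existing
API, and passing unpaired surrogates through unchanged is just one possible
policy.

```javascript
// U+1D306 written as its UTF-16 surrogate pair.
const s = "\uD834\uDF06";

// Hypothetical helper: decode a JS string into an array of code points.
function toCodePoints(str) {
  const out = [];
  for (let i = 0; i < str.length; i++) {
    const hi = str.charCodeAt(i);
    if (hi >= 0xD800 && hi <= 0xDBFF && i + 1 < str.length) {
      const lo = str.charCodeAt(i + 1);
      if (lo >= 0xDC00 && lo <= 0xDFFF) {
        // Combine the pair into one supplementary code point.
        out.push(((hi - 0xD800) << 10) + (lo - 0xDC00) + 0x10000);
        i++; // consume the low surrogate as well
        continue;
      }
    }
    out.push(hi); // BMP code unit, or an unpaired surrogate passed through
  }
  return out;
}
```

Here s.length is 2 although the string holds a single character, while
toCodePoints(s) yields one element, 0x1D306. Every length, index, and slice
operation needs this kind of wrapper if the raw representation stays exposed.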

I still believe that #1 is the way forward, and that the problem of
round-tripping these values through the DOM is solvable.

Wes

-- 
Wesley W. Garland
Director, Product Development
PageMail, Inc.
+1 613 542 2787 x 102