Full Unicode strings strawman
wes at page.ca
Tue May 17 18:33:42 PDT 2011
On 17 May 2011 20:09, Boris Zbarsky <bzbarsky at mit.edu> wrote:
> On 5/17/11 5:24 PM, Wes Garland wrote:
>> Okay, I think we have to agree to disagree here. I believe my reading of
>> the spec is correct.
> Sorry, but no... how much more clear can the spec get?
In the past, I have read it thus, in pseudo-BNF:
UnicodeString => CodeUnitSequence // D80
CodeUnitSequence => CodeUnit | CodeUnitSequence CodeUnit // D78
CodeUnit => <anything in the current encoding form> // D77
Upon careful re-reading of this part of the specification, I see that D79 is
also important. It says that "A Unicode encoding form assigns each Unicode
scalar value to a unique code unit sequence.", and further clarifies that
"The mapping of the set of Unicode scalar values to the set of code unit
sequences for a Unicode encoding form is one-to-one."
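To make the D79 point concrete, here is a minimal sketch of a UTF-16 encoder that is defined only over Unicode scalar values, i.e. code points excluding the surrogate range U+D800..U+DFFF. The function names isScalarValue and encodeUtf16 are hypothetical, chosen for illustration; they come from neither the Unicode Standard nor any library.

```javascript
// Hypothetical helper names, for illustration only.

// A Unicode scalar value is any code point in [0x0, 0x10FFFF]
// except the surrogate code points U+D800..U+DFFF.
function isScalarValue(cp) {
  return Number.isInteger(cp) && cp >= 0 && cp <= 0x10FFFF &&
         !(cp >= 0xD800 && cp <= 0xDFFF);
}

// Map one scalar value to its UTF-16 code unit sequence.
// Surrogate code points have no image under this mapping, so they are rejected.
function encodeUtf16(cp) {
  if (!isScalarValue(cp)) {
    throw new RangeError("not a Unicode scalar value: " + cp);
  }
  if (cp < 0x10000) {
    return [cp];                      // one code unit for the BMP
  }
  const offset = cp - 0x10000;        // 20 bits, split 10/10
  return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)];
}
```

Under this reading, a sequence containing an unpaired 0xD800 is not the image of any sequence of scalar values, so it falls outside what the encoding form produces.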
This means that your original assertion -- that Unicode strings cannot
contain the high surrogate code points, regardless of meaning -- is in fact
correct.
Which is unfortunate, as it means that we either
1. Allow non-Unicode strings in JS -- i.e. Strings composed of all values
in the set [0x0, 0x1FFFFF]
2. Keep making programmers pay the raw-UTF-16 representation tax
3. Break the String-as-uint16 pattern
I still believe that #1 is the way forward, and that the problem of
round-tripping these values through the DOM is solvable.
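As a rough illustration of what the "raw-UTF-16 representation tax" in #2 looks like today, the sketch below shows a string whose length counts uint16 elements rather than characters, plus a hand-written code point counter. countCodePoints is a hypothetical helper written for this example, not an existing API.

```javascript
// U+1F600 lies outside the BMP, so it occupies two uint16 elements.
const s = String.fromCharCode(0xD83D, 0xDE00);

// Hypothetical helper: count code points by pairing surrogates by hand.
function countCodePoints(str) {
  let count = 0;
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    // A high surrogate followed by a low surrogate forms one code point.
    if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < str.length) {
      const next = str.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // skip the low surrogate already accounted for
      }
    }
    count++;
  }
  return count;
}

// JS also accepts an unpaired surrogate as a string element.
const lone = String.fromCharCode(0xD800);
```

Here s.length is 2 while countCodePoints(s) is 1, and lone is a perfectly legal JS string even though it is not well-formed UTF-16. Every program that cares about characters has to carry logic like this around.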
Wesley W. Garland
Director, Product Development
+1 613 542 2787 x 102