Full Unicode strings strawman

Shawn Steele Shawn.Steele at microsoft.com
Wed May 18 14:02:27 PDT 2011


#1 can’t happen.  There’s no way to get such input legally: any input must be encoded in some form, and Unicode clearly states that lone surrogate values like D800 are illegal in every one of the encodings.
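To make the lone-surrogate point concrete, here is a minimal sketch (not part of the original message) of a well-formedness check over UTF-16 code units. The helper name `findLoneSurrogate` and the plain-array input are assumptions made for illustration only.

```javascript
// Hypothetical helper: returns the index of the first unpaired surrogate
// in an array of 16-bit code units, or -1 if the sequence is well-formed UTF-16.
function findLoneSurrogate(units) {
  for (let i = 0; i < units.length; i++) {
    const u = units[i];
    const isHigh = u >= 0xD800 && u <= 0xDBFF;
    const isLow = u >= 0xDC00 && u <= 0xDFFF;
    if (isHigh) {
      const next = units[i + 1]; // undefined at end of input, so the test below fails
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // a valid pair: skip the low surrogate
        continue;
      }
      return i; // high surrogate with no low surrogate after it
    }
    if (isLow) return i; // low surrogate with no high surrogate before it
  }
  return -1;
}
```

Under this check a sequence ending in a bare D800 is rejected, while the pair D83D DE00 is accepted.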

Also, none of the inputs are really UTF-32.  We can munge it from UTF-8 or UTF-16 HTML to something else, but the developer still has it as UTF-8 or UTF-16, so this isn’t much of a burden for them.

But we can still allow code point notation (U+10FFFF), which mitigates most of the problem.

-Shawn

From: es-discuss-bounces at mozilla.org On Behalf Of Wes Garland
Sent: Tuesday, May 17, 2011 6:34 PM
To: Boris Zbarsky
Cc: es-discuss at mozilla.org
Subject: Re: Full Unicode strings strawman

On 17 May 2011 20:09, Boris Zbarsky <bzbarsky at mit.edu> wrote:
> On 5/17/11 5:24 PM, Wes Garland wrote:
>> Okay, I think we have to agree to disagree here. I believe my reading of
>> the spec is correct.
>
> Sorry, but no...  how much more clear can the spec get?

In the past, I have read it thus, pseudo BNF:

UnicodeString => CodeUnitSequence // D80
CodeUnitSequence => CodeUnit | CodeUnitSequence CodeUnit // D78
CodeUnit => <anything in the current encoding form> // D77

Upon careful re-reading of this part of the specification, I see that D79 is also important.  It says that "A Unicode encoding form assigns each Unicode scalar value to a unique code unit sequence.", and further clarifies that "The mapping of the set of Unicode scalar values to the set of code unit sequences for a Unicode encoding form is one-to-one."
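As an illustration of the one-to-one mapping described in D79, here is a rough sketch (not from the original thread) of the UTF-16 encoding form written as a function from scalar values to code unit sequences. The name `encodeUtf16` is hypothetical. Surrogate code points are not scalar values, so the function has nothing to map them to and rejects them.

```javascript
// Hypothetical sketch of the UTF-16 encoding form: each Unicode scalar value
// maps to exactly one sequence of one or two 16-bit code units.
function encodeUtf16(scalar) {
  if (!Number.isInteger(scalar) || scalar < 0 || scalar > 0x10FFFF) {
    throw new RangeError("not a Unicode code point");
  }
  if (scalar >= 0xD800 && scalar <= 0xDFFF) {
    // Surrogates are code points but not scalar values.
    throw new RangeError("surrogate code points are not scalar values");
  }
  if (scalar < 0x10000) {
    return [scalar]; // BMP: a single code unit
  }
  const offset = scalar - 0x10000; // 20 bits, split 10/10 across the pair
  return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)];
}

// encodeUtf16(0x1F600) gives [0xD83D, 0xDE00]
```

Because the mapping only ever emits surrogates in high/low pairs, a lone D800 is not the image of any scalar value.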

This means that your original assertion -- that Unicode strings cannot contain the high surrogate code points, regardless of meaning -- is in fact correct.

Which is unfortunate, as it means that we either

  1.  Allow non-Unicode strings in JS -- i.e. strings composed of all values in the set [0x0, 0x1FFFFF]
  2.  Keep making programmers pay the raw-UTF-16 representation tax
  3.  Break the String-as-uint16 pattern

I still believe that #1 is the way forward, and that the problem of round-tripping these values through the DOM is solvable.
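For readers unfamiliar with the "raw-UTF-16 representation tax" in option 2, a small sketch follows (not from the original thread). It uses `codePointAt` and string iteration by code point, which were added to the language in ES2015, after this discussion took place. The helper name `codePointAtPosition` is hypothetical.

```javascript
// "a", U+1F600 (written as a surrogate pair), "b"
const s = "a\uD83D\uDE00b";

const codeUnitCount = s.length;              // 4: length counts uint16 code units
const codePointCount = Array.from(s).length; // 3: iteration walks code points

// Hypothetical helper: fetch the n-th code point by walking the string.
// Indexing by code point is O(n) because the storage is indexed by code unit.
function codePointAtPosition(str, n) {
  let i = 0;
  for (const ch of str) {
    if (i === n) return ch.codePointAt(0);
    i++;
  }
  return undefined;
}
```

Here `s.charCodeAt(1)` yields the bare high surrogate 0xD83D, whereas `codePointAtPosition(s, 1)` yields 0x1F600. Bridging that gap by hand is the cost the list above refers to.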

Wes

--
Wesley W. Garland
Director, Product Development
PageMail, Inc.
+1 613 542 2787 x 102