Full Unicode strings strawman

Mike Samuel mikesamuel at gmail.com
Mon May 16 13:37:57 PDT 2011


2011/5/16 Allen Wirfs-Brock <allen at wirfs-brock.com>:
>
> On May 16, 2011, at 12:28 PM, Mike Samuel wrote:
>
> > DOMString is defined at
> > http://www.w3.org/TR/DOM-Level-2-Core/core.html#ID-C74D1578 thus
> >
> >    Type Definition DOMString
> >    A DOMString is a sequence of 16-bit units.
> >
> > so how would round tripping a JS string through a DOM string work?
>
> Because the DOM spec says "Applications must encode DOMString using
> UTF-16 (defined in [Unicode] and Amendment 1 of [ISO/IEC 10646])," it must
> continue to do this.
> Values returned as DOM strings would be (21-bit char enhanced) ES strings
> where each string character contains a 16-bit UTF-16 code unit, just like
> they do now. Processing of such strings would have to do explicit surrogate
> pair processing, just like it does now.  However, such a string could be
> converted to a non-UTF-16-encoded string explicitly by user code or via a
> new built-in function such as:
>    String.UTF16Decode(aDOMStringValue)
> For passing strings from ES to a DOMString we have to do the inverse
> conversion. If explicit decoding was done as suggested above, then explicit
> UTF-16 encoding probably should be done too. But note that the internal
> representation of the string is likely to know whether an actual string
> contains any characters with codepoints > \uffff.  It may be reasonable to
> assume that strings without such characters are already DOMString encoded,
> but that strings with such characters should be automatically UTF-16 encoded
> when they are passed as DOMString values.
>
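For concreteness, here is a sketch of what such a decoding function might
do, assuming the strawman's String.fromCharCode accepts full 21-bit
codepoint values (the name String.UTF16Decode is from the message above;
the body is only an illustration, not a proposed implementation):

    function utf16Decode(domString) {
      var codepoints = [];
      for (var i = 0; i < domString.length; i++) {
        var unit = domString.charCodeAt(i);
        // A lead surrogate followed by a trail surrogate combines into
        // a single supplementary codepoint.
        if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < domString.length) {
          var next = domString.charCodeAt(i + 1);
          if (next >= 0xDC00 && next <= 0xDFFF) {
            codepoints.push(
                0x10000 + ((unit - 0xD800) << 10) + (next - 0xDC00));
            i++;
            continue;
          }
        }
        codepoints.push(unit);  // unpaired surrogates pass through as-is
      }
      // Relies on the strawman's fromCharCode accepting values > 0xFFFF;
      // ES5's fromCharCode truncates its arguments to 16 bits.
      return String.fromCharCode.apply(null, codepoints);
    }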
> > How would
> >
> >    var oneSupplemental = "\U00010000";

> I don't think I understand your literal notation. \U is a 32-bit character
> value?  In whose implementation?

Sorry, please read this as
    var oneSupplemental = String.fromCharCode(0x10000);


> >    alert(oneSupplemental.length);  //  alerts 1
> >
> I'll take your word for this

If I understand correctly, a string containing the single codepoint
U+10000 should have length 1.

    "The length of a String is the number of elements (i.e.,
16-bit\b\b\b\b\b\b 21-bit values) within it."
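To make the contrast with today's semantics concrete (the second line
assumes the strawman's 21-bit fromCharCode):

    "\uD800\uDC00".length === 2                // ES5 today: two elements
    String.fromCharCode(0x10000).length === 1  // strawman: one element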

> >    var utf16Encoded = encodeUTF16(oneSupplemental);
> >    alert(utf16Encoded.length);  //  alerts 2
>
> yes
>
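For concreteness, the encodeUTF16 used above might be sketched like this,
assuming the strawman's charCodeAt returns full 21-bit codepoint values
(again an illustration, not a proposed API):

    function encodeUTF16(str) {
      var units = [];
      for (var i = 0; i < str.length; i++) {
        var cp = str.charCodeAt(i);  // a full 21-bit codepoint value
        if (cp > 0xFFFF) {
          // Split a supplementary codepoint into a surrogate pair.
          cp -= 0x10000;
          units.push(0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF));
        } else {
          units.push(cp);
        }
      }
      return String.fromCharCode.apply(null, units);
    }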
> >    var textNode = document.createTextNode(utf16Encoded);
> >    alert(textNode.nodeValue.length);   // alerts ?
>
> 2
>
> > Does the DOM need to represent utf16Encoded internally so that it can
> > report 2 as the length on fetch of nodeValue?
>
> However the DOM represents DOMString values internally, to conform to
> the DOM spec it must act as if it represents them using UTF-16.

Ok.  This seems to present two options:
(1) Break the internet by binding DOMStrings to a JavaScript host type
and not to the JavaScript string type.
(2) DOMStrings never contain supplemental codepoints.

So for

    var roundTripped = document.createTextNode(oneSupplemental).nodeValue;

either

    typeof roundTripped !== "string"

or

    roundTripped.length != oneSupplemental.length



> > If so, how can it
> > represent that for systems that use a UTF-16 internal representation
> > for DOMString?
>
> Let me know if I haven't already answered this.

You might have.  If you reject my assertion about option 2 above, then
to clarify:
The UTF-16 representation of the codepoint U+10000 is the code-unit pair
U+D800 U+DC00.
The UTF-16 representation of the codepoint U+D800 is the single code-unit
U+D800, and similarly for U+DC00.

In a DOMString implementation that uses UTF-16 under the hood, how can the
codepoint sequence U+D800 U+DC00 be distinguished from the single codepoint
U+10000?
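
Using the sketches above (hypothetical code, same caveats), the two cases
collapse to the identical code-unit sequence:

    var twoCodepoints = String.fromCharCode(0xD800, 0xDC00);   // length 2
    var oneCodepoint  = String.fromCharCode(0x10000);          // length 1
    encodeUTF16(twoCodepoints) === encodeUTF16(oneCodepoint);  // true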

> Allen
>

