Full Unicode strings strawman
mikesamuel at gmail.com
Mon May 16 13:37:57 PDT 2011
2011/5/16 Allen Wirfs-Brock <allen at wirfs-brock.com>:
> On May 16, 2011, at 12:28 PM, Mike Samuel wrote:
> > DOMString is defined at
> > http://www.w3.org/TR/DOM-Level-2-Core/core.html#ID-C74D1578 thus
> > Type Definition DOMString
> > A DOMString is a sequence of 16-bit units.
> > So how would round-tripping a JS string through a DOMString work?
> Because the DOM spec says: "Applications must encode DOMString using
> UTF-16 (defined in [Unicode] and Amendment 1 of [ISO/IEC 10646])," it must
> continue to do this.
> Values returned as DOM strings would be (21-bit char enhanced) ES strings
> where each string character contains a 16-bit UTF-16 code unit, just like
> they do now. Processing of such strings would have to do explicit surrogate
> pair processing, just like it does now. However, such a string could be
> converted to a non-UTF-16 encoded string by explicit user code or via a new
> built-in function such as:
> For passing strings from ES to a DOMString we have to do the inverse
> conversion. If explicit decoding was done as suggested above, then explicit
> UTF-16 encoding probably should be done. But note that the internal
> representation of the string is likely to know whether an actual string
> contains any characters with codepoints > \uffff. It may be reasonable to
> assume that strings without such characters are already DOMString encoded,
> but that strings with such characters should be automatically UTF-16 encoded
> when they are passed as DOMString values.
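For concreteness, here is a rough sketch of the explicit conversions I
understand you to be describing. The names encodeUTF16 and decodeUTF16 are
mine, and I'm assuming the strawman's semantics, where a string element
holds a full 21-bit codepoint and charCodeAt/fromCharCode deal in such
elements:

    // Replace each supplemental codepoint with a UTF-16 surrogate pair.
    function encodeUTF16(s) {
      var units = [];
      for (var i = 0; i < s.length; i++) {
        var cp = s.charCodeAt(i);
        if (cp > 0xFFFF) {  // supplemental: split into high and low surrogates
          cp -= 0x10000;
          units.push(0xD800 | (cp >> 10), 0xDC00 | (cp & 0x3FF));
        } else {
          units.push(cp);
        }
      }
      return String.fromCharCode.apply(null, units);
    }

    // Combine each surrogate pair back into one supplemental codepoint.
    function decodeUTF16(s) {
      var cps = [];
      for (var i = 0; i < s.length; i++) {
        var hi = s.charCodeAt(i);
        var lo = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
        if (0xD800 <= hi && hi <= 0xDBFF && 0xDC00 <= lo && lo <= 0xDFFF) {
          cps.push(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00));
          i++;  // consumed two code units
        } else {
          cps.push(hi);  // BMP element or lone surrogate, passed through
        }
      }
      return String.fromCharCode.apply(null, cps);
    }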
> > How would
> > var oneSupplemental = "\U00010000";
> I don't think I understand your literal notation. \U is a 32-bit character
> value? In whose implementation?
Sorry, please read this as
var oneSupplemental = String.fromCharCode(0x10000);
> > alert(oneSupplemental.length); // alerts 1
> I'll take your word for this
If I understand correctly, a string containing the single codepoint U+10000
should have length 1, per
"The length of a String is the number of elements (i.e.,
16-bit\b\b\b\b\b\b 21-bit values) within it."
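Concretely, the behavior I'd expect would be (hypothetically, under the
strawman):

    var s = String.fromCharCode(0x10000); // one 21-bit element
    s.length === 1;              // true: length counts elements, not UTF-16 code units
    s.charCodeAt(0) === 0x10000; // true: element values may exceed 0xFFFF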
> > var utf16Encoded = encodeUTF16(oneSupplemental);
> > alert(utf16Encoded.length); // alerts 2
> > var textNode = document.createTextNode(utf16Encoded);
> > alert(textNode.nodeValue.length); // alerts ?
> > Does the DOM need to represent utf16Encoded internally so that it can
> > report 2 as the length on fetch of nodeValue?
> However the DOM represents DOMString values internally, to conform to the
> DOM spec it must act as if it is representing them using UTF-16.
Ok.  This seems to present two options:
(1) DOMStrings can contain supplemental codepoints, or
(2) DOMStrings never contain supplemental codepoints.
So under either option, given

    var roundTripped = document.createTextNode(oneSupplemental).nodeValue

either typeof roundTripped !== "string" or
roundTripped.length != oneSupplemental.length.
> > If so, how can it
> > represent that for systems that use a UTF-16 internal representation
> > for DOMString?
> Let me know if I haven't already answered this.
You might have.  If you reject my assertion about option 2 above, then
there is still an ambiguity:
The UTF-16 representation of the codepoint U+10000 is the code-unit pair
U+D800 U+DC00.
The UTF-16 representation of the codepoint U+D800 is the single code-unit
U+D800, and similarly for U+DC00.
How can the codepoint sequence U+D800 U+DC00 be distinguished, in a
DOMString implementation that uses UTF-16 under the hood, from the single
codepoint U+10000?
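To make the collision concrete with my sketch above (encodeUTF16 is the
hypothetical helper defined earlier, and strawman semantics are assumed):

    var one  = String.fromCharCode(0x10000);        // one supplemental codepoint, length 1
    var pair = String.fromCharCode(0xD800, 0xDC00); // two lone surrogates, length 2
    // Both map to the same UTF-16 code units, D800 DC00, so once stored in a
    // UTF-16-backed DOMString they would come back indistinguishable.
    encodeUTF16(one) === pair;                      // true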