Full Unicode strings strawman
allen at wirfs-brock.com
Mon May 16 14:53:31 PDT 2011
On May 16, 2011, at 1:37 PM, Mike Samuel wrote:
> 2011/5/16 Allen Wirfs-Brock <allen at wirfs-brock.com>:
>>> How would
>>> var oneSupplemental = "\U00010000";
>> I don't think I understand your literal notation. \U is a 32-bit character
>> value? In whose implementation?
> Sorry, please read this as
> var oneSupplemental = String.fromCharCode(0x10000);
In my proposal you would have to say String.fromCodepoint(0x10000);
In ES5 String.fromCharCode(0x10000) produced the same string as "\0". That remains the case in my proposal.
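The ES5 behaviour can be checked directly, since String.fromCharCode applies ToUint16 to each argument and 0x10000 therefore wraps to 0. A minimal sketch (the variable names are illustrative only):

```javascript
// String.fromCharCode applies ToUint16 to each argument,
// so 0x10000 is reduced modulo 0x10000 and becomes 0.
var wrapped = String.fromCharCode(0x10000);

var sameAsNul = (wrapped === "\0");        // compare against the NUL string
var wrappedLength = wrapped.length;        // number of 16-bit elements
var wrappedUnit = wrapped.charCodeAt(0);   // value of the only element
```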
>>> alert(oneSupplemental.length); // alerts 1
>> I'll take your word for this
> If I understand, a string containing the single codepoint U+10000
> should have length 1.
> "The length of a String is the number of elements (i.e.,
> 16-bit\b\b\b\b\b\b 21-bit values) within it."
Yes, it's 1. My gruff comment referred to my not being sure of your literal notation.
>>> var utf16Encoded = encodeUTF16(oneSupplemental);
>>> alert(utf16Encoded.length); // alerts 2
>>> var textNode = document.createTextNode(utf16Encoded);
>>> alert(textNode.nodeValue.length); // alerts ?
>>> Does the DOM need to represent utf16Encoded internally so that it can
>>> report 2 as the length on fetch of nodeValue?
>> However the DOM represents DOMString values internally, to conform to
>> the DOM spec. it must act as if it is representing them using UTF-16.
> Ok. This seems to present two options:
Not sure why this would break the internet. At the implementation level, a key point of my proposal is that implementations can (and even today some do) have multiple different internal representations for strings. These internal representation differences simply are not exposed to the JS program, except possibly in terms of measurable performance differences.
> (2) DOMStrings never contain supplemental codepoints.
That's how DOMStrings are currently defined, and I'm not proposing to change this. Adding full Unicode DOMStrings to the DOM spec. seems like a task for the W3C.
> So for either alert(typeof
> var roundTripped = document.createTextNode(oneSupplemental).nodeValue
> typeof roundTripped !== "string"
Not really, I'm perfectly happy to allow the DOM to continue to report the type of a DOMString as 'string'. It's no different from a user-constructed string that may or may not contain a UTF-16 character sequence, depending upon what the user code does.
> roundTripped.length != oneSupplemental.length
Yes, this may be the case, but only for new code that explicitly builds oneSupplemental to contain a supplemental character using \u+xxxxxx or String.fromCodepoint or some other new function. All existing valid code only produces strings with codepoints limited to 0xffff.
>>> If so, how can it
>>> represent that for systems that use a UTF-16 internal representation
>>> for DOMString?
>> Let me know if I haven't already answered this.
> You might have. If you reject my assertion about option 2 above, then
> to clarify,
> The UTF-16 representation of codepoint U+10000 is the code-unit pair
> U+D8000 U+DC000.
> The UTF-16 representation of codepoint U+D8000 is the single code-unit
> U+D8000 and similarly for U+DC00.
> How can the codepoints U+D800 U+DC00 be distinguished in a DOMString
> implementation that uses UTF-16 under the hood from the codepoint
I think you have an extra 0 at a couple of places above...
A DOMString is defined by the DOM spec. to consist of 16-bit elements that are to be interpreted as a UTF-16 encoding of Unicode characters. It doesn't matter what implementation-level representation is used for the string; the indexable positions within a DOMString are restricted to 16-bit values. At the representation level each position could even be represented by a 32-bit cell and it wouldn't matter. To be valid, DOMString element values must be in the range 0-0xffff.
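For reference, the UTF-16 arithmetic under discussion can be sketched as follows. The helper name toUTF16Units is hypothetical and is not part of the proposal or of any DOM API. It assumes an integer argument, and it passes surrogate values through as single elements, matching the reading used in this thread rather than strict UTF-16 well-formedness.

```javascript
// Hypothetical helper: returns the 16-bit elements that UTF-16 uses
// for one code point in the range 0..0x10FFFF.
function toUTF16Units(codePoint) {
  if (codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError("not a Unicode code point: " + codePoint);
  }
  if (codePoint <= 0xFFFF) {
    // Values up to 0xFFFF (lone surrogates included) occupy one element.
    return [codePoint];
  }
  var offset = codePoint - 0x10000;       // 20 significant bits remain
  var high = 0xD800 + (offset >> 10);     // top 10 bits
  var low = 0xDC00 + (offset & 0x3FF);    // bottom 10 bits
  return [high, low];
}

var pairFor10000 = toUTF16Units(0x10000); // the two elements 0xD800, 0xDC00
var loneD800 = toUTF16Units(0xD800);      // the single element 0xD800
```

Every element returned is in the range 0-0xffff, which is what the DOM spec. requires of a DOMString element.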
I think you are unnecessarily mixing up the string semantics defined by the language, encodings that might be used in implementing the semantics, and application level processing of those strings.
To simplify things, just think of an ES string as if it were an array, each element of which could contain an arbitrary integer value. If we have such an array, like [0xd800, 0xdc00], at the language semantics level this is a two-element array containing two well-specified values. At the language implementation level there are all sorts of representations that might be used; maybe the implementation Huffman encodes the elements... How the application processes that array is completely up to the application. It may treat the array simply as two integer values. It may treat each element as a 21-bit value encoding a Unicode codepoint and logically consider the array to be a Unicode string of length 2. It may consider each element to be a 16-bit value and interpret sequences of values as UTF-16 string encodings. In that case, it could consider the array to represent a string of logical length 1.
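A rough sketch of the two application-level readings of that array. The function names are made up for illustration, and plain arrays of integers stand in for strings:

```javascript
var elements = [0xD800, 0xDC00];

// Reading 1: every element is itself a code point,
// so the logical length is just the element count.
function lengthAsCodePoints(arr) {
  return arr.length;
}

// Reading 2: elements are UTF-16 code units, so a high surrogate
// immediately followed by a low surrogate counts as one character.
function lengthAsUTF16(arr) {
  var count = 0;
  for (var i = 0; i < arr.length; i++) {
    var isHigh = arr[i] >= 0xD800 && arr[i] <= 0xDBFF;
    var nextIsLow = i + 1 < arr.length &&
                    arr[i + 1] >= 0xDC00 && arr[i + 1] <= 0xDFFF;
    if (isHigh && nextIsLow) {
      i++; // skip the low half of the pair
    }
    count++;
  }
  return count;
}

var asCodePoints = lengthAsCodePoints(elements); // 2
var asUTF16 = lengthAsUTF16(elements);           // 1
```

The array is the same in both cases; only the application's interpretation differs.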
This is no different from what people do today with 16-bit char JS strings. Many people just treat them as strings of BMP characters and ignore the possibility of supplemental characters or UTF-16 encodings. Other people (particularly when dealing with DOMStrings) treat strings as code units of a UTF-16 encoding. They need to use more complex string processing algorithms to deal with logical Unicode characters.
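As an illustration of that extra complexity, here is one way such code might walk a present-day 16-bit JS string by logical character. This is a sketch only; codePointsOf is a hypothetical name, and unpaired surrogates are simply passed through unchanged.

```javascript
// Hypothetical helper: decodes the 16-bit elements of a JS string into
// an array of code points, combining well-formed surrogate pairs.
function codePointsOf(str) {
  var result = [];
  for (var i = 0; i < str.length; i++) {
    var unit = str.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < str.length) {
      var next = str.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        // Recombine the two 10-bit halves and restore the 0x10000 offset.
        result.push(0x10000 + ((unit - 0xD800) << 10) + (next - 0xDC00));
        i++; // the low surrogate has been consumed
        continue;
      }
    }
    result.push(unit); // BMP character or unpaired surrogate
  }
  return result;
}

var decoded = codePointsOf("a\uD800\uDC00b"); // 0x61, 0x10000, 0x62
```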