Full Unicode strings strawman
Allen Wirfs-Brock
allen at wirfs-brock.com
Mon May 16 19:20:08 PDT 2011
On May 16, 2011, at 6:41 PM, Boris Zbarsky wrote:
> On 5/16/11 6:18 PM, Allen Wirfs-Brock wrote:
>> If the string is written as "\ud800\udc00\u0061" the 'a' will be at
>> offset 2, even in the new proposal. It would only be at offset 1 if it
>> was written as "\u+010000\u+000061" (using the literal notation from the
>> proposal).
>
> Ah, so in the proposal strings that happen to be sequences of UTF-16 units won't be automatically converted to Unicode strings?
Probably more correct to say UCS-2 units above. Such strings could be internally represented using 16-bit character cells. From a JS perspective, it is not possible to determine the size of the actual cell used to store the individual characters of a string. Of course, at the implementation level you have to know the size of the character cell, and if it is 16 bits then you know that the string doesn't contain any raw supplemental characters. (It might contain UTF-16 encoded characters, but that is at a different logical system layer.)
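The offsets in the quoted example can be checked in a current engine. This is only a sketch: it assumes an ES2015-capable engine, and it uses Array.from (which iterates by code point) as a stand-in for the per-character indexing the proposal would give a string written as "\u+010000\u+000061". That notation belongs to the proposal and is not used in the code.

```javascript
// Written with surrogate escapes, the string holds three 16-bit units.
const s = "\ud800\udc00\u0061";

const unitLength = s.length;          // counts 16-bit units
const unitOffsetOfA = s.indexOf("a"); // offset measured in 16-bit units

// Array.from iterates by code point, so the well-formed surrogate pair
// becomes a single element. This only approximates the proposal's view of
// a string that was written with full code point escapes.
const codePoints = Array.from(s);
const codePointOffsetOfA = codePoints.indexOf("a");

console.log(unitLength, unitOffsetOfA);             // 3 2
console.log(codePoints.length, codePointOffsetOfA); // 2 1
```

The first pair of numbers is what the surrogate-escaped literal yields both today and under the proposal; the second pair is what a string holding one raw supplemental character followed by 'a' would report.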
>
> That seems like it'll make it very easy to introduce strings that are a mix of the two via concatenation....
Some implementations already use tree structures to represent strings that are built via concatenation. It would be straightforward to have such a tree string representation where some segments have 16-bit cells and others have 32-bit (or even 8-bit) cells. That is probably how I would represent any long string that contained only a few supplemental characters.
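A minimal sketch of such a tree string follows, modelled in JavaScript with typed arrays rather than in engine internals. The names makeLeaf, concat and charAt are hypothetical and are not taken from any implementation or from the proposal; the point is only that each segment can choose its own cell width while indexing stays uniform.

```javascript
// Hypothetical leaf: a flat run of characters stored in cells of one width.
function makeLeaf(codePoints) {
  const max = codePoints.reduce((m, cp) => Math.max(m, cp), 0);
  // Pick the narrowest cell that can hold every character in this segment.
  const Cells =
    max <= 0xff ? Uint8Array : max <= 0xffff ? Uint16Array : Uint32Array;
  return { kind: "leaf", cells: Cells.from(codePoints), length: codePoints.length };
}

// Hypothetical interior node: concatenation without copying either operand.
function concat(left, right) {
  return { kind: "concat", left, right, length: left.length + right.length };
}

// Return the character value at a logical index, whatever the cell width
// of the segment that holds it.
function charAt(node, index) {
  while (node.kind === "concat") {
    if (index < node.left.length) {
      node = node.left;
    } else {
      index -= node.left.length;
      node = node.right;
    }
  }
  return node.cells[index];
}

const ascii = makeLeaf([0x68, 0x69]);     // fits in 8-bit cells
const bmp = makeLeaf([0x4e2d, 0x6587]);   // needs 16-bit cells
const supplemental = makeLeaf([0x10000]); // needs 32-bit cells
const rope = concat(concat(ascii, bmp), supplemental);

console.log(rope.length);                       // 5
console.log(charAt(rope, 4).toString(16));      // 10000
console.log(ascii.cells.BYTES_PER_ELEMENT,
            bmp.cells.BYTES_PER_ELEMENT,
            supplemental.cells.BYTES_PER_ELEMENT); // 1 2 4
```

In this sketch only the one segment containing a supplemental character pays for 32-bit cells; the rest of the string keeps its narrower storage, and none of that is visible through charAt or length.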
Allen