Full Unicode strings strawman

Allen Wirfs-Brock allen at wirfs-brock.com
Mon May 16 19:20:08 PDT 2011

On May 16, 2011, at 6:41 PM, Boris Zbarsky wrote:

> On 5/16/11 6:18 PM, Allen Wirfs-Brock wrote:
>> If the string is written as "\ud800\udc00\u0061" the 'a' will be at
>> offset 2, even in the new proposal. It would only be at offset 1 if it
>> was written as "\u+010000\u+000061" (using the literal notation from the
>> proposal).
> Ah, so in the proposal strings that happen to be sequences of UTF-16 units won't be automatically converted to Unicode strings?

Probably more correct to say UCS-2 units above. Such strings could be internally represented using 16-bit character cells. From a JS perspective, it is not possible to determine the size of the actual cell used to store the individual characters of a string. Of course, at the implementation level you have to know the size of the character cell, and if it is 16 bits then you know that the string doesn't contain any raw supplemental characters. (It might contain UTF-16 encoded characters, but that is at a different logical system layer.)
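
To make the two ways of counting concrete, here is a minimal sketch in present-day JavaScript, which kept 16-bit cells rather than adopting the proposal. It assumes an ES2015-or-later engine (Array.from and code point iteration did not exist when this was written), and the variable names are only illustrative. The first offset is what string indexing reports for the escape-written string; the second is the position the 'a' would have if each supplemental character occupied a single cell, as with the proposal's "\u+010000\u+000061" literal.

```javascript
// A surrogate pair followed by 'a', written with 16-bit escapes.
const s = "\ud800\udc00\u0061";

// Offset of 'a' counted in 16-bit cells, which is what indexing exposes.
const unitOffset = s.indexOf("a");

// Offset of 'a' counted in whole code points. Array.from walks the string
// by code point, so the surrogate pair becomes a single array element.
const codePoints = Array.from(s);
const codePointOffset = codePoints.indexOf("a");

console.log(s.length, unitOffset);                 // 3 2
console.log(codePoints.length, codePointOffset);   // 2 1
```

The string itself is unchanged between the two counts; only the unit of counting differs, which is the layering distinction made above.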

> That seems like it'll make it very easy to introduce strings that are a mix of the two via concatenation....

Some implementations already use tree structures to represent strings that are built via concatenation. It would be straightforward to have such a tree string representation where some segments have 16-bit cells and others 32-bit (or even 8-bit) cells. That is probably how I would represent any long string that contained only a few supplemental characters.
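
A rough sketch of that idea, written in JavaScript only for illustration (a real engine would do this internally, in its implementation language). The names Leaf, Concat, codeAt and toCodePoints are hypothetical and not taken from any implementation or from the proposal. Each leaf picks the narrowest cell width that holds all of its characters, and a concatenation node just links two subtrees.

```javascript
// Hypothetical leaf: stores its characters in cells of one fixed width.
class Leaf {
  constructor(codePoints) {
    const max = codePoints.reduce((m, c) => Math.max(m, c), 0);
    // Pick the narrowest cell type that can hold every character here.
    const Cell = max <= 0xff ? Uint8Array
               : max <= 0xffff ? Uint16Array
               : Uint32Array;
    this.cells = Cell.from(codePoints);
    this.length = this.cells.length;
  }
  codeAt(i) { return this.cells[i]; }
}

// Hypothetical interior node: concatenation without copying either side.
class Concat {
  constructor(left, right) {
    this.left = left;
    this.right = right;
    this.length = left.length + right.length;
  }
  codeAt(i) {
    return i < this.left.length
      ? this.left.codeAt(i)
      : this.right.codeAt(i - this.left.length);
  }
}

// Helper (assumes an ES2015+ engine): split a JS string into code points.
const toCodePoints = (str) => Array.from(str, (ch) => ch.codePointAt(0));

// A mostly-ASCII string with one supplemental character in the middle.
const rope = new Concat(
  new Concat(new Leaf(toCodePoints("long ASCII run ")), new Leaf([0x10000])),
  new Leaf(toCodePoints("a"))
);

console.log(rope.length);                     // 17
console.log(rope.codeAt(15).toString(16));    // 10000
console.log(rope.codeAt(16).toString(16));    // 61
```

Only the one-character middle leaf pays for 32-bit cells; the surrounding text stays in 8-bit cells, and a caller using codeAt cannot tell which width any segment uses.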

