Full Unicode strings strawman

Allen Wirfs-Brock allen at wirfs-brock.com
Mon May 16 19:57:45 PDT 2011


On May 16, 2011, at 7:22 PM, Boris Zbarsky wrote:

> On 5/16/11 10:20 PM, Allen Wirfs-Brock wrote:
>>> That seems like it'll make it very easy to introduce strings that are a mix of the two via concatenation....
>> 
>> Some implementations already use tree structures to represent strings that are built via concatenation.  It would be straightforward to have such a tree string representation where some segments have 16-bit cells and others 32-bit (or even 8-bit) cells. That is probably how I would represent any long string that contained only a few supplementary characters.
> 
> I'm not talking about the implementation end.  I can see how I'd implement this stuff, or make Luke implement it or something.  What I don't see is how the JS program author can sanely work with the result.
> 

In theory, the JS programmer already has to manually keep track of whether or not a string value is UTF-16 or UCS-2.  As John Tamplin observed in https://mail.mozilla.org/pipermail/es-discuss/2011-May/014319.html most JS programmers simply assume they are dealing with the BMP and trip up if they actually have to process a surrogate pair that was unexpectedly handed to them from the DOM.
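
To make the BMP assumption concrete, here is a minimal sketch of how a surrogate pair trips up per-index string code. The helper name countCodePoints is hypothetical and used only for illustration; it is not part of the proposal.

```javascript
// U+1D11E MUSICAL SYMBOL G CLEF, written as its UTF-16 surrogate pair.
var clef = "\uD834\uDD1E";

// One character to the user, but two 16-bit code units to the program.
var lengthInUnits = clef.length;

// Naive per-index processing swaps the halves of the pair, leaving
// two surrogates in the wrong order (no longer well-formed UTF-16).
var reversed = clef.split("").reverse().join("");

// Hypothetical helper: count code points by treating a high surrogate
// immediately followed by a low surrogate as a single unit.
function countCodePoints(s) {
  var count = 0;
  for (var i = 0; i < s.length; i++) {
    var unit = s.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < s.length) {
      var next = s.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // skip the low surrogate; the pair is one code point
      }
    }
    count++;
  }
  return count;
}
```

Code that only ever sees BMP text never notices the difference between length and countCodePoints, which is why the assumption tends to survive until a supplementary character arrives.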

That said, it would be easy enough to expand the proposal with a property on string values, visible to the JS programmer, that says whether or not the string is known to be UTF-16 encoded, and similarly a flag for UTF-32 encoding.  Presumably all values returned from the DOM as DOMStrings would have the UTF-16 property set. Strings produced by a UTF16Decode function or explicitly constructed containing supplementary characters would get the UTF-32 flag.  If you concatenated one of each it would get neither flag.
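
A rough sketch of how such flags might propagate, modeled with a plain wrapper object because the flags do not exist on real string values. The names tagged, concat, isUTF16 and isUTF32 are all hypothetical, and the rule that a flag survives when both operands carry it is an assumption; only the "one of each gets neither" case is stated above.

```javascript
// Hypothetical model: a string value plus the two suggested flags.
function tagged(value, isUTF16, isUTF32) {
  return { value: value, isUTF16: isUTF16, isUTF32: isUTF32 };
}

// Assumed rule: a concatenation keeps a flag only when both operands
// carry it, so mixing one of each leaves neither flag set.
function concat(a, b) {
  return tagged(a.value + b.value,
                a.isUTF16 && b.isUTF16,
                a.isUTF32 && b.isUTF32);
}

// Stand-in for a value returned from the DOM as a DOMString.
var fromDOM = tagged("\uD834\uDD1E", true, false);

// Stand-in for the result of a UTF16Decode call; the text itself is
// only a placeholder, since 32-bit cells cannot be modeled here.
var decoded = tagged("decoded text", false, true);

var mixed = concat(fromDOM, decoded);     // neither flag set
var stillDOM = concat(fromDOM, fromDOM);  // UTF-16 flag kept
```

A program could then check the flags on mixed before deciding how to index into it, instead of guessing which encoding it was handed.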

Allen