Full Unicode strings strawman

Allen Wirfs-Brock allen at wirfs-brock.com
Mon May 16 15:18:06 PDT 2011


On May 16, 2011, at 2:19 PM, Mark Davis ☕ wrote:

> I'm quite sympathetic to the goal, but the proposal does represent a significant breaking change. The problem, as Shawn points out, is with indexing. Before, the strings were defined as UTF16.

Not by the ECMAScript specification.

> 
> Take a sample string "\ud800\udc00\u0061" = "\u{10000}\u{61}". Right now, the 'a' (the \u{61}) is at offset 2. If the proposal were accepted, the 'a' would be at offset 1.

If the string is written as "\ud800\udc00\u0061", the 'a' will be at offset 2, even under the new proposal. It would only be at offset 1 if it were written as "\u+010000\u+000061" (using the literal notation from the proposal).
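
To make the offsets concrete, here is a small sketch of what an engine does today with the escaped surrogate-pair form. The variable names are only for illustration. The proposal's "\u+010000" literal notation is not shown because no current engine accepts it.

```javascript
// Each string element is one 16-bit code unit.
var s = "\ud800\udc00\u0061"; // surrogate pair for U+10000, then 'a'

var lengthInUnits = s.length;   // 3 code units
var offsetOfA = s.indexOf("a"); // 2: both surrogates are counted separately
var firstUnit = s.charCodeAt(0);  // 0xD800, the high surrogate
var secondUnit = s.charCodeAt(1); // 0xDC00, the low surrogate
```

Since the source text spells out two surrogate escapes, the string holds two elements before the 'a' regardless of how wide a string element is allowed to be.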

> This will definitely cause breakage in existing code;

How does this break existing code? Existing code cannot say "\u+010000\u+000061". As I've pointed out elsewhere on this thread, existing libraries that do UTF-16 encoding/decoding must continue to do so even under this new proposal.
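
As a rough illustration of the kind of library code I mean, here is a minimal sketch of a UTF-16 decoder. The function name decodeUTF16 is hypothetical and not taken from any particular library. It assumes its input holds 16-bit code units, which is what existing literals and existing I/O paths produce.

```javascript
// Hypothetical helper: turn a string of 16-bit code units into an
// array of Unicode code points, combining surrogate pairs.
function decodeUTF16(str) {
  var codePoints = [];
  for (var i = 0; i < str.length; i++) {
    var unit = str.charCodeAt(i);
    // A high surrogate followed by a low surrogate encodes one
    // supplementary code point.
    if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < str.length) {
      var next = str.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        codePoints.push(((unit - 0xD800) << 10) + (next - 0xDC00) + 0x10000);
        i++; // skip the low surrogate that was just consumed
        continue;
      }
    }
    // BMP code unit, or an unpaired surrogate passed through unchanged.
    codePoints.push(unit);
  }
  return codePoints;
}

var decoded = decodeUTF16("\ud800\udc00\u0061"); // [0x10000, 0x61]
```

Code like this sees the same input it always has when handed a string built from surrogate escapes, so its results do not change.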

> characters are in different positions than they were, even characters that are not supplemental ones. All it takes is one supplemental character before the current position and the offsets will be off for the rest of the string.


Allen
