UTF-16 Strings not-strawman

Shawn Steele Shawn.Steele at microsoft.com
Thu May 19 11:17:52 PDT 2011


I don’t have time to make a real strawman, but what would people need if we went the UTF-16 route (instead of full Unicode)?  (This thread is to collect requirements, which are getting somewhat lost in the thread on the merits of UTF-16 vs. 32-bit.)  Basically, just replace UCS-2 with UTF-16, allowing irregular UTF-16 (such as unpaired surrogates) for compatibility.

Things that come to mind immediately are:

·         Some sort of convenience notation for supplementary characters in string literals and regular expressions.

·         Extend String.fromCharCode() to allow generating UTF-16 surrogate pairs for values 0x10000-0x10FFFF.

·         Some way to get values 0x10000-0x10FFFF back out, as charCodeAt() would for a single code unit.  I assume it’d have to be a new function.

·         Make encodeURIComponent and decodeURIComponent use UTF-8 instead of CESU-8.  (The current behavior actually breaks the specifications because CESU-8 as generated != UTF-8 as defined, but I’m not sure the bug can be fixed.)  So either fix the bug (probably too breaking?) or at least add a new “correctlyEncodeURIComponent”.  (I don’t think the decoding change would be breaking.)
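
To make the fromCharCode/charCodeAt items concrete, here is a minimal sketch of what such functions might do, written using only the existing 16-bit primitives. The names codePointToString and codePointAtIndex are hypothetical placeholders, not proposed API names, and the choice to pass unpaired surrogates through unchanged is an assumption that follows the "allow irregular UTF-16" idea above.

```javascript
// Hypothetical helper (name is an assumption): build a string from a single
// code point in the range 0..0x10FFFF, emitting a surrogate pair when needed.
function codePointToString(cp) {
  if (cp < 0 || cp > 0x10FFFF || Math.floor(cp) !== cp) {
    throw new RangeError("Invalid code point: " + cp);
  }
  if (cp < 0x10000) {
    // Fits in one 16-bit code unit (this includes lone surrogate values).
    return String.fromCharCode(cp);
  }
  var offset = cp - 0x10000;             // 20 significant bits
  var high = 0xD800 + (offset >> 10);    // top 10 bits -> high surrogate
  var low = 0xDC00 + (offset & 0x3FF);   // bottom 10 bits -> low surrogate
  return String.fromCharCode(high, low);
}

// Hypothetical helper (name is an assumption): read the code point that
// starts at code-unit index i. A well-formed surrogate pair is combined;
// anything else, including an unpaired surrogate, is returned as-is.
function codePointAtIndex(str, i) {
  var first = str.charCodeAt(i);
  if (first >= 0xD800 && first <= 0xDBFF && i + 1 < str.length) {
    var second = str.charCodeAt(i + 1);
    if (second >= 0xDC00 && second <= 0xDFFF) {
      return 0x10000 + ((first - 0xD800) << 10) + (second - 0xDC00);
    }
  }
  return first; // NaN when i is out of range, same as charCodeAt
}
```

For example, codePointToString(0x1F600) yields the two code units "\uD83D\uDE00", and codePointAtIndex("\uD83D\uDE00", 0) returns 0x1F600.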

Things I’m less certain about:

·         There is apparently some desire to walk a string by +=1 or +=2 depending on whether the current position holds a surrogate pair.  I’m not sure it’s worth formalizing since, to me, it’s more interesting to walk a string by graphemes or other more appropriate text elements.  And most applications don’t seem to care much about whether they break strings.

·         A strict mode that disallows the irregular UTF-16?
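
For reference, the +=1 or +=2 walk could look roughly like the sketch below. The function name forEachCodePoint and its callback signature are hypothetical, and reporting an unpaired surrogate as its own value (rather than throwing, as a strict mode presumably would) is an assumption.

```javascript
// Hypothetical helper (name and signature are assumptions): visit a string
// one code point at a time, advancing by 2 code units over a well-formed
// surrogate pair and by 1 otherwise.
function forEachCodePoint(str, callback) {
  var i = 0;
  while (i < str.length) {
    var unit = str.charCodeAt(i);
    var cp = unit;
    var step = 1;
    if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < str.length) {
      var next = str.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        cp = 0x10000 + ((unit - 0xD800) << 10) + (next - 0xDC00);
        step = 2; // consumed both halves of the pair
      }
    }
    callback(cp, i); // code point value and the index where it starts
    i += step;
  }
}
```

This only steps over code points; it does nothing about graphemes, so a base letter followed by a combining mark is still reported as two separate values.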

- Shawn

 
http://blogs.msdn.com/shawnste
