Full Unicode strings strawman

Allen Wirfs-Brock allen at wirfs-brock.com
Mon May 16 14:56:38 PDT 2011


On May 16, 2011, at 1:38 PM, Wes Garland wrote:

> Allen;
> 
> Thanks for putting this together.  We use Unicode data extensively in both our web and server-side applications, and being forced to deal with UTF-16 surrogate pairs directly -- rather than letting the String implementation deal with them -- is a constant source of mild pain.  At first blush, this proposal looks like it meets all my needs, and my gut tells me the perf impacts will probably be neutral or good.
> 
> Two great things about strings composed of Unicode code points:
> 1) .length represents the number of code points, rather than the number of 16-bit code units used in UTF-16, even if the underlying representation isn't UTF-16
> 2) S.charCodeAt(S.indexOf(X)) always returns the same kind of information (a Unicode code point), regardless of whether X is in the BMP or not
> 
> Even though this is a breaking change from ES-5, I support it whole-heartedly.... but I expect breakage to be very limited. Provided that the implementation does not restrict the storage of reserved code points (D800-DFFF), it should be possible for users using String as immutable C-arrays to keep doing so. Users doing surrogate pair decomposition will probably find that their code "just works", as those code points will never appear in legitimate strings of Unicode code points.  Users creating Strings with surrogate pairs will need to re-tool, but this is a small burden and these users will be at the upper strata of Unicode-foodom.  I suspect that 99.99% of users will find that this change will fix bugs in their code when dealing with non-BMP characters.

Thanks, this is exactly my thinking on the subject.
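
For readers less familiar with the pain point discussed above, here is a minimal sketch in ES5-style JavaScript of the manual surrogate handling that code has to do today. The helper names codePointAt and codePointLength are hypothetical, chosen for illustration only; they are not part of the strawman, which would instead make .length and charCodeAt behave this way directly.

```javascript
// U+1D11E MUSICAL SYMBOL G CLEF, written as the surrogate pair an ES5 string stores.
var clef = "\uD834\uDD1E";

// ES5 behaviour: .length and charCodeAt operate on 16-bit code units.
var es5Length = clef.length;         // 2 code units for a single character
var firstUnit = clef.charCodeAt(0);  // 0xD834, a high surrogate, not a code point

// Hypothetical helper: decode the code point that starts at index i.
// A lone (unpaired) surrogate is returned unchanged.
function codePointAt(s, i) {
  var hi = s.charCodeAt(i);
  if (hi >= 0xD800 && hi <= 0xDBFF && i + 1 < s.length) {
    var lo = s.charCodeAt(i + 1);
    if (lo >= 0xDC00 && lo <= 0xDFFF) {
      return (hi - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000;
    }
  }
  return hi;
}

// Hypothetical helper: count code points, treating each valid
// surrogate pair as one character.
function codePointLength(s) {
  var count = 0;
  for (var i = 0; i < s.length; i++) {
    var unit = s.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < s.length) {
      var next = s.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // skip the low surrogate of the pair
      }
    }
    count++;
  }
  return count;
}
```

Under the assumptions of this sketch, codePointLength("a" + clef + "b") counts three characters where .length reports four, which is the discrepancy Wes's first point describes.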




