Full Unicode strings strawman
wes at page.ca
Mon May 16 13:38:21 PDT 2011
Thanks for putting this together. We use Unicode data extensively in both
our web and server-side applications, and being forced to deal with UTF-16
surrogate pairs directly -- rather than letting the String implementation
deal with them -- is a constant source of mild pain. At first blush, this
proposal looks like it meets all my needs, and my gut tells me the perf
impacts will probably be neutral or good.
Two great things about strings composed of Unicode code points:
1) .length represents the number of code points, rather than the number of
16-bit code units used in UTF-16, even if the underlying representation isn't UTF-16
2) S.charCodeAt(S.indexOf(X)) always returns the same kind of information (a
Unicode code point), regardless of whether X is in the BMP or not
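
To make the two points above concrete, here is a minimal sketch of how strings behave today, where .length and charCodeAt work on UTF-16 code units. The helper countCodePoints is a hypothetical name used only for illustration; it is not part of ES5 or of the strawman.

```javascript
// U+1D11E MUSICAL SYMBOL G CLEF, written as its UTF-16 surrogate pair.
var clef = "\uD834\uDD1E";
var s = "a" + clef + "b";

// Hypothetical helper: counts code points by treating a high surrogate
// followed by a low surrogate as a single character.
function countCodePoints(str) {
  var count = 0;
  for (var i = 0; i < str.length; i++) {
    var unit = str.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < str.length) {
      var next = str.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // skip the low surrogate, it belongs to the same code point
      }
    }
    count++;
  }
  return count;
}

console.log(s.length);           // 4 (code units)
console.log(countCodePoints(s)); // 3 (code points)

// charCodeAt at the position found by indexOf yields only the high
// surrogate, not the code point U+1D11E.
console.log(s.charCodeAt(s.indexOf(clef)).toString(16)); // d834
```

Under the proposal as I read it, s.length would be 3 and the charCodeAt call would return 0x1D11E, with no helper needed.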
Even though this is a breaking change from ES-5, I support it
whole-heartedly.... but I expect breakage to be very limited. Provided that
the implementation does not restrict the storage of reserved code points
(D800-DFFF), it should be possible for users using String as immutable
C-arrays to keep doing so. Users doing surrogate pair decomposition will
probably find that their code "just works", as those code points will never
appear in legitimate strings of Unicode code points. Users creating Strings
with surrogate pairs will need to re-tool, but this is a small burden and
these users will be at the upper strata of Unicode-foodom. I suspect that
99.99% of users will find that this change will fix bugs in their code when
dealing with non-BMP characters.
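
For readers who have not written this kind of code, the following is a rough sketch of the surrogate pair composition and decomposition that the paragraph above refers to. The function names toSurrogatePair and fromSurrogatePair are hypothetical, chosen for illustration; the arithmetic is the standard UTF-16 mapping.

```javascript
// Hypothetical helper: builds a string for a non-BMP code point by hand,
// the way code must do it when strings are sequences of UTF-16 code units.
function toSurrogatePair(codePoint) {
  var offset = codePoint - 0x10000;
  var high = 0xD800 + (offset >> 10);   // top 10 bits
  var low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return String.fromCharCode(high, low);
}

// Hypothetical helper: recovers the code point from a surrogate pair.
function fromSurrogatePair(high, low) {
  return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
}

var pair = toSurrogatePair(0x1D11E);
console.log(pair === "\uD834\uDD1E"); // true
console.log(fromSurrogatePair(pair.charCodeAt(0), pair.charCodeAt(1)).toString(16)); // 1d11e
```

Code like toSurrogatePair is what would need re-tooling, since the two code units would no longer combine into one character. Code like fromSurrogatePair would simply stop being reached, assuming well-formed strings never contain surrogates.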
Mike Samuel, there would never be a supplement code unit to match, as the
return value of [[Get]] would be a code point.
Shawn Steele, I don't understand this comment:
Also, the “trick” I think, is encoding to surrogate pairs (illegally, since
UTF8 doesn’t allow that) vs decoding to UTF16.
Why do we care about the UTF-16 representation of particular codepoints?
Why can't the new functions just encode the Unicode string as UTF-8 and URI
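
As a point of reference for the question above, this small sketch shows what the existing encodeURIComponent does with a well-formed surrogate pair and with a lone surrogate. It only illustrates current behaviour; it says nothing about how any new functions in the strawman would be specified.

```javascript
// U+1D11E as a surrogate pair: the existing function combines the pair
// and emits the four UTF-8 bytes of the code point, percent-escaped.
var gClef = "\uD834\uDD1E";
var escaped = encodeURIComponent(gClef);
console.log(escaped);                               // %F0%9D%84%9E
console.log(decodeURIComponent(escaped) === gClef); // true

// A lone surrogate has no valid UTF-8 encoding, so it is rejected
// rather than being encoded as a three-byte sequence.
var loneSurrogateRejected = false;
try {
  encodeURIComponent("\uD834");
} catch (e) {
  loneSurrogateRejected = e instanceof URIError;
}
console.log(loneSurrogateRejected); // true
```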
Mike Samuel, can you explain why you are en/decoding UTF-16 when
round-tripping through the DOM? Does the DOM specify UTF-16 encoding? If it
does, that's silly. Both ES and DOM should specify "Unicode" and let the
data interchange format be an implementation detail. It is an unfortunate
accident of history that UTF-16 surrogate pairs leak their abstraction into
ES Strings, and I believe it is high time we fixed that.
Wesley W. Garland
Director, Product Development
+1 613 542 2787 x 102