Full Unicode strings strawman

Mark Davis ☕ mark at macchiato.com
Mon May 16 14:19:48 PDT 2011


I'm quite sympathetic to the goal, but the proposal does represent a
significant breaking change. The problem, as Shawn points out, is with
indexing. Until now, strings have been defined as sequences of UTF-16 code units.

Take a sample string "\ud800\udc00\u0061" = "\u{10000}\u{61}". Right now,
the 'a' (the \u{61}) is at offset 2. If the proposal were accepted, the 'a'
would be at offset 1. This will definitely cause breakage in existing code;
characters are in different positions than they were, even characters that
are not supplemental ones. All it takes is one supplemental character before
the current position and the offsets will be off for the rest of the string.
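
To make the offset shift concrete, here is a minimal sketch that runs under
today's code-unit semantics. The helper toCodePoints is a hypothetical name,
not part of the strawman; it decodes surrogate pairs by hand to simulate what
indexing would look like if strings were sequences of code points.

```javascript
// Hypothetical helper: decode a UTF-16 string into an array of code points,
// simulating code-point indexing on top of today's code-unit strings.
function toCodePoints(s) {
  var out = [];
  for (var i = 0; i < s.length; i++) {
    var hi = s.charCodeAt(i);
    if (hi >= 0xD800 && hi <= 0xDBFF && i + 1 < s.length) {
      var lo = s.charCodeAt(i + 1);
      if (lo >= 0xDC00 && lo <= 0xDFFF) {
        // Well-formed surrogate pair: combine into one supplementary code point.
        out.push(((hi - 0xD800) << 10) + (lo - 0xDC00) + 0x10000);
        i++; // skip the low surrogate
        continue;
      }
    }
    out.push(hi); // BMP character or unpaired surrogate
  }
  return out;
}

var sample = "\ud800\udc00\u0061";
var unitOffset = sample.indexOf("a");                 // 2 (code-unit index)
var pointOffset = toCodePoints(sample).indexOf(0x61); // 1 (code-point index)
```

Any existing code that stored or computed the value 2 for this string would
silently point past the 'a' under the new indexing.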

Faced with exactly the same problem, Java took a different approach that
allows for handling of the full range of Unicode characters, but maintains
backwards compatibility. It may be instructive to look at what they did
(although there was definitely room for improvement in their approach!). I
can follow up with that if people are interested. Alternatively, perhaps
mechanisms can be put in place to tell ECMAScript to use new vs. old indexing
(Perl uses PRAGMAs for that kind of thing, for example), although that has
its own ugliness.
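
As a rough illustration of the general shape of the Java approach: Java kept
indexing by UTF-16 code unit and added code-point-aware methods alongside it
(codePointAt, codePointCount, offsetByCodePoints). The sketch below mimics
that idea in ECMAScript. The three functions are hypothetical helpers that
borrow Java's names; they are not an existing or proposed ECMAScript API, and
this is a simplified, forward-only version.

```javascript
// Hypothetical helpers, loosely modeled on the method names Java added.
// All indices remain UTF-16 code-unit offsets, so existing code is unaffected.

function isHigh(u) { return u >= 0xD800 && u <= 0xDBFF; }
function isLow(u) { return u >= 0xDC00 && u <= 0xDFFF; }

// True if a well-formed surrogate pair starts at code-unit offset i.
function pairAt(s, i) {
  return i + 1 < s.length && isHigh(s.charCodeAt(i)) && isLow(s.charCodeAt(i + 1));
}

// Code point that starts at code-unit offset i.
function codePointAt(s, i) {
  if (pairAt(s, i)) {
    return ((s.charCodeAt(i) - 0xD800) << 10) + (s.charCodeAt(i + 1) - 0xDC00) + 0x10000;
  }
  return s.charCodeAt(i); // BMP character or unpaired surrogate
}

// Number of code points in s.
function codePointCount(s) {
  var n = 0;
  for (var i = 0; i < s.length; i += pairAt(s, i) ? 2 : 1) n++;
  return n;
}

// Code-unit offset reached by moving forward n code points from offset i.
function offsetByCodePoints(s, i, n) {
  while (n > 0 && i < s.length) {
    i += pairAt(s, i) ? 2 : 1;
    n--;
  }
  return i;
}

var sample = "\ud800\udc00\u0061";
var units = sample.length;                        // 3, unchanged
var points = codePointCount(sample);              // 2
var aOffset = offsetByCodePoints(sample, 0, 1);   // 2, still a code-unit offset
var first = codePointAt(sample, 0).toString(16);  // "10000"
```

The trade-off is that old offsets stay valid, at the cost of callers having to
opt in to the code-point-aware functions.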

Mark

*— Il meglio è l’inimico del bene —*


On Mon, May 16, 2011 at 13:38, Wes Garland <wes at page.ca> wrote:

> Allen;
>
> Thanks for putting this together.  We use Unicode data extensively in both
> our web and server-side applications, and being forced to deal with UTF-16
> surrogate pairs directly -- rather than letting the String implementation
> deal with them -- is a constant source of mild pain.  At first blush, this
> proposal looks like it meets all my needs, and my gut tells me the perf
> impacts will probably be neutral or good.
>
> Two great things about strings composed of Unicode code points:
> 1) .length represents the number of code points, rather than the number of
> code units used in UTF-16, even if the underlying representation isn't UTF-16
> 2) S.charCodeAt(S.indexOf(X)) always returns the same kind of information
> (a Unicode code point), regardless of whether X is in the BMP or not
>
> Even though this is a breaking change from ES-5, I support it
> whole-heartedly.... but I expect breakage to be very limited. Provided that
> the implementation does not restrict the storage of reserved code points
> (D800-DFFF), it should be possible for users using String as immutable
> C-arrays to keep doing so. Users doing surrogate pair decomposition will
> probably find that their code "just works", as those code points will never
> appear in legitimate strings of Unicode code points.  Users creating Strings
> with surrogate pairs will need to re-tool, but this is a small burden and
> these users will be at the upper strata of Unicode-foodom.  I suspect that
> 99.99% of users will find that this change will fix bugs in their code when
> dealing with non-BMP characters.
>
> Mike Samuel, there would never be a supplementary code unit to match, as the
> return value of [[Get]] would be a code point.
>
> Shawn Steele, I don't understand this comment:
>
> Also, the “trick” I think, is encoding to surrogate pairs (illegally, since
> UTF8 doesn’t allow that) vs decoding to UTF16.
>
>
> Why do we care about the UTF-16 representation of particular codepoints?
> Why can't the new functions just encode the Unicode string as UTF-8 and URI
> escape it?
>
> Mike Samuel, can you explain why you are en/decoding UTF-16 when
> round-tripping through the DOM?  Does the DOM specify UTF-16 encoding? If it
> does, that's silly.  Both ES and DOM should specify "Unicode" and let the
> data interchange format be an implementation detail.  It is an unfortunate
> accident of history that UTF-16 surrogate pairs leak their abstraction into
> ES Strings, and I believe it is high time we fixed that.
>
> Wes
>
> --
> Wesley W. Garland
> Director, Product Development
> PageMail, Inc.
> +1 613 542 2787 x 102