Full Unicode strings strawman

Mark Davis ☕ mark at macchiato.com
Mon May 16 14:44:37 PDT 2011


In terms of implementation capabilities, there isn't really a significant
practical difference between

   - a UCS-2 implementation, and
   - a UTF-16 implementation that doesn't have supplemental characters in
   its supported repertoire.
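
A minimal sketch of that equivalence, assuming plain ES5-style code-unit access. The helper name isBmpOnly is made up for illustration and is not part of any spec or proposal. When it returns true, every code unit is also a complete code point, so a UCS-2 reading and a UTF-16 reading of the string give the same result.

```javascript
// Hypothetical helper (not from any spec): true when the string contains
// no surrogate code units, i.e. nothing outside the BMP.
function isBmpOnly(s) {
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDFFF) {
      return false; // a surrogate: only here do UCS-2 and UTF-16 differ
    }
  }
  return true;
}

const bmpText = "caf\u00E9";            // all four characters are in the BMP
const supplementalText = "\uD800\uDC00a"; // U+10000 followed by 'a'

const bmpResult = isBmpOnly(bmpText);
const supplementalResult = isBmpOnly(supplementalText);
```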


Mark

*— Il meglio è l’inimico del bene —*


On Mon, May 16, 2011 at 14:28, Shawn Steele <Shawn.Steele at microsoft.com> wrote:

>  I think the problem isn’t so much that the spec used UCS-2, but rather
> that some implementations used UTF-16 instead as that is more convenient in
> many cases.  To the application developer, it’s difficult to tell the
> difference between UCS-2 and UTF-16 if I can use a regular expression to
> find D800, DC00.  Indeed, when the rendering engine of whatever host is
> going to display the glyph for U+10000, it’d be hard to notice the subtlety
> of UCS-2 vs UTF-16.
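
To illustrate the regular expression point in the quoted text, here is a rough sketch assuming a regular expression that works on 16-bit code units. The variable names are only for illustration.

```javascript
// Matches one high surrogate followed by one low surrogate.
const surrogatePair = /[\uD800-\uDBFF][\uDC00-\uDFFF]/g;

const text = "a\uD800\uDC00b";          // 'a', U+10000, 'b'
const matches = text.match(surrogatePair) || [];
const firstPairIndex = text.search(/[\uD800-\uDBFF][\uDC00-\uDFFF]/);
```

Script code can see both halves of the pair as separate units, which is why the string looks like UTF-16 to the developer whatever the spec calls it.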
>
>
>
> -Shawn
>
>
>
> *From:* es-discuss-bounces at mozilla.org [mailto:
> es-discuss-bounces at mozilla.org] *On Behalf Of *Jungshik Shin
> *Sent:* Monday, May 16, 2011 2:24 PM
> *To:* Mark Davis ☕
> *Cc:* Markus Scherer; es-discuss at mozilla.org
>
> *Subject:* Re: Full Unicode strings strawman
>
>
>
>
>
> On Mon, May 16, 2011 at 2:19 PM, Mark Davis ☕ <mark at macchiato.com> wrote:
>
> I'm quite sympathetic to the goal, but the proposal does represent a
> significant breaking change. The problem, as Shawn points out, is with
> indexing. Before, the strings were defined as UTF-16.
>
>
>
> I agree with what Mark wrote, except that the previous spec used UCS-2, which
> this proposal (and other proposals on the issue) try to rectify. I think
> that taking Java's approach would work better with DOMString as well.
>
>
>
> See the W3C I18N WG's proposal <http://www.w3.org/International/wiki/JavaScriptInternationalization>
> on the issue (and Java's approach <http://java.sun.com/developer/technicalArticles/Intl/Supplementary/>, linked there).
>
>
>
> Jungshik
>
>
>
>
>
> Take a sample string "\ud800\udc00\u0061" = "\u{10000}\u{61}". Right now,
> the 'a' (the \u{61}) is at offset 2. If the proposal were accepted, the 'a'
> would be at offset 1. This will definitely cause breakage in existing code;
> characters are in different positions than they were, even characters that
> are not supplemental ones. All it takes is one supplemental character before
> the current position and the offsets will be off for the rest of the string.
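
The offset shift in the quoted sample string can be checked with a short sketch. The helper codePointOffset is hypothetical and only simulates what code point indexing would report. It assumes the index passed in does not fall in the middle of a surrogate pair.

```javascript
const sample = "\uD800\uDC00\u0061";    // U+10000 followed by 'a'

// Code-unit indexing: the "right now" behaviour described above.
const codeUnitOffset = sample.indexOf("a");

// Hypothetical helper: the offset the same position would have if
// indices counted code points instead of 16-bit code units.
function codePointOffset(s, codeUnitIndex) {
  let offset = 0;
  for (let i = 0; i < codeUnitIndex; i++) {
    const unit = s.charCodeAt(i);
    const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
    // A well-formed surrogate pair counts once, so skip its low half.
    if (unit >= 0xD800 && unit <= 0xDBFF && next >= 0xDC00 && next <= 0xDFFF) {
      i++;
    }
    offset++;
  }
  return offset;
}

const newOffset = codePointOffset(sample, codeUnitOffset);
```

Every supplemental character before a position widens the gap between the two offsets by one.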
>
>
>
> Faced with exactly the same problem, Java took a different approach that
> allows for handling of the full range of Unicode characters, but maintains
> backwards compatibility. It may be instructive to look at what they did
> (although there was definitely room for improvement in their approach!). I
> can follow up with that if people are interested. Alternatively, perhaps
> mechanisms can be put in place to tell ECMAScript to use new vs old indexing
> (Perl uses PRAGMAs for that kind of thing, for example), although that has
> its own ugliness.
>
>
>
> Mark
>
> *— Il meglio è l’inimico del bene —*
>
>   On Mon, May 16, 2011 at 13:38, Wes Garland <wes at page.ca> wrote:
>
>  Allen;
>
> Thanks for putting this together.  We use Unicode data extensively in both
> our web and server-side applications, and being forced to deal with UTF-16
> surrogate pairs directly -- rather than letting the String implementation
> deal with them -- is a constant source of mild pain.  At first blush, this
> proposal looks like it meets all my needs, and my gut tells me the perf
> impacts will probably be neutral or good.
>
> Two great things about strings composed of Unicode code points:
> 1) .length represents the number of code points, rather than the number of
> code units used in UTF-16, even if the underlying representation isn't UTF-16
> 2) S.charCodeAt(S.indexOf(X)) always returns the same kind of information
> (a Unicode code point), regardless of whether X is in the BMP or not
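
For contrast with the two points above, here is a sketch of what code-unit strings do with the same expressions. The helper codePointAtIndex is hypothetical. It uses the standard UTF-16 decoding arithmetic to show the value a code-point string would presumably hand back directly.

```javascript
const X = "\uD800\uDC00";               // U+10000 written as a surrogate pair
const S = "ab" + X;

// With code-unit strings this yields only the high surrogate.
const unit = S.charCodeAt(S.indexOf(X));

// Hypothetical helper: decode the code point that starts at code-unit index i.
function codePointAtIndex(s, i) {
  const hi = s.charCodeAt(i);
  if (hi >= 0xD800 && hi <= 0xDBFF && i + 1 < s.length) {
    const lo = s.charCodeAt(i + 1);
    if (lo >= 0xDC00 && lo <= 0xDFFF) {
      // Standard UTF-16 surrogate pair decoding.
      return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
    }
  }
  return hi; // BMP character or unpaired surrogate: return the unit as-is
}

const codePoint = codePointAtIndex(S, S.indexOf(X));
```

Here S.length is 4 even though S holds three characters, and charCodeAt returns 0xD800 where the decoded value is 0x10000.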
>
> Even though this is a breaking change from ES-5, I support it
> whole-heartedly.... but I expect breakage to be very limited. Provided that
> the implementation does not restrict the storage of reserved code points
> (D800-DFFF), it should be possible for users using String as immutable
> C-arrays to keep doing so. Users doing surrogate pair decomposition will
> probably find that their code "just works", as those code points will never
> appear in legitimate strings of Unicode code points.  Users creating Strings
> with surrogate pairs will need to re-tool, but this is a small burden and
> these users will be at the upper strata of Unicode-foodom.  I suspect that
> 99.99% of users will find that this change will fix bugs in their code when
> dealing with non-BMP characters.
>
> Mike Samuel, there would never be a supplementary code unit to match, as the
> return value of [[Get]] would be a code point.
>
> Shawn Steele, I don't understand this comment:
>
>
>
> Also, the “trick” I think, is encoding to surrogate pairs (illegally, since
> UTF8 doesn’t allow that) vs decoding to UTF16.
>
>
> Why do we care about the UTF-16 representation of particular codepoints?
> Why can't the new functions just encode the Unicode string as UTF-8 and URI
> escape it?
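
As a point of reference for the UTF-8 question, this sketch shows how the existing encodeURIComponent treats surrogates. It is not the new functions under discussion, only an illustration of "encode as UTF-8, then URI escape" with today's built-ins.

```javascript
// A well-formed pair is encoded as the four UTF-8 bytes of U+10000.
const encoded = encodeURIComponent("\uD800\uDC00");
const roundTripped = decodeURIComponent(encoded);

// An unpaired surrogate has no UTF-8 form, so the built-in throws URIError.
let loneSurrogateRejected = false;
try {
  encodeURIComponent("\uD800");
} catch (e) {
  loneSurrogateRejected = e instanceof URIError;
}
```

The encoded value is "%F0%90%80%80", a single four-byte sequence, not two three-byte sequences for the individual surrogates.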
>
> Mike Samuel, can you explain why you are en/decoding UTF-16 when
> round-tripping through the DOM?  Does the DOM specify UTF-16 encoding? If it
> does, that's silly.  Both ES and DOM should specify "Unicode" and let the
> data interchange format be an implementation detail.  It is an unfortunate
> accident of history that UTF-16 surrogate pairs leak their abstraction into
> ES Strings, and I believe it is high time we fixed that.
>
> Wes
>
> --
> Wesley W. Garland
> Director, Product Development
> PageMail, Inc.
> +1 613 542 2787 x 102
>
> _______________________________________________
> es-discuss mailing list
> es-discuss at mozilla.org
> https://mail.mozilla.org/listinfo/es-discuss
>
>
>
>