Full Unicode strings strawman

Jungshik Shin (신정식, 申政湜) jungshik at google.com
Mon May 16 14:24:03 PDT 2011


On Mon, May 16, 2011 at 2:19 PM, Mark Davis ☕ <mark at macchiato.com> wrote:

> I'm quite sympathetic to the goal, but the proposal does represent a
> significant breaking change. The problem, as Shawn points out, is with
> indexing. Before, the strings were defined as UTF16.


I agree with what Mark wrote, except that the previous spec used UCS-2, which
this proposal (and other proposals on the issue) tries to rectify. I think that
taking Java's approach would work better with DOMString as well.

See the W3C I18N WG's proposal
<http://www.w3.org/International/wiki/JavaScriptInternationalization>
on the issue, and Java's approach
<http://java.sun.com/developer/technicalArticles/Intl/Supplementary/>
(linked there).
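
To make the indexing point concrete, here is a rough sketch of what a
Java-style approach could look like from script. The helper names
codePointAt and codePointCount are hypothetical; they are modeled on Java's
String.codePointAt and String.codePointCount and are not part of any
proposal in this thread. Indices keep counting 16-bit code units, so
existing offsets do not move, while the helpers return whole code points.

```javascript
// Mark's sample string: U+10000 (stored as a surrogate pair) followed by 'a'.
const s = "\ud800\udc00\u0061";

// Hypothetical helper modeled on Java's String.codePointAt(int index).
// The index still counts 16-bit code units; the result is a full code point
// when a valid surrogate pair starts at that index.
function codePointAt(str, index) {
  const hi = str.charCodeAt(index);
  if (hi >= 0xd800 && hi <= 0xdbff && index + 1 < str.length) {
    const lo = str.charCodeAt(index + 1);
    if (lo >= 0xdc00 && lo <= 0xdfff) {
      return (hi - 0xd800) * 0x400 + (lo - 0xdc00) + 0x10000;
    }
  }
  // BMP character or unpaired surrogate: return the code unit unchanged.
  return hi;
}

// Hypothetical helper modeled on Java's String.codePointCount.
function codePointCount(str) {
  let count = 0;
  for (let i = 0; i < str.length; i++) {
    count++;
    if (codePointAt(str, i) > 0xffff) {
      i++; // skip the low surrogate of the pair just counted
    }
  }
  return count;
}
```

With this sketch, s.length is still 3 and s.indexOf("a") is still 2, as in
existing code, while codePointAt(s, 0) gives 0x10000 and codePointCount(s)
gives 2.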

Jungshik


>
> Take a sample string "\ud800\udc00\u0061" = "\u{10000}\u{61}". Right now,
> the 'a' (the \u{61}) is at offset 2. If the proposal were accepted, the 'a'
> would be at offset 1. This will definitely cause breakage in existing code;
> characters are in different positions than they were, even characters that
> are not supplemental ones. All it takes is one supplemental character before
> the current position and the offsets will be off for the rest of the string.
>
> Faced with exactly the same problem, Java took a different approach that
> allows for handling of the full range of Unicode characters, but maintains
> backwards compatibility. It may be instructive to look at what they did
> (although there was definitely room for improvement in their approach!). I
> can follow up with that if people are interested. Alternatively, perhaps
> mechanisms can be put in place to tell ECMAScript to use new vs old indexing
> (Perl uses PRAGMAs for that kind of thing, for example), although that has
> its own ugliness.
>
> Mark
>
> *— Il meglio è l’inimico del bene —*
>
>
> On Mon, May 16, 2011 at 13:38, Wes Garland <wes at page.ca> wrote:
>
>> Allen;
>>
>> Thanks for putting this together.  We use Unicode data extensively in both
>> our web and server-side applications, and being forced to deal with UTF-16
>> surrogate pairs directly -- rather than letting the String implementation
>> deal with them -- is a constant source of mild pain.  At first blush, this
>> proposal looks like it meets all my needs, and my gut tells me the perf
>> impacts will probably be neutral or good.
>>
>> Two great things about strings composed of Unicode code points:
>> 1) .length represents the number of code points, rather than the number of
>> code units used in UTF-16, even if the underlying representation isn't UTF-16
>> 2) S.charCodeAt(S.indexOf(X)) always returns the same kind of information
>> (a Unicode code point), regardless of whether X is in the BMP or not
>>
>> Even though this is a breaking change from ES-5, I support it
>> whole-heartedly.... but I expect breakage to be very limited. Provided that
>> the implementation does not restrict the storage of reserved code points
>> (D800-DFFF), it should be possible for users using String as immutable
>> C-arrays to keep doing so. Users doing surrogate pair decomposition will
>> probably find that their code "just works", as those code points will never
>> appear in legitimate strings of Unicode code points.  Users creating Strings
>> with surrogate pairs will need to re-tool, but this is a small burden and
>> these users will be at the upper strata of Unicode-foodom.  I suspect that
>> 99.99% of users will find that this change will fix bugs in their code when
>> dealing with non-BMP characters.
>>
>> Mike Samuel, there would never be a supplementary code unit to match, as the
>> return value of [[Get]] would be a code point.
>>
>> Shawn Steele, I don't understand this comment:
>>
>>     Also, the “trick” I think, is encoding to surrogate pairs (illegally,
>>     since UTF8 doesn’t allow that) vs decoding to UTF16.
>>
>>
>> Why do we care about the UTF-16 representation of particular codepoints?
>> Why can't the new functions just encode the Unicode string as UTF-8 and URI
>> escape it?
>>
>> Mike Samuel, can you explain why you are en/decoding UTF-16 when
>> round-tripping through the DOM?  Does the DOM specify UTF-16 encoding? If it
>> does, that's silly.  Both ES and DOM should specify "Unicode" and let the
>> data interchange format be an implementation detail.  It is an unfortunate
>> accident of history that UTF-16 surrogate pairs leak their abstraction into
>> ES Strings, and I believe it is high time we fixed that.
>>
>> Wes
>>
>> --
>> Wesley W. Garland
>> Director, Product Development
>> PageMail, Inc.
>> +1 613 542 2787 x 102
>>
>> _______________________________________________
>> es-discuss mailing list
>> es-discuss at mozilla.org
>> https://mail.mozilla.org/listinfo/es-discuss
>>
>>
>