Full Unicode strings strawman

Mark Davis ☕ mark at macchiato.com
Mon May 16 15:03:13 PDT 2011


A correction.

U+D800 is indeed a code point: http://www.unicode.org/glossary/#Code_Point. It
is defined for usage in Unicode Strings (see
http://www.unicode.org/glossary/#Unicode_String) because often it is useful
for implementations to be able to allow it in processing.

It does, however, have a special status, and is not representable in
well-formed UTF-xx, for general interchange.
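
To make the distinction concrete, here is a minimal TypeScript sketch. The
helper names isSurrogate and isWellFormedUtf16 are made up for illustration
and are not part of any standard API. It assumes strings are sequences of
16-bit code units, as in ES5: a string can hold U+D800 on its own, but such a
string is not well-formed UTF-16.

```typescript
// A lone high surrogate: a valid code point value, stored as one code unit.
const lone: string = "\uD800";

// True when a code point lies in the surrogate range U+D800..U+DFFF.
function isSurrogate(codePoint: number): boolean {
  return codePoint >= 0xd800 && codePoint <= 0xdfff;
}

// True when every surrogate code unit in `s` belongs to a proper
// high (D800..DBFF) + low (DC00..DFFF) pair.
function isWellFormedUtf16(s: string): boolean {
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      // High surrogate: the next unit must be a low surrogate.
      const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next < 0xdc00 || next > 0xdfff) return false;
      i++; // skip the low surrogate we just validated
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      // Low surrogate with no preceding high surrogate.
      return false;
    }
  }
  return true;
}
```

Under these assumptions, isSurrogate(lone.charCodeAt(0)) holds while
isWellFormedUtf16(lone) does not, which is the "special status" described
above.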

A quick note on the intro to the doc, with a bit more history.

> ECMAScript currently only directly supports the 16-bit basic multilingual
> plane (BMP) subset of Unicode which is all that existed when ECMAScript was
> first designed. Since then Unicode has been extended to require up to
> 21-bits per code.

   1. Unicode was extended up to U+10FFFF in version 2.0, in July of
   1996.
   2. ECMAScript, according to Wikipedia, was first issued in 1997. So
   actually for all of ECMAScript's existence, it has been obsolete in its
   usage of Unicode.
      - (It isn't quite as bad as that, since we made provision for
      supplementary characters early on, but the first *actual*
      supplementary characters appeared in 2003.)
   3. In 2003, Markus Scherer proposed support for Unicode in ECMAScript v4:
      1. https://sites.google.com/site/markusicu/unicode/es/unicode-2003
      2. https://sites.google.com/site/markusicu/unicode/es/i18n-2003


Mark

*— Il meglio è l’inimico del bene —*


On Mon, May 16, 2011 at 14:42, Boris Zbarsky <bzbarsky at mit.edu> wrote:

> On 5/16/11 4:38 PM, Wes Garland wrote:
>
>> Two great things about strings composed of Unicode code points:
>>
> ...
>
>  Even though this is a breaking change from ES-5, I support it
>> whole-heartedly.... but I expect breakage to be very limited. Provided
>> that the implementation does not restrict the storage of reserved code
>> points (D800-DF00)
>>
>
> Those aren't code points at all.  They're just not Unicode.
>
> If you allow storage of such, then you're allowing mixing Unicode strings
> and "something else" (whatever the something else is), with most likely
> bad results.
>
> Most simply, assigning a DOMString containing surrogates to a JS string
> should collapse the surrogate pairs into the corresponding codepoint if JS
> strings really contain codepoints...
>
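The "collapse" described here can be sketched as follows. This is only an
illustration under the assumption that the source string is a sequence of
UTF-16 code units; toCodePoints is a hypothetical helper, not an API of the
DOM or of any engine. Each valid high+low pair becomes one supplementary code
point, and unpaired surrogates are passed through unchanged.

```typescript
// Convert UTF-16 code units into code points, collapsing surrogate pairs.
function toCodePoints(s: string): number[] {
  const out: number[] = [];
  for (let i = 0; i < s.length; i++) {
    const hi = s.charCodeAt(i);
    if (hi >= 0xd800 && hi <= 0xdbff && i + 1 < s.length) {
      const lo = s.charCodeAt(i + 1);
      if (lo >= 0xdc00 && lo <= 0xdfff) {
        // Standard UTF-16 decoding: 10 bits from each half, offset by 0x10000.
        out.push(0x10000 + ((hi - 0xd800) << 10) + (lo - 0xdc00));
        i++; // consume the low surrogate as well
        continue;
      }
    }
    // BMP code unit or unpaired surrogate: kept as-is.
    out.push(hi);
  }
  return out;
}
```

With this sketch, the two code units D83D DE00 come out as the single code
point 1F600, while a string ending in a lone D800 keeps that value as its last
element. The second case is the mixing that the rest of this thread argues
about.
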
> The only way to make this work is if either DOMString is redefined or
> DOMString and full Unicode strings are different kinds of objects.
>
>
>  Users doing surrogate pair decomposition will probably find that their
>> code "just works"
>>
>
> How, exactly?
>
>
>  Users creating Strings with surrogate pairs will need to
>> re-tool
>>
>
> Such users would include the DOM, right?
>
>
>  but this is a small burden and these users will be at the upper
>> strata of Unicode-foodom.
>>
>
> You're talking every single web developer here.  Or at least every single
> web developer who wants to work with Devanagari text.
>
>
>  I suspect that 99.99% of users will find that
>> this change will fix bugs in their code when dealing with non-BMP
>> characters.
>>
>
> Not unless DOMString is changed or the interaction between the two very
> carefully defined in failure-proof ways.
>
>
>  Why do we care about the UTF-16 representation of particular
>> codepoints?
>>
>
> Because of DOMString's use of UTF-16, at least (forced on it by the fact
> that that's what ES used to do, but here we are).
>
>
>  Mike Samuel, can you explain why you are en/decoding UTF-16 when
>> round-tripping through the DOM?  Does the DOM specify UTF-16 encoding?
>>
>
> Yes.
>
>
>  If it does, that's silly.
>>
>
> It needed to specify _something_, and UTF-16 was the thing that was
> compatible with how scripts work in ES.  Not to mention the Java legacy of
> the DOM...
>
>
>  Both ES and DOM should specify "Unicode" and let the data interchange
>> format be an implementation detail.
>>
>
> That's fine if _both_ are changed.  Changing just one without the other
> would just cause problems.
>
>
>  It is an unfortunate accident of history that UTF-16 surrogate pairs leak
>> their
>> abstraction into ES Strings, and I believe it is high time we fixed that.
>>
>
> If you can do that without breaking web pages, great.  If not, then we need
> to talk.  ;)
>
> -Boris