Full Unicode strings strawman

Allen Wirfs-Brock allen at wirfs-brock.com
Mon May 16 12:15:59 PDT 2011

On May 16, 2011, at 11:30 AM, Mike Samuel wrote:

> 2011/5/16 Allen Wirfs-Brock <allen at wirfs-brock.com>:
>> I tried to post a pointer to this strawman on this list a few weeks ago, but
>> apparently it didn't reach the list for some reason.
>> Feed back would be appreciated:
>> http://wiki.ecmascript.org/doku.php?id=strawman:support_full_unicode_in_strings
> Will this change the behavior of character groups in regular
> expressions?  Would myString.match(/^.$/)[0].length ever have length
> 2?   Would it ever match a supplemental codepoint?

No, supplemental codepoints are single string characters, and RegExp matching operates on such characters.  A string could, of course, contain character sequences that correspond to UTF-8, UTF-16, or other multi-unit encodings.  However, from the perspective of Strings and RegExp, those encodings would be multi-character sequences, just like they are today.  The only ES functions currently proposed that would deal with multi-character encodings of supplemental codepoints are the URI handling functions.  However, it may be a good idea to add string-to-string UTF-8 and UTF-16 encode/decode functions that simply do the encode/decode and don't have all the other processing involved in encodeURI/decodeURI.
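
To make the encode/decode idea concrete, here is a minimal sketch of what a UTF-16 pair of helpers might compute. The names utf16Encode and utf16Decode are hypothetical and do not come from the strawman. Because current engines cannot hold a supplemental codepoint in one string element, the sketch models full-Unicode characters as an array of numbers; under the proposal both sides would presumably be strings.

```javascript
// Hypothetical helpers, not part of the strawman or any ES specification.
// Codepoints are modeled as numbers (assumption: valid scalar values).

// Turn a list of codepoints into a string of UTF-16 code units.
function utf16Encode(codePoints) {
  var out = "";
  for (var i = 0; i < codePoints.length; i++) {
    var cp = codePoints[i];
    if (cp > 0xFFFF) {
      // Supplemental codepoint: split into a high/low surrogate pair.
      var offset = cp - 0x10000;
      out += String.fromCharCode(0xD800 + (offset >> 10),
                                 0xDC00 + (offset & 0x3FF));
    } else {
      out += String.fromCharCode(cp);
    }
  }
  return out;
}

// Turn a string of UTF-16 code units back into a list of codepoints.
// Unpaired surrogates are passed through unchanged.
function utf16Decode(units) {
  var out = [];
  for (var i = 0; i < units.length; i++) {
    var hi = units.charCodeAt(i);
    var lo = i + 1 < units.length ? units.charCodeAt(i + 1) : -1;
    if (hi >= 0xD800 && hi <= 0xDBFF && lo >= 0xDC00 && lo <= 0xDFFF) {
      out.push(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00));
      i++; // consume the low surrogate as well
    } else {
      out.push(hi);
    }
  }
  return out;
}
```

For example, utf16Encode([0x1D11E]) yields the two-character string "\ud834\udd1e", and utf16Decode maps that string back to the single codepoint 0x1D11E.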

> How would the below, which replaces orphaned surrogates with U+FFFD
> when strings are viewed as sequences of UTF-16 code units behave?
> myString.replace( /[\ud800-\udbff](?![\udc00-\uffff])/g, "\ufffd")
>    .replace( /(^|[^\ud800-\udbff])([\udc00-\udfff])/g, "\ufffd")

Exactly as it currently does, assuming it was applied to a string that didn't contain any codepoints greater than \uffff.  If the string contained any codepoints > \uffff, those characters would not match the pattern and would not be replaced.
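
As a reference point for the "as it currently does" case, the following is a rough sketch of orphaned-surrogate replacement on today's code-unit strings. It is a single-pass variant, not the two-pass pattern quoted above, and the function name replaceOrphanedSurrogates is made up for illustration. It says nothing about how a string holding a codepoint > \uffff would be treated.

```javascript
// Hypothetical helper: replace unpaired surrogates with U+FFFD.
// The first alternative consumes well-formed pairs so they are kept;
// any surrogate that reaches the second alternative is unpaired.
function replaceOrphanedSurrogates(s) {
  return s.replace(/[\ud800-\udbff][\udc00-\udfff]|[\ud800-\udfff]/g,
    function (match) {
      return match.length === 2 ? match : "\ufffd";
    });
}
```

Matching the well-formed pair first avoids the need for lookahead and keeps the character that precedes an orphaned low surrogate intact.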

The important thing to keep in mind here is that under this proposal, a supplemental codepoint is a single logical character.  For example, using a random character that isn't in the BMP:
"\u+02defc" === "\ud877\udefc";  // this is false
"\u+02defc".length === 1;  // this is true
"\ud877\udefc".length === 2;  // this is true

Existing code that manipulates surrogate pairs continues to work unmodified because such code is explicitly manipulating pairs of characters.  However, such code might produce unexpected results if handed a string containing a codepoint > \uffff.  But it takes an explicit action by someone to introduce such an enhanced character into a string.
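
For illustration, this is the kind of existing pair-manipulating code that is meant: a small sketch (with a hypothetical name, countCodePoints) that counts logical characters by explicitly treating a high surrogate followed by a low surrogate as one unit. It runs on today's strings, where every element is at most \uffff.

```javascript
// Hypothetical example of surrogate-aware code written against
// 16-bit string elements. Each well-formed pair counts once;
// unpaired surrogates count as one character each.
function countCodePoints(s) {
  var count = 0;
  for (var i = 0; i < s.length; i++) {
    var c = s.charCodeAt(i);
    if (c >= 0xD800 && c <= 0xDBFF && i + 1 < s.length) {
      var next = s.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // skip the low surrogate of a well-formed pair
      }
    }
    count++;
  }
  return count;
}
```

Here countCodePoints("a\ud834\udd1eb") gives 3 even though the string's length is 4.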

>> Allen