Full Unicode based on UTF-16 proposal

Steven Levithan steves_list at hotmail.com
Fri Mar 23 06:30:28 PDT 2012


Norbert Lindenberg wrote:

> I've updated the proposal based on the feedback received so far. Changes
> are listed in the Updates section.
> http://norbertlindenberg.com/2012/03/ecmascript-supplementary-characters/

Cool.

From the proposal's Updates section:

> Indicated that "u" may not be the actual character for the flag for code
> point mode in regular expressions, as a "u" flag has already been proposed
> for Unicode-aware digit and word character matching.

I've been wondering whether it might be best for the /u flag to do three 
things at once, making it an all-around "support Unicode better" flag:

1. Switches from code unit to code point mode. /./gu matches any Unicode 
code point, among other benefits outlined by Norbert.

2. Makes \d\D\w\W\b\B match Unicode decimal digits and word characters. 
[0-9], [A-Za-z0-9_], and lookaround provide fallbacks if you want to match 
ASCII characters only while using /u.

3. [New proposal] Makes /i use Unicode casefolding rules. 
/ΣΤΙΓΜΑΣ/iu.test("στιγμας") == true.

Item number 3 is inspired by, but differs from, Java's lowercase u flag for 
Unicode casefolding. In Java, flag u itself enables Unicode casefolding and 
does not need to be paired with flag i (which is equivalent to ES's /i).
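
As a rough illustration of items 1 and 3, plus the ASCII fallbacks from item 2, here is a minimal sketch. It assumes an engine that implements a u flag with code point semantics and Unicode case folding under /iu (as ES2015 and later engines do); that shipped behavior may differ in detail from what is proposed here. The variable names are only for illustration.

```javascript
// Assumption: the engine supports the "u" flag (ES2015 or later).
const text = "a\uD834\uDF06b"; // "a", U+1D306 (one surrogate pair), "b"

// Item 1: "." matches a code unit without /u and a code point with /u.
const codeUnitCount = text.match(/./g).length;   // counts both surrogate halves
const codePointCount = text.match(/./gu).length; // counts the pair once

// Item 2: explicit ASCII classes stay ASCII-only regardless of the flag,
// so they work as fallbacks when only ASCII should match.
const asciiDigits = /^[0-9]+$/u;
const asciiWord = /^[A-Za-z0-9_]+$/u;
const arabicIndicThree = "\u0663"; // a Unicode decimal digit outside ASCII
const asciiDigitsOk = asciiDigits.test("2012");
const nonAsciiDigitRejected = !asciiDigits.test(arabicIndicThree);
const asciiWordOk = asciiWord.test("es_discuss2012");

// Item 3: case-insensitive matching with /iu, using the example above.
const sigmaMatches = /ΣΤΙΓΜΑΣ/iu.test("στιγμας");
```

In this sketch codeUnitCount is 4 and codePointCount is 3, which is the difference item 1 is about.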

As an aside, merging these three things would likely lead to /u seeing 
widespread use when dealing with anything more than ASCII, at least in 
environments where you don't have to worry about backcompat. This would help 
developers avoid stumbling on code unit issues in the small minority of 
cases where non-BMP characters are used or encountered. If /u's only purpose 
was to switch to code point mode, most likely it would be used *far* less 
often, and more developers would continue to get bitten by code-unit-based 
processing.

As for whether the switch to code-point-based matching should be universal 
or require /u (an issue that your proposal leaves open), IMHO it's better to 
require /u since it avoids the need for transforming \uxxxx[\uyyyy-\uzzzz] 
to [{\uxxxx\uyyyy}-{\uxxxx\uzzzz}] and [\uwwww-\uxxxx][\uDC00-\uDFFF] to 
[{\uwwww\uDC00}-{\uxxxx\uDFFF}], and additionally avoids at least three 
potentially breaking changes (two of which are explicitly mentioned in your 
proposal):

1. "[S]ome applications might have processed gunk with regular expressions 
where neither the 'characters' in the patterns nor the input to be matched 
are text."

2. "s.match(/^.$/)[0].length can now be 2."
I'll add, /.{3}/.exec(s)[0].length can now be anywhere between 3 and 6.

3. /./g.exec(s) can now increment the regex's lastIndex by 2.
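
To make the surrogate-pair patterns and changes 2 and 3 concrete, here is a small sketch. It again assumes an engine with a u flag that switches to code point matching (ES2015 or later), with the flagless regexes standing in for today's code-unit behavior. The sample character U+1D306 and the U+1D300 to U+1D356 range are arbitrary choices for illustration.

```javascript
// Assumption: the engine supports the "u" flag (ES2015 or later).
const astral = "\uD834\uDF06"; // U+1D306: one code point, two UTF-16 code units

// A range of supplementary characters written the code-unit way
// (\uxxxx[\uyyyy-\uzzzz]) and the code-point way.
const unitRange = /\uD834[\uDF00-\uDF56]/;    // lead surrogate + trail range
const pointRange = /[\u{1D300}-\u{1D356}]/u;  // same characters as one class
const unitRangeMatches = unitRange.test(astral);
const pointRangeMatches = pointRange.test(astral);

// Change 2: a single "." can cover two code units in code point mode.
const unitMatch = astral.match(/^.$/);   // null, since the string has 2 code units
const pointMatch = astral.match(/^.$/u); // matches the whole surrogate pair
const singleLength = pointMatch[0].length;

// /.{3}/ can match anywhere from 3 to 6 code units.
const mixed = "a" + astral + "b";                           // 3 code points, 4 code units
const mixedLength = /.{3}/u.exec(mixed)[0].length;
const maxLength = /.{3}/u.exec(astral.repeat(3))[0].length; // all supplementary

// Change 3: one match can advance lastIndex by 2.
const unitRe = /./g;
unitRe.exec(astral);
const unitLastIndex = unitRe.lastIndex;

const pointRe = /./gu;
pointRe.exec(astral);
const pointLastIndex = pointRe.lastIndex;
```

Here singleLength is 2, mixedLength is 4, maxLength is 6, and lastIndex ends up at 1 without the flag versus 2 with it. Code that assumes one code unit per match would see all of these differences if code point mode were applied universally.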

-- Steven Levithan



