Questions regarding ES6 Unicode regular expressions

Allen Wirfs-Brock allen at wirfs-brock.com
Tue Aug 26 10:01:20 PDT 2014


I've thought about this a bit. I was initially inclined to agree with the idea of extending the existing character classes, similar to what Mathias proposes. But I now think that is probably not a very good idea and that what is currently spec'ed (essentially that the /u flag doesn't change the meaning of \w, \d, etc.) is the better path.

The basic issue I see is backwards compatibility and evolving existing code to use /u patterns.

I suspect that there is plenty of JS code in the world that does something more or less equivalent to `parseInt(/\s*(\d+)/.exec(input)[1])`.

Note that `parseInt` is only prepared to recognize the digits U+0030-U+0039.
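
A minimal sketch of that limit follows. The sample strings are made up for illustration: the same number written in ASCII digits, in Arabic-Indic digits (U+0660-U+0669) and in fullwidth digits (U+FF10-U+FF19).

```javascript
// parseInt only understands ASCII digits (U+0030-U+0039).
const asciiResult = parseInt("42", 10);                 // ASCII digits parse normally
const arabicIndicResult = parseInt("\u0664\u0662", 10); // Arabic-Indic digits are not recognized
const fullwidthResult = parseInt("\uFF14\uFF12", 10);   // fullwidth digits are not recognized either

console.log(asciiResult, arabicIndicResult, fullwidthResult); // 42 NaN NaN
```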

It seems to me that we want programmers to start migrating to full Unicode regular expressions without having to do a major logic rewrite of their code. For example, ideally the above expression could simply be replaced by `parseInt(/\s*(\d+)/u.exec(input)[1])` and everything in the application would continue to work unchanged.

That won't be the case if we redefine, as Mathias proposes, `/\d/u` to be equivalent to 
`/[0-9\u0660-\u0669\u06F0-\u06F9\u07C0-\u07C9\u0966-\u096F\u09E6-\u09EF\u0A66-\u0A6F\u0AE6-\u0AEF\u0B66-\u0B6F\u0BE6-\u0BEF\u0C66-\u0C6F\u0CE6-\u0CEF\u0D66-\u0D6F\u0DE6-\u0DEF\u0E50-\u0E59\u0ED0-\u0ED9\u0F20-\u0F29\u1040-\u1049\u1090-\u1099\u17E0-\u17E9\u1810-\u1819\u1946-\u194F\u19D0-\u19D9\u1A80-\u1A89\u1A90-\u1A99\u1B50-\u1B59\u1BB0-\u1BB9\u1C40-\u1C49\u1C50-\u1C59\uA620-\uA629\uA8D0-\uA8D9\uA900-\uA909\uA9D0-\uA9D9\uA9F0-\uA9F9\uAA50-\uAA59\uABF0-\uABF9\uFF10-\uFF19]|\uD801[\uDCA0-\uDCA9]|\uD804[\uDC66-\uDC6F\uDCF0-\uDCF9\uDD36-\uDD3F\uDDD0-\uDDD9\uDEF0-\uDEF9]|\uD805[\uDCD0-\uDCD9\uDE50-\uDE59\uDEC0-\uDEC9]|\uD806[\uDCE0-\uDCE9]|\uD81A[\uDE60-\uDE69\uDF50-\uDF59]|\uD835[\uDFCE-\uDFFF]/u`
rather than

`/[0-9]/u`

We can apply similar logic to \w and even \s.

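
To make the compatibility risk concrete, here is a minimal sketch. The `widenedDigits` pattern is a hypothetical stand-in for a Unicode-aware \d (it adds only the Arabic-Indic range from the long list above), and the input string is invented for illustration.

```javascript
// \d as currently spec'ed under /u: ASCII digits only.
const asSpeced = /\s*(\d+)/u;
// Hypothetical stand-in for a widened \d: ASCII plus Arabic-Indic digits.
const widenedDigits = /\s*([0-9\u0660-\u0669]+)/u;

// Two Arabic-Indic digits, a space, then the ASCII digits "42".
const input = "  \u0664\u0662 42";

// The spec'ed pattern skips ahead and captures "42".
const specedValue = parseInt(asSpeced.exec(input)[1], 10);
// The widened pattern captures the Arabic-Indic digits, which parseInt rejects.
const widenedValue = parseInt(widenedDigits.exec(input)[1], 10);

console.log(specedValue, widenedValue); // 42 NaN
```

The same downstream code silently changes from producing 42 to producing NaN, which is the kind of breakage the unchanged \d definition avoids.
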
Instead, we should leave the definitions of \d, \w and \s unchanged and plan to adopt the already established convention that `\p{<Unicode property>}` is the notation for matching Unicode categories. See http://www.regular-expressions.info/unicode.html

I think digesting all the \p{} possibilities is too much to do for ES6, so I suggest that for ES6 we simply reserve the \p{<characters>} and \P{<characters>} syntax within /u patterns. A \p proposal can then be developed for ES7.

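
For a sense of what such notation looks like in use, here is a sketch. It assumes an engine that accepts Unicode property escapes inside /u patterns, which is not part of ES6 (the feature was standardized later, in ES2018). The `Nd` (decimal number) property value and the sample strings are chosen only for illustration.

```javascript
// Assumes engine support for \p{...} in /u patterns (not part of ES6).
const anyDecimalDigit = /^\p{Nd}+$/u; // any Unicode decimal digit
const noDecimalDigit = /^\P{Nd}+$/u;  // \P negates the property
const asciiDigitsOnly = /^\d+$/u;     // \d keeps its ASCII-only meaning

const arabicIndic = "\u0664\u0662";

const matchesProperty = anyDecimalDigit.test(arabicIndic); // true
const matchesBackslashD = asciiDigitsOnly.test(arabicIndic); // false
const matchesNegated = noDecimalDigit.test("abc");          // true

console.log(matchesProperty, matchesBackslashD, matchesNegated);
```
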
I see one remaining issue:

In ES5 (and ES6), `/[a-z]/i` does not match U+017F (ſ) or U+212A (K) because the ES canonicalization algorithm excludes mapping code points > 127 that toUpperCase to code points < 128.

However, as currently spec'ed, the ES6 canonicalization algorithm for /u RegExps does not include that >127/<128 exclusion. It maps U+017F to "S", which matches.

This is probably a minor variation from the ES5 behavior, but we should be sure it is a desirable and tolerable change, as we presumably could also apply the >127/<128 filter to /u canonicalization.
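
The difference can be sketched as follows. The /u result shown assumes an engine whose /u canonicalization has no >127/<128 filter, as described above; an engine that applied the filter would report false there as well. The variable names are illustrative only.

```javascript
const longS = "\u017F";  // LATIN SMALL LETTER LONG S
const kelvin = "\u212A"; // KELVIN SIGN

const legacy = /[a-z]/i;       // ES5-style canonicalization, with the >127/<128 exclusion
const unicodeMode = /[a-z]/iu; // /u canonicalization, assumed to have no such exclusion

const legacyLongS = legacy.test(longS);        // false
const legacyKelvin = legacy.test(kelvin);      // false
const unicodeLongS = unicodeMode.test(longS);  // true under the stated assumption

console.log(legacyLongS, legacyKelvin, unicodeLongS);
```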

So, here is a summary of my proposal:
1) Don't change the current definitions of \d, \w, \s when used in /u regular expressions.
2) Decide whether the current ES6 /u canonicalization algorithm is correct or whether it should not translate code points > 127 that map to code points < 128.
3) Reserve the syntax \p{<characters>} and \P{<characters>} within /u RegExp patterns.
4) Start to develop a \p{ } proposal for ES7.

Allen
