Questions regarding ES6 Unicode regular expressions

Mathias Bynens mathias at qiwi.be
Tue Aug 26 11:15:10 PDT 2014


On 26 Aug 2014, at 19:01, Allen Wirfs-Brock <allen at wirfs-brock.com> wrote:

> I've thought about this a bit. I was initially inclined to agree with the idea of extending the existing character classes similar to what Mathias proposes. But I now think that is probably not a very good idea and that what is currently spec'ed (essentially that the /u flag doesn't change the meaning of \w, \d, etc.) is the better path. […] It seems to me that we want programmers to start migrating to full Unicode regular expressions without having to do a major logic rewrite of their code. For example, ideally the above expression could simply be replaced by `parseInt(/\s*(\d+)/u.exec(input)[1])` and everything in the application could continue to work unchanged.

I see your point, but I disagree with the notion that we must absolutely maintain backwards compatibility in this case. The fact that the new flag is opt-in gives us an opportunity to improve behavior without obsessing about back-compat, similar to how the strict mode opt-in is used to make all sorts of things better. When [evangelizing `/u`](https://mathiasbynens.be/notes/es6-unicode-regex), we can educate developers and tell them to not blindly/needlessly add `/u` to their existing regular expressions.
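To make the trade-off concrete, here is a minimal sketch of what the currently spec'ed behavior means for the quoted expression, assuming an engine that implements `/u`. The helper name `extractNumber` and the sample inputs are hypothetical and only serve as illustration.

```javascript
// Hypothetical helper wrapping the expression from the quoted message.
function extractNumber(input) {
  const match = /\s*(\d+)/u.exec(input);
  // Guard against a failed match instead of indexing into null.
  return match === null ? NaN : parseInt(match[1], 10);
}

// ASCII digits match with or without /u.
const asciiResult = extractNumber('  42 apples');

// U+0664 U+0662 are ARABIC-INDIC DIGIT FOUR and TWO. As currently spec'ed,
// \d under /u still means [0-9], so nothing matches here.
const arabicResult = extractNumber('\u0664\u0662');

console.log(asciiResult, arabicResult);
```

Under my proposal the second call would find a match, which is exactly the kind of behavior change Allen is worried about for code that feeds the result to `parseInt`.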

> Instead, we should leave the definitions of \d, \w and \s unchanged and plan to adopt the already established convention that `\p{<Unicode property>}` is the notation for matching Unicode categories. See http://www.regular-expressions.info/unicode.html 

We could do both: improve `\d` and `\w` now, and add `\p{property}` and `\P{property}` later. Anyhow, I’ve filed https://bugs.ecmascript.org/show_bug.cgi?id=3157 for reserving `\p{…}`/`\P{…}`.

> I think digesting all the \p{} possibilities is too much to do for ES6, so I suggest that for ES6 we simply reserve the \p{<characters>} and \P{<characters>} syntax within /u patterns. A \p proposal can then be developed for ES7.

Sounds good to me.
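A rough sketch of why the reservation only makes sense within `/u` patterns, assuming an engine with the usual web-compatible (Annex B) handling of unknown escapes. The helper `tryUnicodePattern` is a hypothetical name; what an engine does with `\p{…}` under `/u` depends on whether it only reserves the syntax or already implements a later property-escape proposal, so the sketch handles both cases.

```javascript
// Without /u, `\p` is an identity escape for the letter "p", and the
// braces and their contents are matched literally.
const legacyMatchesLiteral = /\p{Nd}/.test('p{Nd}');
const legacyMatchesDigit = /\p{Nd}/.test('7');

// With /u, the same source is either rejected (syntax reserved) or treated
// as a property escape. The RegExp constructor keeps a SyntaxError catchable.
function tryUnicodePattern(source) {
  try {
    return new RegExp(source, 'u');
  } catch (error) {
    return null; // this engine rejects \p{…} under /u
  }
}

const unicodePattern = tryUnicodePattern('\\p{Nd}');

// Either way, the /u pattern no longer matches the literal text "p{Nd}".
const unicodeMatchesLiteral =
  unicodePattern !== null && unicodePattern.test('p{Nd}');

console.log(legacyMatchesLiteral, legacyMatchesDigit, unicodeMatchesLiteral);
```

Existing non-`/u` patterns may already rely on the literal meaning, which is why the syntax can only be claimed behind the opt-in flag.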

> I see one remaining issue:
> In ES5 (and ES6): `/[a-z]/i` does not match U+017F (ſ) or U+212A (K) because the ES canonicalization algorithm excludes mapping code points > 127 that toUpperCase to code points < 128.
> However, as currently spec'ed, the ES6 canonicalization algorithm for /u RegExps does not include that >127/<128 exclusion. It maps U+017F to "S" which matches.
> This is probably a minor variation from the ES5 behavior, but we should be sure it is a desirable and tolerable change, as we could presumably also apply the >127/<128 filter to /u canonicalization.

This is a useful feature, and the explicit opt-in makes the small back-compat break acceptable IMHO.
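For reference, a small sketch of the difference being discussed, assuming `/u` canonicalization is kept without the >127/<128 filter as currently spec'ed. The variable names are only illustrative.

```javascript
const longS = '\u017F';      // ſ LATIN SMALL LETTER LONG S
const kelvinSign = '\u212A'; // K KELVIN SIGN

const results = {
  // ES5-style canonicalization: non-ASCII code points are not folded to ASCII.
  longSWithoutU: /[a-z]/i.test(longS),
  kelvinWithoutU: /[a-z]/i.test(kelvinSign),
  // /u canonicalization without the filter: both fold into the a-z range.
  longSWithU: /[a-z]/iu.test(longS),
  kelvinWithU: /[a-z]/iu.test(kelvinSign),
};

console.log(results);
```

The two `/i` tests come out `false` and the two `/iu` tests come out `true`, so the only patterns affected are those that explicitly opt in to `/u`.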


