Questions regarding ES6 Unicode regular expressions

Claude Pache claude.pache at gmail.com
Tue Aug 26 12:45:17 PDT 2014


On 26 Aug 2014, at 20:15, Mathias Bynens <mathias at qiwi.be> wrote:

> On 26 Aug 2014, at 19:01, Allen Wirfs-Brock <allen at wirfs-brock.com> wrote:
> 
>> I've thought about this a bit. I was initially inclined to agree with the idea of extending the existing character classes, similar to what Mathias proposes. But I now think that is probably not a very good idea and that what is currently spec'ed (essentially that the /u flag doesn't change the meaning of \w, \d, etc.) is the better path. […] It seems to me that we want programmers to start migrating to full Unicode regular expressions without having to do a major logic rewrite of their code. For example, ideally the above expression could simply be replaced by `parseInt(/\s*(\d+)/u.exec(input)[1])` and everything in the application could continue to work unchanged.
> 
> I see your point, but I disagree with the notion that we must absolutely maintain backwards compatibility in this case. The fact that the new flag is opt-in gives us an opportunity to improve behavior without obsessing about back-compat, similar to how the strict mode opt-in is used to make all sorts of things better. When [evangelizing `/u`](https://mathiasbynens.be/notes/es6-unicode-regex), we can educate developers and tell them to not blindly/needlessly add `/u` to their existing regular expressions.
> 
>> Instead, we should leave the definitions of \d, \w and \s unchanged and plan to adopt the already established convention that `\p{<Unicode property>}` is the notation for matching Unicode categories. See http://www.regular-expressions.info/unicode.html 
> 
> We could do both: improve `\d` and `\w` now, and add `\p{property}` and `\P{property}` later. Anyhow, I’ve filed https://bugs.ecmascript.org/show_bug.cgi?id=3157 for reserving `\p{…}`/`\P{…}`.

The meaning of `\d` should not be changed; it is routinely used as a synonym of `[0-9]`. Changing its meaning would willfully introduce traps into the language, and it *will* produce bugs, for very little gain. It is much safer to learn to use `\pN` in the rare situations where one wants to match numerical characters in any script.
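
A small sketch of the trap in question. The variable names are made up for illustration. The last pattern uses `\p{Nd}`, the property-escape syntax that was only standardized later (ES2018) and that requires braces in JavaScript, so it stands in for the `\pN` shorthand written above rather than for anything available when this message was sent.

```javascript
// ASCII digits versus digits from another script (Arabic-Indic here).
const asciiDigits = "2014";
const arabicIndicDigits = "\u0662\u0660\u0661\u0664"; // U+0660..U+0669 are decimal digits

// With the semantics as spec'ed, /u leaves \d equivalent to [0-9].
const digitsOnly = /^\d+$/u;
console.log(digitsOnly.test(asciiDigits));       // true
console.log(digitsOnly.test(arabicIndicDigits)); // false

// Code that validates with \d and then calls parseInt relies on that:
// parseInt only understands ASCII digits.
console.log(parseInt(asciiDigits, 10));          // 2014
console.log(parseInt(arabicIndicDigits, 10));    // NaN

// Explicit opt-in for "decimal digit in any script"
// (assumption: an engine supporting ES2018 property escapes).
const anyScriptDigits = /^\p{Nd}+$/u;
console.log(anyScriptDigits.test(arabicIndicDigits)); // true
```

If `\d` were widened under `/u`, the second test would pass and the `NaN` would surface later in the program, which is the kind of bug being warned about.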

`\w` and `\b`, on the other hand, can be corrected: nobody would normally consider that there are two word boundaries in the middle of "fiancée", and such semantics are not useful, especially in Unicode-aware contexts (that is, in situations where you should use the `u` flag).
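
To make the "fiancée" case concrete, here is a minimal sketch of what the ASCII-only `\w` and `\b` do today. The class `[\p{L}\p{N}_]` at the end is only one possible notion of a Unicode-aware word character, chosen for illustration; it is an assumption, not a definition proposed in this thread, and it relies on property escapes that were standardized later (ES2018).

```javascript
const word = "fianc\u00E9e"; // "fiancée", with a precomposed é (U+00E9)

// ASCII-only \w does not treat "é" as a word character,
// so the word is split into two pieces.
const asciiPieces = word.match(/\w+/gu);
console.log(asciiPieces); // [ 'fianc', 'e' ]

// Positions where \b matches: start, before "é", after "é", end.
const boundaryIndexes = [...word.matchAll(/\b/gu)].map((m) => m.index);
console.log(boundaryIndexes); // [ 0, 5, 6, 7 ]

// Hypothetical Unicode-aware word class: letters, numbers, underscore.
const unicodePieces = word.match(/[\p{L}\p{N}_]+/gu);
console.log(unicodePieces); // [ 'fiancée' ]
```

The boundaries at indexes 5 and 6 are the two spurious ones in the middle of the word.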

—Claude



