Are regex character class set operations (subtraction, intersection) worth the parsing complexity?

Steve steves_list at hotmail.com
Sat Dec 22 19:40:35 PST 2007


C-style syntax is hard to parse in general, and regex literals are 
particularly tricky. Many kinds of tools (syntax highlighters, minifiers, 
etc.) need to parse them accurately, yet most ECMAScript-based tools of 
this kind don't. (I could start naming high-profile tools with edge-case 
regex-syntax parsing bugs, but it should be obvious that the task is not 
entirely trivial.) The ES4 regex proposals make this harder in several 
ways, and the worst of them, from the perspective of regex syntax parser 
complexity, is the java.util.regex-inspired, arbitrarily nestable character 
class subtraction and intersection syntax.
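
To make the difficulty concrete, here is a minimal sketch of the scan a 
highlighter or minifier has to perform just to find where a regex literal 
ends. The function name findRegexLiteralEnd is hypothetical. The sketch 
assumes the caller already knows that the slash at `start` opens a regex 
literal rather than a division (a separate problem), and it assumes flat 
character classes in engines that accept an unescaped "/" inside a class.

```javascript
// Hypothetical helper: given source text and the index of the opening "/",
// return the index of the closing "/" of the regex literal, or -1 if the
// literal is unterminated.
function findRegexLiteralEnd(src, start) {
  let inClass = false;
  for (let i = start + 1; i < src.length; i++) {
    const ch = src[i];
    if (ch === "\\") { i++; continue; } // skip the escaped character
    if (ch === "\n") return -1;         // a literal cannot span lines
    if (inClass) {
      if (ch === "]") inClass = false;  // flat classes: first "]" closes
    } else if (ch === "[") {
      inClass = true;
    } else if (ch === "/") {
      return i;                         // unescaped "/" outside a class
    }
  }
  return -1;
}

findRegexLiteralEnd("/[/]/.test(s)", 0); // 4: the "/" inside [...] is skipped
findRegexLiteralEnd("/a\\/b/g", 0);      // 5: the escaped "/" is skipped
```

With nestable classes, the single inClass flag is no longer enough, because 
the first "]" may close an inner class rather than the outer one.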

Now, I understand that the feature is powerful (and I assume also quite 
useful in regexes that make heavy use of ES4's Unicode property tokens), 
but it effectively makes it impossible to parse ES4 regex syntax using ES4 
regexes, which lack the recursion support of PCRE and Perl and the 
balancing groups of .NET. And considering that java.util.regex is the only 
major regex library to include full character class set operations (.NET 
only does class subtraction), I doubt people would miss the feature much.
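
As a rough illustration of the nesting problem, assuming the 
java.util.regex-style syntax in which a class may contain another bracketed 
class: a regex without recursion can match a flat class, but it stops at 
the first closing bracket of a nested one, whereas a hand-written scanner 
needs a depth counter. The names flatClass and findClassEnd are made up for 
this sketch, which ignores corner cases such as a literal "]" at the start 
of a class.

```javascript
// Flat class matcher: "[", then escapes or any character other than "]"
// and "\", then the first unescaped "]".
const flatClass = /\[(?:\\.|[^\]\\])*\]/;

// Hypothetical depth-counting scanner: returns the index of the "]" that
// closes the class opened at `start`, or -1 if it is never closed.
function findClassEnd(pattern, start) {
  let depth = 0;
  for (let i = start; i < pattern.length; i++) {
    const ch = pattern[i];
    if (ch === "\\") { i++; continue; }  // skip the escaped character
    if (ch === "[") {
      depth++;
    } else if (ch === "]" && depth > 0 && --depth === 0) {
      return i;
    }
  }
  return -1;
}

const nested = "[a-z&&[^aeiou]]";
flatClass.exec(nested)[0]; // "[a-z&&[^aeiou]", one bracket short
findClassEnd(nested, 0);   // 14, the index of the outer "]"
```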

Of course, adding recursion support to existing regex syntax parsers is 
probably not all that difficult in most cases. Nevertheless, I'm interested 
in what others think about the character class subtraction and intersection 
features. Personally, I think allowing only one level of character class 
nesting might be a reasonable compromise, especially since people could 
emulate more levels of nesting using lookahead anyway.
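
The lookahead emulation can be sketched as follows, using only ES3 regex 
features. The Java-style patterns in the comments are what each regex is 
meant to stand in for; the variable names are illustrative only.

```javascript
// Subtraction, Java-style [a-z&&[^aeiou]]:
// a lowercase ASCII letter that is not a vowel.
const consonant = /^(?![aeiou])[a-z]$/;

// Intersection plus subtraction, Java-style [a-z&&[b-m]&&[^aeiou]]:
// each extra set operation becomes one more lookahead before the class.
const earlyConsonant = /^(?=[b-m])(?![aeiou])[a-z]$/;

consonant.test("k");      // true
consonant.test("e");      // false (vowel is subtracted)
earlyConsonant.test("d"); // true
earlyConsonant.test("t"); // false (outside b-m)
```

Each lookahead tests the same single character that the final class then 
consumes, so the constraints combine without any nesting in the class 
itself.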
 




More information about the Es4-discuss mailing list