Are regex character class set operations (subtraction, intersection) worth the parsing complexity?
steves_list at hotmail.com
Sat Dec 22 19:40:35 PST 2007
C-style syntax is hard to parse in general, and regex literals can be
particularly tricky. Many kinds of tools (syntax highlighters, minifiers,
etc.) need to parse them accurately, yet most ECMAScript-based tools don't.
(I could start naming high-profile tools with edge-case regex-syntax parsing
bugs, but it should be obvious that the task is not trivial.) The ES4 regex
proposals make this even harder in several ways. The worst of them, from the
perspective of regex syntax parser complexity, is the
java.util.regex-inspired character class subtraction and intersection
syntax, which can nest to arbitrary depth.
Now, I understand that the feature is powerful (and I assume also quite
useful in regexes that make heavy use of ES4's Unicode property tokens), but
it effectively makes it impossible to parse ES4 regex syntax using ES4
regexes, which lack the recursion (Perl, PCRE) and balancing-group (.NET)
constructs needed to match nested brackets. And considering that
java.util.regex is the only major regex library to include full character
class set operations (.NET only does class subtraction), I don't think
people would miss the feature much.
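
To make the nesting problem concrete, here is a minimal sketch of what a tool has to do to find the end of a character class once classes can nest. It assumes Java-style syntax, where an unescaped "[" inside a class opens a nested class. The findClassEnd helper is hypothetical and ignores flavor-specific details such as a literal "]" placed first in a class.

```javascript
// Hypothetical helper: returns the index of the "]" that closes the class
// opened at src[start], or -1 if the class is never closed.
// Assumes Java-style nesting: an unescaped "[" inside a class opens a
// nested class, so brackets must be counted rather than just searched for.
function findClassEnd(src, start) {
  if (src[start] !== "[") {
    throw new Error("expected '[' at index " + start);
  }
  let depth = 0;
  for (let i = start; i < src.length; i++) {
    const ch = src[i];
    if (ch === "\\") {
      i++; // skip the escaped character, whatever it is
      continue;
    }
    if (ch === "[") {
      depth++;
    } else if (ch === "]") {
      depth--;
      if (depth === 0) return i;
    }
  }
  return -1;
}

// The class in "[a-z&&[^aeiou]]+x" closes at index 14; a scanner that stops
// at the first unescaped "]" would wrongly stop at index 13.
const nestedEnd = findClassEnd("[a-z&&[^aeiou]]+x", 0);
```

The depth counter is exactly the part that a regex without recursion cannot express for unbounded nesting; with a fixed maximum depth the counter could be unrolled into a finite pattern.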
Of course, adding recursion support to existing regex syntax parsers is
probably not all that difficult in most cases. Nevertheless, I'm interested
in what others think about the character class subtraction and intersection
features. Personally, I think allowing only one level of character class
nesting might be a reasonable compromise, especially since people could
emulate deeper nesting using lookahead anyway.
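
As a rough illustration of the lookahead emulation, the following sketch rewrites two set operations as a lookahead followed by a plain class. The Java and .NET patterns in the comments are shown only for comparison; the variable names are made up for the example.

```javascript
// Subtraction. Java: [a-z&&[^aeiou]]   .NET: [a-z-[aeiou]]
// Emulation: reject vowels with a negative lookahead, then match [a-z].
const consonant = /(?![aeiou])[a-z]/g;

// Intersection. Java: [a-z&&[m-z]]
// Emulation: require [m-z] with a positive lookahead, then match [a-z].
const lateLetter = /(?=[m-z])[a-z]/g;

const consonants = "regex".match(consonant).join("");   // "rgx"
const lateLetters = "regex".match(lateLetter).join(""); // "rx"
```

Each extra level of nesting becomes one more lookahead in front of the class, so the emulation gets more verbose as nesting deepens, but it never needs nested brackets.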