regexp char class subtract and intersect

Russ Cox rsc at swtch.com
Tue Feb 13 17:48:39 PST 2007


Regarding the proposal at
http://developer.mozilla.org/es4/proposals/extend_regexps.html
to add intersection and subtraction of character classes,
it certainly sounds like a good idea in principle, but I am not
sure I agree about the particulars of the proposal.  (Of course
if the Unicode properties go off the table then it makes sense
to toss intersection and subtraction as well; I am assuming
that Unicode properties are still a planned addition.)

First, the proposal does not specify the binding precedence
of the new operators: surely [a-c\-b] means [ac], but does [a-c\-bb]
also mean [ac], or does it mean [acb]?  This is hardly a
show-stopper, just a detail missing from the rough sketch of a proposal.

More importantly, the proposal breaks the nice regexp syntax property,
inherited from egrep, that escaped punctuation always denotes
the corresponding literal (and that non-punctuation is always a
literal unless escaped).  Adding \- and \& as operators breaks
the rule and adds confusion.  Making \- special is particularly bad,
because - is already special in character classes.  I find the - rules
weird enough that I routinely use \- in character classes when
I want the literal so I don't have to remember when - is special and
when it isn't.

I suggest using unescaped & as the conjunction operator,
to preserve the property that \punctuation is always a literal.
It is true that this would affect existing character classes that
list &, but these can be diagnosed by the compiler as an
empty character class: [!@#$%&*()] is obviously (to the compiler)
a mistake, because the intersection is empty.
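
To illustrate the kind of diagnosis meant here, the following is a
minimal sketch in Python, not a real regexp compiler.  It assumes a
class body made only of literal characters and backslash escapes
(no ranges, no ^), and the helper names split_unescaped_amp and
is_empty_intersection are made up for this example.

```python
def split_unescaped_amp(body):
    """Split a class body on unescaped '&'.

    Assumption: a backslash makes the next character a literal, and
    every other character stands for itself (ranges are not handled).
    """
    parts, current, i = [], [], 0
    while i < len(body):
        ch = body[i]
        if ch == "\\" and i + 1 < len(body):
            current.append(body[i + 1])  # escaped punctuation is a literal
            i += 2
        elif ch == "&":
            parts.append(set(current))   # close the operand before '&'
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    parts.append(set(current))
    return parts


def is_empty_intersection(body):
    """True if the body uses '&' and its operands share no character."""
    parts = split_unescaped_amp(body)
    return len(parts) > 1 and not set.intersection(*parts)


print(is_empty_intersection("!@#$%&*()"))     # True: {!@#$%} and {*()} are disjoint
print(is_empty_intersection(r"!@#$%\&*()"))   # False: \& is a literal, no operator
```

A compiler could report the first case as a likely mistake and
suggest writing \& for the literal.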

I also suggest not adding an explicit subtraction operator but
instead using intersection with the complement, since there is
already a complement operator ^: [a-c&^bb] = [a-c&^b] = [ac],
[a-z&^by] = [a-z&^b&^y] = [ac-xz], etc.
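
The identities above can be checked with ordinary set operations.
This is only a sketch: UNIVERSE is an assumed stand-in for the full
character set (printable ASCII here), and chars is a made-up helper.

```python
import string

# Assumption: a small universe standing in for all characters.
UNIVERSE = set(string.printable)


def chars(lo, hi):
    """Set of characters from lo to hi inclusive, like the range lo-hi."""
    return {chr(c) for c in range(ord(lo), ord(hi) + 1)}


# [a-c&^bb] = [a-c&^b] = [ac]: intersecting with a complement subtracts.
a_to_c = chars("a", "c")
assert a_to_c & (UNIVERSE - {"b"}) == a_to_c - {"b"} == {"a", "c"}

# [a-z&^by] = [a-z&^b&^y] = [ac-xz]
a_to_z = chars("a", "z")
one_step = a_to_z & (UNIVERSE - {"b", "y"})
two_steps = a_to_z & (UNIVERSE - {"b"}) & (UNIVERSE - {"y"})
assert one_step == two_steps == {"a"} | chars("c", "x") | {"z"}
```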

The specific grammar would be something like:

  CharClass ::= "[" ClassIntersect "]"
  ClassComplement ::= ClassRanges | "^" ClassRanges
  ClassIntersect ::= ClassComplement | ClassIntersect "&" ClassComplement
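
As a rough illustration of how this grammar could be evaluated, here
is a small recursive-descent sketch in Python.  It is not part of the
proposal.  It assumes an ASCII-only universe, treats \x as the literal
x, treats - as a range operator only between two atoms, and does not
handle an escaped ] at the end of the class.  The name
parse_char_class is made up for this example.

```python
# Assumption: ASCII stands in for the full character set.
UNIVERSE = {chr(c) for c in range(128)}


def parse_char_class(src):
    """Return the set of characters matched by a class like "[a-c&^b]"."""
    if len(src) < 2 or src[0] != "[" or src[-1] != "]":
        raise ValueError("expected [ ... ]")
    body = src[1:-1]
    pos = 0

    def atom():
        # One character; a backslash makes the next character a literal.
        nonlocal pos
        if body[pos] == "\\" and pos + 1 < len(body):
            pos += 1
        ch = body[pos]
        pos += 1
        return ch

    def class_ranges():
        # ClassRanges: literals and lo-hi ranges, up to the next '&'.
        nonlocal pos
        out = set()
        while pos < len(body) and body[pos] != "&":
            lo = atom()
            if pos + 1 < len(body) and body[pos] == "-" and body[pos + 1] != "&":
                pos += 1  # skip the '-'
                hi = atom()
                if ord(hi) < ord(lo):
                    raise ValueError("reversed range")
                out |= {chr(c) for c in range(ord(lo), ord(hi) + 1)}
            else:
                out.add(lo)
        return out

    def class_complement():
        # ClassComplement ::= ClassRanges | "^" ClassRanges
        nonlocal pos
        if pos < len(body) and body[pos] == "^":
            pos += 1
            return UNIVERSE - class_ranges()
        return class_ranges()

    # ClassIntersect ::= ClassComplement | ClassIntersect "&" ClassComplement
    result = class_complement()
    while pos < len(body):  # class_ranges stops only at '&' or the end
        pos += 1            # skip the '&'
        result &= class_complement()
    return result


print("".join(sorted(parse_char_class("[a-c&^bb]"))))  # ac
print(parse_char_class("[!@#$%&*()]") == set())        # True
```

Because & is left-associative and ^ applies to a single ClassRanges,
[a-z&^by] and [a-z&^b&^y] evaluate to the same set in this sketch.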

Russ


