regexp char class subtract and intersect

Russ Cox rsc at swtch.com
Thu Apr 5 08:37:56 PDT 2007


[Reposting because I never saw any replies nor any notes
on the wiki, in case this got lost in the shuffle.]

From: Russ Cox <rsc at swtch.com>
Date: Feb 13, 2007 9:48 PM
Subject: regexp char class subtract and intersect
To: es4-discuss at mozilla.org

Regarding the proposal at
http://developer.mozilla.org/es4/proposals/extend_regexps.html
to add intersection and subtraction of character classes,
it certainly sounds like a good idea in principle, but I am not
sure I agree about the particulars of the proposal.  (Of course
if the Unicode properties go off the table then it makes sense
to toss intersection and subtraction as well; I am assuming
that Unicode properties are still a planned addition.)

First, the proposal does not specify the binding precedence
of the new operators: surely [a-c\-b] means [ac], but does [a-c\-bb]
also mean [ac], or does it mean [acb]?  This is hardly a
show-stopper, just a detail missing from the rough sketch of the proposal.
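
To make the two readings concrete, here is a small Python sketch that
models character classes as plain sets of characters.  The helper name
chars is made up for illustration and is not part of any proposal.

```python
def chars(lo, hi):
    """Return the set of characters from lo through hi, inclusive."""
    return {chr(c) for c in range(ord(lo), ord(hi) + 1)}

a_to_c = chars("a", "c")

# Reading 1: \- subtracts everything that follows it,
# so [a-c\-bb] is [a-c] minus [bb].
subtract_rest = a_to_c - {"b", "b"}

# Reading 2: \- subtracts only the next item and the trailing b
# is added back by union, so [a-c\-bb] is ([a-c] minus [b]) plus [b].
subtract_one = (a_to_c - {"b"}) | {"b"}

print(sorted(subtract_rest))  # ['a', 'c']
print(sorted(subtract_one))   # ['a', 'b', 'c']
```

The two readings give different sets, which is why the precedence
needs to be written down.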

More importantly, the proposal breaks the nice regexp syntax property,
inherited from egrep, that escaped punctuation always denotes
the corresponding literal (and that non-punctuation is always a
literal unless escaped).  Adding \- and \& as operators breaks
the rule and adds confusion.  Making \- special is particularly bad,
because - is already special in character classes.  I find the - rules
weird enough that I routinely use \- in character classes when
I want the literal so I don't have to remember when - is special and
when it isn't.
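
As an illustration of the existing rule, the sketch below uses Python's
re module rather than an ECMAScript engine; the assumption is only that
it follows the same egrep-derived convention for - inside a character
class.  The helper name is_match is hypothetical.

```python
import re

def is_match(pattern, text):
    """Return True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

# Unescaped - between two characters forms a range.
print(is_match(r"[a-c]", "b"))    # True

# Escaped \- is a literal hyphen wherever it appears in the class.
print(is_match(r"[a\-c]", "-"))   # True
print(is_match(r"[a\-c]", "b"))   # False

# Unescaped - at the start or end of the class is also a literal,
# which is the position-dependent rule that is easy to forget.
print(is_match(r"[-ac]", "-"))    # True
print(is_match(r"[ac-]", "-"))    # True
```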

I suggest using unescaped & as the conjunction operator,
to preserve the property that \punctuation is always a literal.
It is true that this would affect existing character classes that
list &, but these can be diagnosed by the compiler as an
empty character class: [!@#$%&*()] is obviously (to the compiler)
a mistake, because the intersection of [!@#$%] and [*()] is empty.
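
A rough sketch of such a diagnostic, in Python.  It assumes the class
body holds only literal characters and unescaped & (no ranges, escapes,
or ^), and the function name and warning text are invented for
illustration.

```python
def intersect_literal_class(body):
    """Evaluate a class body under the suggested meaning of unescaped &.

    Simplifying assumption: body contains only literal characters
    and &, with no ranges, escapes, or ^.
    """
    operands = [set(part) for part in body.split("&")]
    result = set(operands[0])
    for operand in operands[1:]:
        result &= operand
    return result

legacy = "!@#$%&*()"
if "&" in legacy and not intersect_literal_class(legacy):
    # The class can never match, so it is almost certainly old code
    # that meant a literal &.
    print("warning: [%s] matches nothing; use \\& for a literal &" % legacy)
```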

I also suggest not adding an explicit subtraction operator but
instead using intersection with the complement, since there is
already a complement operator ^: [a-c&^bb] = [a-c&^b] = [ac],
[a-z&^by] = [a-z&^b&^y] = [ac-xz], etc.
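
The identities above can be checked with ordinary set arithmetic.  This
Python sketch assumes a toy universe of the lowercase ASCII letters so
that the complement is finite; UNIVERSE and complement are names chosen
for the example.

```python
import string

# Assumed alphabet, for illustration only.
UNIVERSE = set(string.ascii_lowercase)

def complement(s):
    """Return every character of UNIVERSE that is not in s."""
    return UNIVERSE - s

a_to_c = set("abc")
a_to_z = set(string.ascii_lowercase)

# [a-c&^b]: intersect [a-c] with everything except b.
print(sorted(a_to_c & complement({"b"})))  # ['a', 'c']

# [a-z&^by] and [a-z&^b&^y] describe the same set, and both
# equal plain subtraction of b and y from [a-z].
one_step = a_to_z & complement({"b", "y"})
two_step = a_to_z & complement({"b"}) & complement({"y"})
print(one_step == two_step == a_to_z - {"b", "y"})  # True
```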

The specific grammar would be something like:

  CharClass ::= "[" ClassIntersect "]"
  ClassComplement ::= ClassRanges | "^" ClassRanges
  ClassIntersect ::= ClassComplement | ClassIntersect "&" ClassComplement
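
As a sanity check on the grammar, here is a minimal Python sketch that
evaluates a class written in this syntax to a set of characters.  It is
not an implementation of any real engine: the universe is assumed to be
printable ASCII, ClassRanges is assumed to contain only literals, x-y
ranges, and backslash escapes of single characters, and every function
name is invented for the example.

```python
# Assumption: printable ASCII stands in for the full character set.
UNIVERSE = {chr(c) for c in range(32, 127)}

def read_atom(body, i):
    """Return (char, next_index), treating \\x as the literal x."""
    if body[i] == "\\":
        return body[i + 1], i + 2
    return body[i], i + 1

def parse_ranges(body):
    """ClassRanges: literal characters and lo-hi ranges."""
    out, i = set(), 0
    while i < len(body):
        lo, i = read_atom(body, i)
        # An unescaped - followed by another atom forms a range.
        if i + 1 < len(body) and body[i] == "-":
            hi, i = read_atom(body, i + 1)
            out |= {chr(c) for c in range(ord(lo), ord(hi) + 1)}
        else:
            out.add(lo)
    return out

def split_unescaped(body, sep="&"):
    """Split body on sep, leaving backslash-escaped characters alone."""
    parts, cur, i = [], "", 0
    while i < len(body):
        if body[i] == "\\" and i + 1 < len(body):
            cur += body[i:i + 2]
            i += 2
        elif body[i] == sep:
            parts.append(cur)
            cur = ""
            i += 1
        else:
            cur += body[i]
            i += 1
    parts.append(cur)
    return parts

def parse_complement(part):
    """ClassComplement: ClassRanges | '^' ClassRanges."""
    if part.startswith("^"):
        return UNIVERSE - parse_ranges(part[1:])
    return parse_ranges(part)

def parse_class(pattern):
    """CharClass: '[' ClassIntersect ']', folding & left to right."""
    if not (pattern.startswith("[") and pattern.endswith("]")):
        raise ValueError("expected [...]")
    result = set(UNIVERSE)
    for part in split_unescaped(pattern[1:-1]):
        result &= parse_complement(part)
    return result

print(sorted(parse_class("[a-c&^bb]")))                      # ['a', 'c']
print(parse_class("[a-z&^by]") == parse_class("[ac-xz]"))    # True
print(parse_class("[!@#$%&*()]") == set())                   # True
```

With this reading, an escaped \& stays a literal, so
parse_class(r"[!@#$%\&*()]") yields the nine punctuation characters
listed.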

Russ


