regexp char class subtract and intersect

Lars T Hansen lth at acm.org
Mon May 7 05:44:44 PDT 2007


These are all good points.  Java
(http://java.sun.com/javase/6/docs/api/, see java.util.regex) solves
all these problems in the following way:

- && in a charset means intersection; this sequence is unlikely to
appear in existing expressions except by mistake, since a single &
always suffices

- subtraction is handled through intersection with a complemented
embedded charset, e.g., [a-c&&[^cde]].
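
For reference, a minimal sketch of the Java behaviour described above,
using only java.util.regex.Pattern.  The class name CharClassDemo and
the helper matches() are made up for this illustration.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // Returns true if the whole input matches the given regular expression.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Intersection: [a-z&&[aeiou]] accepts only lowercase vowels.
        System.out.println(matches("[a-z&&[aeiou]]", "e")); // true
        System.out.println(matches("[a-z&&[aeiou]]", "b")); // false

        // Subtraction written as intersection with a complemented,
        // embedded charset: [a-c&&[^cde]] is a-c minus c, d, e, i.e. [ab].
        System.out.println(matches("[a-c&&[^cde]]", "a")); // true
        System.out.println(matches("[a-c&&[^cde]]", "b")); // true
        System.out.println(matches("[a-c&&[^cde]]", "c")); // false
    }
}
```

Note that the complemented set has to be written as a nested [...],
which is the part that complicates lexing.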

I think we perceived Java's solution as inelegant when we devised
what's now proposed.  Subtraction-by-complement-and-intersection is
unpleasant, especially because the nested character sets make regular
expression lexing a little messier.  (As it is, and by design, regular
expression literals can be lexed by a DFA when the program is parsed;
only the regular expression compiler needs to handle the context-free
language.  Allowing nested sets breaks this.)
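
To make the lexing point concrete, here is a rough sketch of a
finite-state scanner that finds the end of a regular expression
literal.  It is an illustration, not the proposal's lexer: the names
RegexLiteralScanner and findLiteralEnd are hypothetical, it assumes
that a / inside [...] does not terminate the literal, and it ignores
line terminators and flags.

```java
class RegexLiteralScanner {
    // A fixed, finite set of states: no counter or stack is needed
    // as long as character classes cannot nest.
    enum State { BODY, BODY_ESCAPE, CLASS, CLASS_ESCAPE }

    // Returns the index of the '/' that closes the literal whose opening
    // '/' is at index start, or -1 if the literal is unterminated.
    static int findLiteralEnd(String src, int start) {
        State state = State.BODY;
        for (int i = start + 1; i < src.length(); i++) {
            char c = src.charAt(i);
            switch (state) {
                case BODY:
                    if (c == '\\') state = State.BODY_ESCAPE;
                    else if (c == '[') state = State.CLASS;
                    else if (c == '/') return i;
                    break;
                case BODY_ESCAPE:
                    // The escaped character is consumed without inspection.
                    state = State.BODY;
                    break;
                case CLASS:
                    if (c == '\\') state = State.CLASS_ESCAPE;
                    else if (c == ']') state = State.BODY;
                    // A '[' or '/' here is treated as an ordinary member.
                    break;
                case CLASS_ESCAPE:
                    state = State.CLASS;
                    break;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(findLiteralEnd("/a[/]b/", 0));      // 6
        System.out.println(findLiteralEnd("/[a&&[^c]/]/", 0)); // 9
    }
}
```

With flat classes the four states suffice.  With nested sets, a literal
such as /[a&&[^c]/]/ is mis-scanned: the scanner leaves the class at
the first ] and stops at index 9, whereas the literal really ends at
index 11.  Handling that correctly means tracking nesting depth, which
needs a counter rather than a fixed set of states.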

I'll put this item on the agenda for the next committee meeting.

Sorry for taking so long to answer,
--lars

On 4/5/07, Russ Cox <rsc at swtch.com> wrote:

> Regarding the proposal at
> http://developer.mozilla.org/es4/proposals/extend_regexps.html
> to add intersection and subtraction of character classes,
> it certainly sounds like a good idea in principle, but I am not
> sure I agree about the particulars of the proposal.  (Of course
> if the Unicode properties go off the table then it makes sense
> to toss intersection and subtraction as well; I am assuming
> that Unicode properties are still a planned addition.)
>
> First, the proposal does not specify the binding precedences
> of the new operators: surely [a-c\-b] means [ac] but does [a-c\-bb]
> also mean [ac] or does it mean [acb]?  Etc.  This is hardly a
> show-stopper, just a detail missing from the rough sketch proposal.
>
> More importantly the proposal breaks the nice regexp syntax property,
> inherited from egrep, that escaped punctuation always denotes
> the corresponding literal (and that non-punctuation is always a
> literal unless escaped).  Adding \- and \& as operators breaks
> the rule and adds confusion.  Making \- special is particularly bad,
> because - is already special in character classes.  I find the - rules
> weird enough that I routinely use \- in character classes when
> I want the literal so I don't have to remember when - is special and
> when it isn't.
>
> I suggest using unescaped & as the conjunction operator,
> to preserve the property that \punctuation is always a literal.
> It is true that this would affect existing character classes that
> list &, but these can be diagnosed by the compiler as an
> empty character class: [!@#$%&*()] is obviously (to the compiler)
> a mistake, because the intersection is empty.
>
> I also suggest not adding an explicit subtraction operator but
> instead using intersection with the complement, since there is
> already a complement operator ^: [a-c&^bb] = [a-c&^b] = [ac],
> [a-z&^by] = [a-z&^b&^y] = [ac-xz], etc.
>
> The specific grammar would be something like:
>
>   CharClass ::= "[" ClassIntersect "]"
>   ClassComplement ::= ClassRanges | "^" ClassRanges
>   ClassIntersect ::= ClassComplement | ClassIntersect "&" ClassComplement
>
> Russ