regexp char class subtract and intersect

Russ Cox rsc at swtch.com
Mon May 7 06:22:28 PDT 2007


> These are all good points.  Java
> (http://java.sun.com/javase/6/docs/api/, see java.util.regex) solves
> all these problems in the following way:
> 
> -  && in a charset means intersection; this sequence is not likely to
> be used except in error since a single & always suffices
> 
> - subtraction is handled through intersection with a complemented
> embedded charset, eg, [a-c&&[^cde]].
> 
> I think we perceived Java's solution as inelegant when we devised
> what's now proposed.  Subtraction-by-complement-and-intersection in
> particular is unpleasant, especially because the nested character
> sets make regular expression lexing a little messier.  (As it is,
> and by design, regular expression literals
> can be lexed by a DFA when the program is parsed; only the regular
> expression compiler needs to handle the context-free language.
> Allowing nested sets breaks this.)

That's very interesting, thanks.  Perhaps it would suffice
to drop the [ ], e.g., [a-c&&^cde].  But perhaps that would
be too close to regular Java not to go the rest of the way.
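
For concreteness, here is a minimal sketch of the Java form quoted
above, [a-c&&[^cde]], using java.util.regex.  The class and method
names are made up for illustration.  It only exercises the existing
Java syntax, not the bracket-free variant suggested above.

```java
import java.util.regex.Pattern;

final class CharClassSubtraction {
    // [a-c] intersected with the complement of [cde], i.e. just 'a' and 'b'.
    static final Pattern A_TO_C_MINUS_CDE = Pattern.compile("[a-c&&[^cde]]");

    // True if the whole string is a single character in the resulting set.
    static boolean matches(String s) {
        return A_TO_C_MINUS_CDE.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"a", "b", "c", "d", "z"}) {
            System.out.println(s + " -> " + matches(s));
        }
    }
}
```

Here 'c' is rejected because it falls in the subtracted set, and 'd'
and 'z' are rejected because they were never in [a-c].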

I don't understand the final parenthetical.
Regular expressions already have nested parens,
so I don't understand how nested [ ] makes them 
harder to parse -- I thought that in both cases one
just looked for the terminating character, skipping
escapes and not checking nesting.  Unless /[/]/ is a
valid way to match a slash, I don't see why the lexer
needs to care whether the [ ] are properly nested.
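
To make the question concrete, here is a rough sketch of the two
scanning strategies I have in mind.  The names are hypothetical and
this is not taken from any actual lexer.  Both functions look for
the '/' that ends a literal; the first ignores [ ] entirely, the
second keeps a single "inside a class" flag (not a nesting depth).

```java
final class RegexLiteralScan {
    // Assumes src.charAt(0) is the opening '/'.  Returns the index of the
    // closing '/', or -1 if there is none.  Brackets get no special treatment.
    static int endIgnoringBrackets(String src) {
        for (int i = 1; i < src.length(); i++) {
            char c = src.charAt(i);
            if (c == '\\') {
                i++;              // skip the escaped character
            } else if (c == '/') {
                return i;
            }
        }
        return -1;
    }

    // Same as above, but a '/' between '[' and the next ']' does not end the
    // literal.  A single flag is kept, so nested sets are not tracked.
    static int endTrackingClass(String src) {
        boolean inClass = false;
        for (int i = 1; i < src.length(); i++) {
            char c = src.charAt(i);
            if (c == '\\') {
                i++;              // skip the escaped character
            } else if (c == '[') {
                inClass = true;
            } else if (c == ']') {
                inClass = false;
            } else if (c == '/' && !inClass) {
                return i;
            }
        }
        return -1;
    }
}
```

The two agree on inputs like /(a(b))/ and /a\/b/, and differ only
when an unescaped '/' appears inside [ ], as in /[/]/.  The second
version still needs only one extra bit of state; a nested set that
contained an unescaped '/' after an inner ] would need a depth
count instead.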

Russ



