regexp char class subtract and intersect
rsc at swtch.com
Mon May 7 06:22:28 PDT 2007
> These are all good points. Java
> (http://java.sun.com/javase/6/docs/api/, see java.util.regex) solves
> all these problems in the following way:
> - && in a charset means intersection; this sequence is not likely to
> be used except in error since a single & always suffices
> - subtraction is handled through intersection with a complemented
> embedded charset, eg, [a-c&&[^cde]].
> I think we perceived Java's solution as inelegant when we devised
> what's now proposed. In particular,
> subtraction-by-complement-and-intersection is unpleasant, especially
> because the nested character sets make regular expression lexing a
> little messier. (As it is, and by design, regular expression literals
> can be lexed by a DFA when the program is parsed; only the regular
> expression compiler needs to handle the context-free language.
> Allowing nested sets breaks this.)
That's very interesting, thanks. Perhaps it would suffice
to drop the [ ], e.g., [a-c&&^cde]. But perhaps that would
be too close to regular Java not to go the rest of the way.
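
For reference, here is a small sketch of the Java behavior described in
the quoted text, using java.util.regex.Pattern. The class name
CharClassDemo and the helper matches() are made up for illustration;
only the bracketed Java form is shown, since the bracket-free variant
suggested above is a proposal, not something this sketch tests.

```java
import java.util.regex.Pattern;

public class CharClassDemo {
    // Intersection: letters in a-z that are also in d-m.
    static final Pattern INTERSECT = Pattern.compile("[a-z&&[d-m]]");

    // Subtraction via intersection with a complemented nested class:
    // a-c minus {c, d, e} leaves only a and b.
    static final Pattern SUBTRACT = Pattern.compile("[a-c&&[^cde]]");

    // Hypothetical helper: true if the whole string matches the pattern.
    static boolean matches(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches(SUBTRACT, "a"));  // true
        System.out.println(matches(SUBTRACT, "b"));  // true
        System.out.println(matches(SUBTRACT, "c"));  // false
        System.out.println(matches(INTERSECT, "e")); // true
        System.out.println(matches(INTERSECT, "x")); // false
    }
}
```
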
I don't understand the final parenthetical.
Regular expressions already have nested parens,
so I don't understand how nested [ ] makes them
harder to parse -- I thought that in both cases one
just looked for the terminating character, skipping
escapes and not checking nesting. Unless /[/]/ is a
valid way to match a slash, I don't see why the lexer
needs to care whether the [ ] are properly nested.
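
To make the question concrete, here is a rough sketch of the kind of
scan I have in mind, written in Java only for illustration. The names
RegexLiteralScan, findEnd, and classAware are invented here and are not
taken from any proposal or implementation. With classAware off, the
scan stops at the first unescaped slash and never looks at [ ] or ( ).
With it on, a slash between [ and ] does not end the literal, which is
the only case where the lexer would have to notice brackets at all.

```java
public class RegexLiteralScan {
    /**
     * Returns the index of the '/' that closes a regex literal whose
     * opening '/' is at src.charAt(start), or -1 if there is none.
     * When classAware is true, a '/' between '[' and ']' is treated
     * as an ordinary character rather than as the terminator.
     */
    static int findEnd(String src, int start, boolean classAware) {
        boolean inClass = false; // one flag, not a depth counter
        for (int i = start + 1; i < src.length(); i++) {
            char c = src.charAt(i);
            if (c == '\\') {
                i++; // skip the escaped character, whatever it is
            } else if (classAware && c == '[') {
                inClass = true;
            } else if (classAware && c == ']') {
                inClass = false;
            } else if (c == '/' && !inClass) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(findEnd("/[/]/", 0, false));   // 2
        System.out.println(findEnd("/[/]/", 0, true));    // 4
        System.out.println(findEnd("/(a(b))/", 0, false)); // 7
    }
}
```

Note that the class-aware variant uses a single flag, so it is still a
finite-state scan. If sets could nest and a slash inside them had to be
ignored, the flag would have to become a counter.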