regexp char class subtract and intersect
Lars T Hansen
lth at acm.org
Mon May 7 06:41:06 PDT 2007
On 5/7/07, Russ Cox <rsc at swtch.com> wrote:
> > These are all good points. Java
> > (http://java.sun.com/javase/6/docs/api/, see java.util.regex) solves
> > all these problems in the following way:
> > - && in a charset means intersection; this sequence is not likely to
> > be used except in error since a single & always suffices
> > - subtraction is handled through intersection with a complemented
> > embedded charset, e.g., [a-c&&[^cde]].
> > I think we perceived Java's solution as inelegant when we devised
> > what's now proposed. In particular,
> > subtraction-by-complement-and-intersection is unpleasant, because
> > the nested character sets make regular expression lexing a
> > little messier. (As it is, and by design, regular expression literals
> > can be lexed by a DFA when the program is parsed; only the regular
> > expression compiler needs to handle the context-free language.
> > Allowing nested sets breaks this.)
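For reference, a minimal sketch of the Java behaviour described above, using
java.util.regex.Pattern. The class name CharsetOps, its two constants, and the
helper matches() are made up for this illustration; only the two pattern
strings reflect the Java syntax under discussion.

```java
import java.util.regex.Pattern;

class CharsetOps {
    // Intersection: only characters that are in both a-z and the nested set.
    static final Pattern VOWELS = Pattern.compile("[a-z&&[aeiou]]");

    // Subtraction expressed as intersection with a complemented nested set:
    // a-c minus {c, d, e}, which leaves a and b.
    static final Pattern A_OR_B = Pattern.compile("[a-c&&[^cde]]");

    // Hypothetical helper: true if the whole string matches the pattern.
    static boolean matches(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches(A_OR_B, "a")); // true
        System.out.println(matches(A_OR_B, "c")); // false
        System.out.println(matches(VOWELS, "e")); // true
        System.out.println(matches(VOWELS, "x")); // false
    }
}
```

Note that the subtraction case only works because the inner [^cde] is itself a
bracketed set, which is the nesting being objected to.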
> That's very interesting, thanks. Perhaps it would suffice
> to drop the [ ], e.g., [a-c&&^cde]. But perhaps that would
> be too close to regular Java not to go the rest of the way.
> I don't understand the final parenthetical.
> Regular expressions already have nested parens,
> so I don't understand how nested [ ] makes them
> harder to parse -- I thought that in both cases one
> just looked for the termination character ignoring
> escapes and checking of nesting. Unless /[/]/ is a
> valid way to match a slash, I don't see why the lexer
> needs to care whether the [ ] are properly nested.
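That's precisely the problem. The charset [.../...] was not legal in
ES3; the / needed to be escaped. But MSIE allows it and the web uses
it, so the other browsers have followed, and it will be legal in ES4.
So the lexer must know whether it is inside a charset when it sees a /,
and with nested charsets it would have to track the nesting as well.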
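
To make the lexing issue concrete, here is a rough sketch of how a scanner
might find the end of a regular expression literal when an unescaped / is
allowed inside a flat (non-nested) charset. It is not taken from any real
implementation or from the spec; the class RegexLiteralScanner and the method
findEnd are names invented for this example. It needs only one bit of state
(inside a charset or not), so it stays within what a DFA can do.

```java
class RegexLiteralScanner {
    /**
     * Assumes src.charAt(start) is the opening '/'. Returns the index of
     * the closing '/', or -1 if the literal is unterminated.
     */
    static int findEnd(String src, int start) {
        boolean inClass = false;              // the only state we keep
        for (int i = start + 1; i < src.length(); i++) {
            char c = src.charAt(i);
            if (c == '\\') { i++; continue; } // skip the escaped character
            if (c == '\n') return -1;         // literals do not span lines
            if (inClass) {
                if (c == ']') inClass = false; // first ']' closes the set
            } else if (c == '[') {
                inClass = true;
            } else if (c == '/') {
                return i;                      // unescaped '/' outside a set
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(findEnd("/[/]/", 0));       // 4
        System.out.println(findEnd("/a\\/b/", 0));     // 5
        // With nested sets the literal below should end at index 10, but
        // this flat scanner stops at 8, after the inner ']'.
        System.out.println(findEnd("/[a&&[b]/]/", 0)); // 8
    }
}
```

Getting the last case right would mean replacing the boolean with a nesting
depth counter, and unbounded counting is exactly what a DFA cannot do.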