regexp char class subtract and intersect
Lars T Hansen
lth at acm.org
Mon May 7 06:41:06 PDT 2007
On 5/7/07, Russ Cox <rsc at swtch.com> wrote:
> > These are all good points. Java
> > (http://java.sun.com/javase/6/docs/api/, see java.util.regex) solves
> > all these problems in the following way:
> >
> > - && in a charset means intersection; this sequence is not likely to
> > be used except in error since a single & always suffices
> >
> > - subtraction is handled through intersection with a complemented
> > embedded charset, e.g., [a-c&&[^cde]].
> >
> > I think we perceived Java's solution as inelegant when we devised
> > what's now proposed. In particular,
> > subtraction-by-complement-and-intersection is unpleasant, especially
> > because the nested character sets make regular expression lexing a
> > little messier. (As it is, and by design, regular expression literals
> > can be lexed by a DFA when the program is parsed; only the regular
> > expression compiler needs to handle the context-free language.
> > Allowing nested sets breaks this.)
>
> That's very interesting, thanks. Perhaps it would suffice
> to drop the [ ], e.g., [a-c&&^cde]. But perhaps that would
> be too close to regular Java not to go the rest of the way.
>
> I don't understand the final parenthetical.
> Regular expressions already have nested parens,
> so I don't understand how nested [ ] makes them
> harder to parse -- I thought that in both cases one
> just looked for the termination character, skipping
> escapes and not checking nesting. Unless /[/]/ is a
> valid way to match a slash, I don't see why the lexer
> needs to care whether the [ ] are properly nested.
That's precisely the problem. The charset [.../...] was not legal in
ES3; the / needed to be escaped. But MSIE allows it and the web uses
it, so the other browsers have followed, and it will be legal in ES4.
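To make the lexing point concrete, here is a minimal sketch in Java of how a scanner might find the end of a regular expression literal. It is an illustration only, not code from any ES3 or ES4 implementation; the class name RegexLiteralScan, the method findEnd, and the classAware flag are all hypothetical. With classAware off, the first unescaped / ends the literal; with it on, a / inside [ ] is treated as literal, which is the behaviour described above for MSIE.

```java
// Hypothetical sketch: locate the '/' that terminates a regex literal.
class RegexLiteralScan {
    // `start` is the index just past the opening '/'.
    // Returns the index of the terminating '/', or -1 if there is none.
    static int findEnd(String src, int start, boolean classAware) {
        boolean inClass = false; // one bit of state: are we inside [ ]?
        for (int i = start; i < src.length(); i++) {
            char c = src.charAt(i);
            if (c == '\\') {
                i++; // skip the escaped character
            } else if (classAware && c == '[') {
                inClass = true;
            } else if (classAware && c == ']') {
                inClass = false;
            } else if (c == '/' && !inClass) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String src = "/[/]/";
        System.out.println(findEnd(src, 1, false)); // prints 2: the '/' inside [ ] ends the literal
        System.out.println(findEnd(src, 1, true));  // prints 4: the '/' inside [ ] is literal
    }
}
```

The single inClass flag is all the state a flat charset needs. If sets could nest, as in [[a]/], the first ] would clear the flag and this scan would stop at the inner / (index 5 in "/[[a]/]/"); getting that right needs a depth counter rather than one bit.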
--lars
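For readers who want to try the Java syntax quoted earlier in this message, here is a short sketch using java.util.regex. The class name CharClassOps and the helper matchesOne are made up for illustration; Pattern.matches is the standard library call. It exercises the subtraction example from the thread, [a-c&&[^cde]], plus a plain intersection.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; only Pattern.matches comes from the Java library.
class CharClassOps {
    // True if the single character c is matched by the given character class.
    static boolean matchesOne(String charClass, char c) {
        return Pattern.matches(charClass, String.valueOf(c));
    }

    public static void main(String[] args) {
        // Subtraction via intersection with a complemented set: a-c minus c, d, e.
        String subtract = "[a-c&&[^cde]]";
        System.out.println(matchesOne(subtract, 'a')); // true
        System.out.println(matchesOne(subtract, 'b')); // true
        System.out.println(matchesOne(subtract, 'c')); // false

        // Plain intersection: only characters in both sets.
        String intersect = "[a-z&&[def]]";
        System.out.println(matchesOne(intersect, 'e')); // true
        System.out.println(matchesOne(intersect, 'g')); // false
    }
}
```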