regexp char class subtract and intersect

Lars T Hansen lth at acm.org
Thu Jun 14 13:30:47 PDT 2007


I'm actively at work on this, and I'm very mindful of the
compatibility problem.  Most likely we'll follow Java, which should
not be a problem.  More soon.

--lars

On 6/14/07, Russ Cox <rsc at swtch.com> wrote:
> Sorry for continuing to bring this up, but did anything
> get resolved here?  I'm still hoping that ES4 won't
> break the "backslash before punctuation is always
> a literal" rule followed in Perl, PCRE, and many other
> modern regexp packages.
>
> Using either && and &&^, or & and &^, in place of the
> current \& and \- (with an error if CharsetIntersect
> returns an empty set) would avoid this problem.
>
> Russ
>
>
> On 5/7/07, Lars T Hansen <lth at acm.org> wrote:
> > These are all good points.  Java
> > (http://java.sun.com/javase/6/docs/api/, see java.util.regex) solves
> > all these problems in the following way:
> >
> > - && in a charset means intersection; this sequence is not likely to
> > be used except in error, since a single & always suffices
> >
> > - subtraction is handled through intersection with a complemented
> > embedded charset, e.g., [a-c&&[^cde]].
> >
> > I think we perceived Java's solution as inelegant when we devised
> > what's now proposed.  In particular,
> > subtraction-by-complement-and-intersection is unpleasant, mainly
> > because the nested character sets make regular expression lexing a
> > little messier.  (As it is, and by design, regular expression literals
> > can be lexed by a DFA when the program is parsed; only the regular
> > expression compiler needs to handle the context-free language.
> > Allowing nested sets breaks this.)
> >
> > I'll put this item on the agenda for the next committee meeting.
> >
> > Sorry for taking so long to answer,
> > --lars
> >
> > On 4/5/07, Russ Cox <rsc at swtch.com> wrote:
> >
> > > Regarding the proposal at
> > > http://developer.mozilla.org/es4/proposals/extend_regexps.html
> > > to add intersection and subtraction of character classes,
> > > it certainly sounds like a good idea in principle, but I am not
> > > sure I agree about the particulars of the proposal.  (Of course
> > > if the Unicode properties go off the table then it makes sense
> > > to toss intersection and subtraction as well; I am assuming
> > > that Unicode properties are still a planned addition.)
> > >
> > > First, the proposal does not specify the binding precedences
> > > of the new operators: surely [a-c\-b] means [ac] but does [a-c\-bb]
> > > also mean [ac] or does it mean [acb]?  Etc.  This is hardly a
> > > show-stopper, just a detail missing from the rough sketch proposal.
> > >
> > > More importantly the proposal breaks the nice regexp syntax property,
> > > inherited from egrep, that escaped punctuation always denotes
> > > the corresponding literal (and that non-punctuation is always a
> > > literal unless escaped).  Adding \- and \& as operators breaks
> > > the rule and adds confusion.  Making \- special is particularly bad,
> > > because - is already special in character classes.  I find the - rules
> > > weird enough that I routinely use \- in character classes when
> > > I want the literal so I don't have to remember when - is special and
> > > when it isn't.
> > >
> > > I suggest using unescaped & as the conjunction operator,
> > > to preserve the property that \punctuation is always a literal.
> > > It is true that this would affect existing character classes that
> > > list &, but these can be diagnosed by the compiler as an
> > > empty character class: [!@#$%&*()] is obviously (to the compiler)
> > > a mistake, because the intersection is empty.
> > >
> > > I also suggest not adding an explicit subtraction operator but
> > > instead using intersection with the complement, since there is
> > > already a complement operator ^: [a-c&^bb] = [a-c&^b] = [ac],
> > > [a-z&^by] = [a-z&^b&^y] = [ac-xz], etc.
> > >
> > > The specific grammar would be something like:
> > >
> > >   CharClass ::= "[" ClassIntersect "]"
> > >   ClassComplement ::= ClassRanges | "^" ClassRanges
> > >   ClassIntersect ::= ClassComplement | ClassIntersect "&" ClassComplement
> > >
> > > Russ
> > > _______________________________________________
> > > Es4-discuss mailing list
> > > Es4-discuss at mozilla.org
> > > https://mail.mozilla.org/listinfo/es4-discuss
> > >
> >
> >
>
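
The grammar in the quoted 4/5/07 message (unescaped & as intersection, a
leading ^ as complement of each operand, escaped punctuation always
literal) can be illustrated with a minimal sketch.  This is only one
reading of that grammar, not anything from the ES4 proposal: the names
char_class, split_terms, parse_ranges and UNIVERSE are hypothetical, and
the universe is assumed to be printable ASCII so that complements stay
small.  An empty result is reported as an error, as the message suggests.

```python
# Assumption: printable ASCII stands in for the full character universe.
UNIVERSE = frozenset(chr(c) for c in range(0x20, 0x7F))


def split_terms(body):
    """Split a class body on unescaped '&', leaving escapes untouched."""
    terms, current, i = [], [], 0
    while i < len(body):
        if body[i] == "\\" and i + 1 < len(body):
            current.append(body[i:i + 2])  # keep the escape pair together
            i += 2
        elif body[i] == "&":
            terms.append("".join(current))
            current = []
            i += 1
        else:
            current.append(body[i])
            i += 1
    terms.append("".join(current))
    return terms


def parse_ranges(text):
    """Expand ClassRanges; a backslash makes the next character literal."""
    tokens, i = [], 0
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text):
            tokens.append((text[i + 1], True))
            i += 2
        else:
            tokens.append((text[i], False))
            i += 1
    chars, i = set(), 0
    while i < len(tokens):
        lo = tokens[i][0]
        # Only an unescaped '-' between two characters forms a range.
        if i + 2 < len(tokens) and tokens[i + 1] == ("-", False):
            hi = tokens[i + 2][0]
            chars.update(chr(c) for c in range(ord(lo), ord(hi) + 1))
            i += 3
        else:
            chars.add(lo)
            i += 1
    return chars


def char_class(pattern):
    """Evaluate "[" ClassIntersect "]" to a set of characters."""
    if not (pattern.startswith("[") and pattern.endswith("]")):
        raise ValueError("expected a bracketed character class")
    result = set(UNIVERSE)
    for term in split_terms(pattern[1:-1]):
        if term.startswith("^"):
            result &= UNIVERSE - parse_ranges(term[1:])  # complement
        else:
            result &= parse_ranges(term)
    if not result:
        raise ValueError("empty character class: " + pattern)
    return result
```

Under these assumptions, char_class("[a-c&^bb]") and char_class("[a-c&^b]")
both give the set {a, c}, while char_class("[!@#$%&*()]") raises ValueError
because the two operands share no character.  Writing \& instead keeps the
ampersand literal and the class non-empty.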


