regexp char class subtract and intersect

Russ Cox rsc at swtch.com
Thu Jun 14 12:15:28 PDT 2007


Sorry for continuing to bring this up, but did anything
get resolved here?  I'm still hoping that ES4 won't
break the "backslash before punctuation is always
a literal" rule followed in Perl, PCRE, and many other
modern regexp packages.

Either of these would avoid the problem: using && and &&^
in place of the currently proposed \& and \-, or using
& and &^ in their place (with an error if CharsetIntersect
returns an empty set).
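
To make the second option concrete, here is a rough Python sketch of a
character class reader that treats unescaped & as intersection and a
leading ^ in each term as complement, following the grammar quoted
further down.  It is only an illustration under some assumptions: the
universe is ASCII, a backslash makes the next character a literal (no
\d, \w, and so on), and the names tokenize, expand, and parse_class are
made up for this example rather than taken from any proposal or library.

```python
# Hypothetical sketch; every name here is illustrative only.
UNIVERSE = frozenset(chr(c) for c in range(128))  # assumption: ASCII only


def tokenize(body):
    # Turn the class body into (char, escaped) pairs.
    # Assumption: backslash + any character is that literal character.
    atoms = []
    i = 0
    while i < len(body):
        if body[i] == "\\":
            if i + 1 >= len(body):
                raise ValueError("trailing backslash")
            atoms.append((body[i + 1], True))
            i += 2
        else:
            atoms.append((body[i], False))
            i += 1
    return atoms


def expand(atoms):
    # ClassRanges: single characters and lo-hi ranges.
    chars = set()
    j = 0
    while j < len(atoms):
        # An unescaped '-' with an atom on each side denotes a range.
        if j + 2 < len(atoms) and atoms[j + 1] == ("-", False):
            lo, hi = atoms[j][0], atoms[j + 2][0]
            if lo > hi:
                raise ValueError("bad range: %s-%s" % (lo, hi))
            chars.update(chr(c) for c in range(ord(lo), ord(hi) + 1))
            j += 3
        else:
            chars.add(atoms[j][0])
            j += 1
    return chars


def parse_class(src):
    # CharClass ::= "[" ClassIntersect "]"
    if len(src) < 2 or not (src.startswith("[") and src.endswith("]")):
        raise ValueError("not a character class: " + src)
    # ClassIntersect: split the body on unescaped '&'.
    terms = [[]]
    for atom in tokenize(src[1:-1]):
        if atom == ("&", False):
            terms.append([])
        else:
            terms[-1].append(atom)
    result = set(UNIVERSE)
    for term in terms:
        # ClassComplement: an unescaped leading '^' complements the term.
        negate = bool(term) and term[0] == ("^", False)
        chars = expand(term[1:] if negate else term)
        result &= (UNIVERSE - chars) if negate else chars
    if not result:
        # Diagnose old-style classes that list a bare '&'.
        raise ValueError("empty character class: " + src)
    return result
```

With this reading, parse_class("[a-c&^bb]") gives the set {a, c}, an
escaped \& stays a literal, and [!@#$%&*()] is rejected because the
two halves have no characters in common.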

Russ


On 5/7/07, Lars T Hansen <lth at acm.org> wrote:
> These are all good points.  Java
> (http://java.sun.com/javase/6/docs/api/, see java.util.regex) solves
> all these problems in the following way:
>
> - && in a charset means intersection; this sequence is not likely to
> be used except in error since a single & always suffices
>
> - subtraction is handled through intersection with a complemented
> embedded charset, e.g., [a-c&&[^cde]].
>
> I think we perceived Java's solution as inelegant when we devised
> what's now proposed.  In particular,
> subtraction-by-complement-and-intersection is unpleasant, especially
> because the nested character sets make regular expression lexing a
> little messier.  (As it is, and by design, regular expression literals
> can be lexed by a DFA when the program is parsed; only the regular
> expression compiler needs to handle the context-free language.
> Allowing nested sets breaks this.)
>
> I'll put this item on the agenda for the next committee meeting.
>
> Sorry for taking so long to answer,
> --lars
>
> On 4/5/07, Russ Cox <rsc at swtch.com> wrote:
>
> > Regarding the proposal at
> > http://developer.mozilla.org/es4/proposals/extend_regexps.html
> > to add intersection and subtraction of character classes,
> > it certainly sounds like a good idea in principle, but I am not
> > sure I agree about the particulars of the proposal.  (Of course
> > if the Unicode properties go off the table then it makes sense
> > to toss intersection and subtraction as well; I am assuming
> > that Unicode properties are still a planned addition.)
> >
> > First, the proposal does not specify the binding precedences
> > of the new operators: surely [a-c\-b] means [ac] but does [a-c\-bb]
> > also mean [ac] or does it mean [acb]?  Etc.  This is hardly a
> > show-stopper, just a detail missing from the rough sketch proposal.
> >
> > More importantly the proposal breaks the nice regexp syntax property,
> > inherited from egrep, that escaped punctuation always denotes
> > the corresponding literal (and that non-punctuation is always a
> > literal unless escaped).  Adding \- and \& as operators breaks
> > the rule and adds confusion.  Making \- special is particularly bad,
> > because - is already special in character classes.  I find the - rules
> > weird enough that I routinely use \- in character classes when
> > I want the literal so I don't have to remember when - is special and
> > when it isn't.
> >
> > I suggest using unescaped & as the conjunction operator,
> > to preserve the property that \punctuation is always a literal.
> > It is true that this would affect existing character classes that
> > list &, but these can be diagnosed by the compiler as an
> > empty character class: [!@#$%&*()] is obviously (to the compiler)
> > a mistake, because the intersection is empty.
> >
> > I also suggest not adding an explicit subtraction operator but
> > instead using intersection with the complement, since there is
> > already a complement operator ^: [a-c&^bb] = [a-c&^b] = [ac],
> > [a-z&^by] = [a-z&^b&^y] = [ac-xz], etc.
> >
> > The specific grammar would be something like:
> >
> >   CharClass ::= "[" ClassIntersect "]"
> >   ClassComplement ::= ClassRanges | "^" ClassRanges
> >   ClassIntersect ::= ClassComplement | ClassIntersect "&" ClassComplement
> >
> > Russ
> > _______________________________________________
> > Es4-discuss mailing list
> > Es4-discuss at mozilla.org
> > https://mail.mozilla.org/listinfo/es4-discuss
> >
>
>


