Regex: How should backreferences contained in the capturing match they reference to work?

Lars T Hansen lth at acm.org
Mon Sep 3 02:02:57 PDT 2007


On 9/2/07, liorean <liorean at gmail.com> wrote:
>
>     var
>         re=/(a\1)+|b/,
>         a=[],
>         key,
>         aab=re.exec('aab')
>         aaab=re.exec('aaab');
>     for(key in aab)
>          a.push('aab["'+key+'"]: '+aab[key]);
>     for(key in aaab)
>          a.push('aaab["'+key+'"]: '+aaab[key]);
>     a.join('\r\n');

>From the spec it's pretty clear (to me anyhow) that SpiderMonkey and
Opera are correct here (and the ES4 reference implementation, whose
RegEx engine was modelled directly on the spec, agrees).  The
reasoning is that (a) the **undefined** value in the captures array
matches any input without advancing the input pointer [15.10.2.9 step
8 substep 3], (b) every element in the captures array starts out as
**undefined** [15.10.2.2 step 2 substep 4], (c) every element in the
captures array affected by a particular set of capturing parens is
reset to **undefined** every time those parens are entered [15.10.2.5
case "RepeatMatcher" step 4], and (d) the captures array is only
updated after the capturing parens are exited during matching
[15.10.2.8 case "( Disjunction )" step 3 substep 1 subsubstep 5].
Therefore the pattern above is always the same as /(a)+|b/.  Therefore
the results are what they are.

> It seems to me that, looking at the situation from the perspective of
> a user of regex, that there are two ways of tackling this problem that
> makes sense, and they are hinged on whether a captured submatch is
> considered to be undefined or the empty string by default.

In ES3 they are clearly **undefined** by default.

> If the capture is considered to be undefined by default, then the ES3
> solution makes most sense... except for one thing: a capturing
> submatch containing a backreference to the containing capture means a
> guaranteed failure to match in that path, and is pretty easy to detect
> at compile time.

The way I understand the spec it means a guaranteed success.  Still
pretty silly, to be sure.

> Throwing a syntax error at compile time seems more
> appropriate since I hardly think developers want their regex to spend
> time in constructs that cannot possibly match. Alternatively, engines
> can simply throw out that entire path at compile time and just set all
> the contained captures to undefined - behaviour would still be
> identical to what ES3 specifies. I.e. it's a choice between loud or
> silent failure.

Or a vacuous and silent success, as in this case.

I'm not sure what problem you're trying to solve, but my hope is that
once ES4 is widely implemented programmers will switch to named
submatches to avoid the problem with numeric backreferences entirely.
In the mean time, there's obviously room for a good implementation to
warn about ineffectual submatches like these.  Changing the semantics
to an error is not very appealing, on the one hand I doubt it will
break a lot of sites, on the other hand I doubt it will fix many
either.

--lars



More information about the Es4-discuss mailing list