Regex: How should backreferences contained in the capturing match they reference to work?

liorean liorean at gmail.com
Sun Sep 2 05:09:40 PDT 2007


Hello,

I've been playing around with implementing ES3/ES4 regex, (more based
on observation of what the browser hosted engines seem to do than by
the spec, since I'm trying to build a non-backtracking engine) and
noted this case where the browsers don't agree:

    var
        re=/(a\1)+|b/,
        a=[],
        key,
        aab=re.exec('aab')
        aaab=re.exec('aaab');
    for(key in aab)
         a.push('aab["'+key+'"]: '+aab[key]);
    for(key in aaab)
         a.push('aaab["'+key+'"]: '+aaab[key]);
    a.join('\r\n');

JavaScriptCore:
    aab["0"]: b
    aab["index"]: 2
    aab["input"]: aab
    aaab["0"]: b
    aaab["index"]: 3
    aaab["input"]: aaab
JScript:
    aab["input"]: aab
    aab["index"]: 0
    aab["lastIndex"]: 1
    aab["0"]: a
    aab["1"]: a
    aaab["input"]: aaab
    aaab["index"]: 0
    aaab["lastIndex"]: 3
    aaab["0"]: aaa
    aaab["1"]: aa
SpiderMonkey:
    aab["0"]: aa
    aab["1"]: a
    aab["index"]: 0
    aab["input"]: aab
    aaab["0"]: aaa
    aaab["1"]: a
    aaab["index"]: 0
    aaab["input"]: aaab
futhark:
    aab["0"]: aa
    aab["1"]: a
    aab["index"]: 0
    aab["input"]: aab
    aaab["0"]: aaa
    aaab["1"]: a
    aaab["index"]: 0
    aaab["input"]: aaab
linear_b:
    aab["0"]: aa
    aab["1"]: a
    aab["index"]: 0
    aab["input"]: aab
    aaab["0"]: aaa
    aaab["1"]: a
    aaab["index"]: 0
    aaab["input"]: aaab
ES3 spec:
    From my reading of 15.10.2.9 AtomEscape, this matches what
JavaScriptCore does except for that ES3 defines the captures and sets
them to undefined, whereas JavaScriptCore doesn't seem to define them
at all - They are missing from the results rather than included with a
value of undefined.

As you can see, three different ways to interpret the handling:
    JavaScriptCore:
        Full match on 'aaab': 'b',
        First capture: not defined

        My guesson how this parsing works:
            First repetition, capture is not-yet-defined so the
backreference fails to match anything, which means it goes to the
alternate path.

    JScript:
        Full match on 'aaab': 'aaa',
        First capture: 'aa'

        My guess on how this parsing works:
            First repetition, the first capture is '', thus the
submatch is on 'a'+''='a', which is the new value of the first
capture.
            Second repetition, first capture is 'a', thus the submatch
is on 'a'+'a'='aa', which is the new value for the first capture.
            This is consistent with what happens as you add more 'a's
to the string being matched.

    SpiderMonkey, futhark, linear_b:
        Full match on 'aaab': 'aaa',
        First capture: 'a'

        My guess on how this parsing works:
            First repetition, the first capture is set to '', thus the
submatch is on 'a'+''='a', which is the new value of the first
capture.
            Second repetition, first capture is reset to '', thus the
submatch is on 'a'+''='a', which is the new value for the first
capture.
            Third repetition, first capture is reset to '', thus the
submatch is on 'a'+''='a', which is the new value for the first
capture.



It seems to me that, looking at the situation from the perspective of
a user of regex, that there are two ways of tackling this problem that
makes sense, and they are hinged on whether a captured submatch is
considered to be undefined or the empty string by default.


If the capture is considered to be undefined by default, then the ES3
solution makes most sense... except for one thing: a capturing
submatch containing a backreference to the containing capture means a
guaranteed failure to match in that path, and is pretty easy to detect
at compile time. Throwing a syntax error at compile time seems more
appropriate since I hardly think developers want their regex to spend
time in constructs that cannot possibly match. Alternatively, engines
can simply throw out that entire path at compile time and just set all
the contained captures to undefined - behaviour would still be
identical to what ES3 specifies. I.e. it's a choice between loud or
silent failure.

If the capture is considered to be the empty string by default, then
the JScript solution makes most sense. There's little logic in
resetting the capture to the empty string once the submatch has been
captured.
-- 
David "liorean" Andersson



More information about the Es4-discuss mailing list