RegExp lookbehind

Steven L. steves_list at hotmail.com
Sun Mar 18 11:04:23 PDT 2012


Erik Corry wrote:
> Steven Levithan wrote:
>> In practice, at least for [Java-style finite-length] lookbehind, this
>> attempt to avoid far-back searches is kind of silly--e.g., Java lets you 
>> use
>> a quantifier like {0,100000} within lookbehind.
>
> At least this provides something to point the user at and say "look,
> this is why you have bad performance".

Fair point.

Erik Corry wrote:
> Really?  I don't have a .NET implementation handy, but I'll make a
> prediction based on the description of its algorithm.  Lets look at
> the following.  (replace the '+'s with {1,1000} for the sake of the
> non-.NET regexps, I'm not going to type that ugliness here):
>
> /(x(.+?))$/
> /(?<(x(.+?))$/
>
> Give these regexps the input: "x foo x bar".  The first should match
> "x foo x bar" and the second will match "x bar" as the non-greedy
> quantifier will stop when it finds the nearest x from the end.  But in
> a regexp engine with the "start at the earliest point and search
> forwards" kludge they will return the same.

You're right for .NET. But I must add: I wasn't really considering the 
nongreedy case (which should have been obvious!) in any of my previous 
descriptions of lookbehind behavior or developer expectations for it. Boo, 
me! :-( I've just done some testing that also shows I've been 
wrong/oversimplifying/overgeneralizing in my descriptions of lookbehind 
behavior and implementation across multiple regex flavors (even in the 
greedy case).

Let me give some real results. I'll leave out comparisons with 
/(nonlookbehind)$/ since I think everyone interested in this discussion 
knows what those results should be.

/(?<=(\d+))$/ with "123":
    * .NET: $1 == "123".
    * Java: [Unsupported]
    * PCRE: [Unsupported]

/(?<=(\d{1,3}))$/ with "123":
    * .NET: $1 == "123".
    * Java: $1 == "3".
    * PCRE: [Unsupported]

/(?<=(\d{1,3}?))$/ with "123":
    * .NET: $1 == "3".
    * Java: $1 == "3".
    * PCRE: [Unsupported]

/(?<=(\d{1,3})(\d{1,3}?))$/ with "123":
    * .NET: $1 == "12", $2 == "3".
    * Java: $1 == "2", $2 == "3".
    * PCRE: [Unsupported]

/(?<=(\d{1,3}?)(\d{1,3}))$/ with "123":
    * .NET: $1 == "1", $2 == "23".
    * Java: $1 == "2", $2 == "3".
    * PCRE: [Unsupported]

/(?<=(\d)|(\d\d)|(\d\d\d))$/ with "123":
    * .NET: $1 == "3". [$2 and $3 don't participate.]
    * Java: $1 == "3". [$2 and $3 don't participate.]
    * PCRE: $1 == "3". [$2 and $3 don't participate.]

/(?<=(\d\d\d)|(\d\d)|(\d))$/ with "123":
    * .NET: $1 == "123". [$2 and $3 don't participate.]
    * Java: $3 == "3". [$1 and $2 don't participate.]
    * PCRE: $1 == "123". [$2 and $3 don't participate.]

Notes:

* There is more to test (e.g., nested lookbehind, supported by .NET, Java, 
and PCRE), but this should demonstrate the basic semantics used by the three 
major implementations that support capturing within variable-length 
lookbehind.

* Oniguruma (used by default in Ruby 1.9) supports /(?<=a|bc)/ but not 
/(?<=ab?)/ or /(?<=a(?:b|cd))/. It also doesn't support capturing groups in 
lookbehind.

* Perl, Python, and Boost.Regex support fixed-length lookbehind only: 
(?<=abc) or (?<=a|b).

* Ruby 1.8, Tcl, XML Schema/XPath, RE2, and ERE/BRE do not support 
lookbehind at all.

* I haven't tested ICU Regular Expressions, but their docs say they support 
finite-length lookbehind (without specifics).

IMO, .NET's lookbehind is the flagship/ideal, both in its support for all 
regex syntax and in it's greedy/nongreedy behavior.

This is probably as good a place as any to note that, should the committee 
feel compelled to reconsider (or augment) lookbehind support, there is 
another simpler and more efficient idea that works similar to lookbehind at 
the start of a pattern: \K ("keep"). \K was invented by Jeff Pinyan (see 
[1]) and made it's way into Perl 5.10, PCRE 7.2, and Boost.Regex (ver?).

\K effectively resets the match start wherever it is encountered, causing 
any previously matched characters not to be included in the final match 
(e.g., /foo\Kbar/ matches "foobar", but reports that it matched "bar"). This 
avoids lookbehind-esque concerns about infinite/finite/fixed length. Also 
note that \K doesn't interfere with the setting of backreferences (e.g., 
/(foo)\Kbar/ sets $1 to "foo"), and can be used within alternatives (e.g., 
/foo|bar\Kbaz/ matches "foo" or "barbaz", but reports matching "foo" or 
"baz") or repeated groups, etc.

\K has a couple issues of its own, though:

    * Perl docs say \K's behavior in lookaround is "not well defined". In 
PCRE, \K is acted upon within positive lookaround but ignored in negative 
lookaround.

    * \K in ES would open the possibility of zero-length matches that change 
the value of RegExp.prototype.lastIndex. This isn't a problem on its own, 
but it can effect how accompanying code must be written. It could also break 
code unwise to \K that accepts arbitrary regexes from an outside source.

-- Steven Levithan

[1]: http://search.cpan.org/~pinyan/Regexp-Keep-0.02/Keep.pm




More information about the es-discuss mailing list