RegExp lookbehind

Erik Corry erik.corry at gmail.com
Sun Mar 18 01:00:25 PDT 2012


2012/3/18 Steven L. <steves_list at hotmail.com>:
> Lasse Reichstein wrote:
>
>>> I would simply apply same logic we have already for the look ahead ... or
>>> you think that would cause problems?
>>
>>
>> I'm not sure it even makes sense.
>>
>> ES RegExps are backtracking based, and it makes a difference in which
>> order alternatives are tried. Greedy matching is defined in terms of
>> number of repetitions, not length of the match. All of these are
>> defined in a way that assumes left-to-right matching.
>>
>> Example:
>>  Take the RegExp  /(?<((?:aa|aaa)+))b/  where (?< ... ) delimits the
>> look-behind.
>>  and try matching it on the string "xaaaaaaaaab".
>>  Then tell me how many a's are captured by the capturing group, and why :)
>>
>> The most "intuitive" interpretation would be a reverse implementation
>> of the normal matching algorithm, i.e., "backwards matching", but that
>> would likely duplicate the entire RegExp semantics (or parameterize it
>> by a direction).
>>
>> Any attempt to use the normal (forward) semantics and then try to find
>> an earlier point to start it at is likely to be either flawed or
>> effectively unpredictable to users.
>
>
> Technically, you're right. They're different. But they can appear exactly
> the same by implementing lookbehind as a zero-length assertion of
> (?:lookbehind)$ matched against the lookbehind's left context, starting from
> the very start of the subject string. Although people thinking about
> implementation might come to think of some other approach as more intuitive,
> from my experience every single plain-old-developer unconcerned about
> implementation thinks of the semantics I just described as intuitive. It is
> also how every single implementation of lookbehind that I know of actually
> works.
>
> The reason that all major regex flavors except .NET don't support lookbehind
> is because it's inefficient to re-search from the very beginning of an
> arbitrarily long string. That's why they support fixed- or finit-length
> lookbehind only--if they can determine the maximum distance backward they
> need to search forward from, they can step back only that many characters.

No wonder that look-behinds have a reputation for poor performance if
this is how it's done.

> In practice, at least for finite- rather than fixed-length lookbehind, this
> attempt to avoid far-back searches is kind of silly--e.g., Java lets you use
> a quantifier like {0,100000} within lookbehind.

At least this provides something to point the user at and say "look,
this is why you have bad performance".

> The Right-to-Left Mode that powers .NET's lookbehind is pretty neat. It
> magically follows the plain-old-developer's intuitive expectation while
> working backword rather than from the start of the string. Unfortunately,
> how it actually works is fairly mysterious. Although it works fairly
> reliably, as I previously mentioned it can occasionally be a bit
> buggy/weird.
>
>
>> And you will probably never achieve that /(<re>)$/ and /(?<(re))$/
>> always capture the same substring :)
>
>
> Apart from potential bugs, (<re>)$ and (?<=(<re>))$ capture the same string
> in every implementation of lookbehind that I know of.

Really?  I don't have a .NET implementation handy, but I'll make a
prediction based on the description of its algorithm.  Lets look at
the following.  (replace the '+'s with {1,1000} for the sake of the
non-.NET regexps, I'm not going to type that ugliness here):

/(x(.+?))$/
/(?<(x(.+?))$/

Give these regexps the input: "x foo x bar".  The first should match
"x foo x bar" and the second will match "x bar" as the non-greedy
quantifier will stop when it finds the nearest x from the end.  But in
a regexp engine with the "start at the earliest point and search
forwards" kludge they will return the same.

Sadly this means that if we settle on the non-.NET way to implement
look-behind it will be user-visible, so that implementations will not
have the option of using the efficient .NET algorithm.  Also we will
never be able to get rid of the limitation on infinite length
lookbehind without losing backwards compatibility.

-- 
Erik Corry


More information about the es-discuss mailing list