Look-behind proposal in trouble

Nozomu Katō noz.ka at akenotsuki.com
Fri Oct 9 13:00:55 UTC 2015

Erik Corry wrote on Fri, 9 Oct 2015, at 10:52:09 +0200:
> I made an implementation of .NET-style variable length lookbehinds.  It's
> not in a JS engine, but it's in a very simple (and very slow)
> ES5-compatible regexp engine that is used in the tiny Dart implementation
> named Fletch.
> No unicode issues arise since this engine does not support /u, but I don't
> expect any issues since it's not trying to second-guess the length of  the
> string matched by an expression.
> Needs a lot more tests, but it seems to work OK and was surprisingly simple
> to do.  Basically:
> * All steps in the input string are reversed, so if you would step forwards
> you step backwards.
> * Check for start of string instead of end of string.
> * Test against the character to the left of the cursor instead of to the
> right.
> * The parts of the Alternative (see the regexp grammar in the standard) are
> code-generated in reverse order.
> Code is here: https://codereview.chromium.org/1398033002/

Me too; I once implemented lookbehind assertions in this way in SRELL,
my C++ template library whose engine is compatible with ECMAScript's
RegExp but whose class design is compatible with C++'s std::regex [1].
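
For readers who want to see the idea, the steps in the quoted list can be
sketched roughly as below. This is only an illustration in Python, not
code from SRELL or Fletch; the function name match_backward and the
representation of a pattern as a list of character sets are assumptions
made for the example, and quantifiers and backtracking are left out.

```python
def match_backward(parts, text, pos):
    """Match `parts` so that the match ends at `pos`, moving right to left.

    `parts` is a hypothetical pattern representation: a list of character
    sets in normal left-to-right order, e.g. ["a", "bc"] means a[bc].
    Returns the cursor where the match begins, or None on failure.
    """
    cursor = pos
    # The parts of the alternative are visited in reverse order.
    for allowed in reversed(parts):
        # Check for start of string instead of end of string.
        if cursor == 0:
            return None
        # Test the character to the left of the cursor.
        if text[cursor - 1] not in allowed:
            return None
        # Step backwards where a normal matcher would step forwards.
        cursor -= 1
    return cursor
```

With this sketch, match_backward(["a", "bc"], "xxab", 4) walks from the
end of "xxab" back over "b" and then "a".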

However, I later removed the code for such lookbehinds and adopted
Perl5-style lookbehinds instead. The core reasons are:

1. Right-to-left matchers are used only in lookbehind assertions;
2. Nevertheless, these cannot share code with normal (left-to-right)
   matchers and need their own optimization processes.

Thus, I came to feel that what I can get and what I have to do are
out of balance.

As I understand it, the only features that are available in .NET-style
lookbehinds but are unavailable in Perl5-style lookbehinds, and cannot
even be emulated there, are 1) backreferences and 2) quantifiers other
than {n}. Everything else can be emulated in some way.

For example, the positive multiple-length lookbehind (?<=ab|cde) can be
replaced with (?:(?<=ab)|(?<=cde)). Replacing a negative multiple-length
lookbehind is simpler: just write the assertions in succession; for
example, (?<!ab|cde) can be written as (?<!ab)(?<!cde).
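
The substitution can be tried out in any engine that has fixed-length
lookbehinds. Below is a small sketch using Python's re module, chosen
only because it is easy to run; the trailing X and the sample text are
assumptions added for the example.

```python
import re

# (?<=ab|cde)X rewritten as an alternation of fixed-length lookbehinds.
positive = re.compile(r"(?:(?<=ab)|(?<=cde))X")

# (?<!ab|cde)X rewritten as negative lookbehinds written in succession.
negative = re.compile(r"(?<!ab)(?<!cde)X")


def match_starts(pattern, text):
    """Return the start offsets of all matches of `pattern` in `text`."""
    return [m.start() for m in pattern.finditer(text)]


text = "abX cdeX zzX"
preceded = match_starts(positive, text)      # X after "ab" or "cde"
not_preceded = match_starts(negative, text)  # X after neither of them
```

Here `preceded` holds the offsets of the first two X's and `not_preceded`
holds the offset of the last one.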

I guess that Oniguruma supports expressions like (?<=ab|cde) by doing
such substitutions inside the library, but that is just my guess.

So I came to feel that Perl5-style lookbehinds strike a good balance,
though they may not be the best choice. In fact, the current
implementation of lookbehinds in my library is far simpler; it shares
code with lookaheads. If the rewind count is 0, it means lookahead;
otherwise (1 or more) it means lookbehind.
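
The rewind-count idea might look roughly like the following. This is a
hypothetical Python sketch and not SRELL's actual code; the function
look_around and its parameters are names invented for the example, and it
assumes the sub-pattern of a lookbehind always matches exactly `rewind`
characters.

```python
import re


def look_around(sub_pattern, text, pos, rewind=0):
    """Test a positive lookaround at `pos` with one shared code path.

    rewind == 0 means lookahead; rewind >= 1 means a fixed-length
    lookbehind that first moves the cursor `rewind` characters back.
    """
    start = pos - rewind
    if start < 0:
        # Not enough characters to the left of the cursor.
        return False
    # From here on, the ordinary left-to-right matcher does all the work.
    m = re.compile(sub_pattern).match(text, start)
    if m is None:
        return False
    # A lookbehind must end exactly where the cursor was.
    return rewind == 0 or m.end() == pos
```

For instance, look_around("cd", "abcdef", 2) acts as the lookahead
(?=cd) at offset 2, while look_around("ab", "abcdef", 2, rewind=2) acts
as the lookbehind (?<=ab) at the same offset.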

If we were to introduce .NET-style lookbehinds into ECMAScript's RegExp,
someone would need to write right-to-left versions of most of the
definitions under 21.2 of the specification.


[1] http://www.akenotsuki.com/misc/srell/en/
