Look-behind proposal

Nozomu Katō noz.ka at akenotsuki.com
Mon May 18 13:50:55 UTC 2015


I submit a proposal for adding the look-behind assertions to RegExp.

----
Abstract

Basically, the evaluation for the look-behind assertion is performed by
"rewind the target sequence and do the same thing as the look-ahead
assertion":

  1. When compiling a regular expression pattern, count the total number
     of PatternCharacter and . and \ AtomEscape and CharacterClass in
     Disjunction per look-behind assertion. Each Atom may be followed by
     {n}, but may not be followed by the other quantifiers.
  2. When a look-behind assertion evaluated, rewind the target sequence
     by that number and evaluate as if the look-ahead assertion from
     that point.

--
Static Semantics: Early Errors (21.2.1.1 in ECMAScript Specifition 6)

  Assertion :: ( ? < = Disjunction )
  Assertion :: ( ? < ! Disjunction )

  It is a Syntax Error if Disjunction contains Quantifier ::
  QuantifierPrefix except QuantifierPrefix :: { DecimalDigits }.

  When Disjunction is separated by the | regular expression, if the
  number of the sequence of code points which the left Alternative
  matches and the number of the sequence of code points which the right
  Disjunction matches are different, then it is a Syntax Error.


Assertion (21.2.2.6 in ECMAScript Specifition 6)

The production Assertion :: ( ? < = Disjunction ) evaluates as follows:

1. Evaluate Disjunction to obtain a Matcher m.
2. Return an internal Matcher closure that takes two arguments, a State
   x and a Continuation c, and performs the following steps:
  1. Let n be the exact number of the sequence of code points which
     Disjunction matches [NOTE].
  2. Let xe be x's endIndex.
  3. If xe < n, return failure.
  4. Let xcap be x's captures List.
  5. Let y be the State (xe-n, xcap).
  6. Let d be a Continuation that always returns its State argument as a
     successful MatchResult.
  7. Call m(y, d) and let r be its result.
  8. If r is failure, return failure.
  9. Let z be r's State.
  10. Let zcap be z's captures List.
  11. Let a be the State (xe, zcap).
  12. Call c(a) and return its result.

The production Assertion :: ( ? < ! Disjunction ) evaluates as follows:

1. Evaluate Disjunction to obtain a Matcher m.
2. Return an internal Matcher closure that takes two arguments, a State
   x and a Continuation c, and performs the following steps:
  1. Let n be the exact number of the sequence of code points which
     Disjunction matches [NOTE].
  2. Let xe be x's endIndex.
  3. If xe < n, then call c(x) and return its result.
  4. Let xcap be x's captures List.
  5. Let y be the State (xe-n, xcap).
  6. Let d be a Continuation that always returns its State argument as a
     successful MatchResult.
  7. Call m(y, d) and let r be its result.
  8. If r is not failure, return failure.
  9. Call c(x) and return its result.

NOTE: n is computed by counting and summing up the following Atoms in
the Disjunction:
  a. Atom :: PatternCharacter
  b. Atom :: .
  c. Atom :: \ AtomEscape
  d. Atom :: CharacterClass

If an Atom is followed by a "QuantifierPrefix :: { DecimalDigits }",
then that Atom is counted as the number of DecimalDigits instead of one.

If the Disjunction contains an "Atom :: ( Disjunction )" or "Atom :: ( ?
: Disjunction )" that is followed by a "QuantifierPrefix :: {
DecimalDigits }", then the number of code points which that
inner-Disjunction matches is multiplied by the number of DecimalDigits.

----

If it is not necessarily important to strictly follow the syntax of Perl
5, then adopting the following syntax may make implemetation simple:

  (?<{n}= Disjunction)
  (?<{n}! Disjunction)

where n in {} is the number of code points by which the target sequence
is rewound when a look-behind assertion is evaluated. In the case of
this syntax, it is not required to count Atoms in Disjunction and the
number to rewind the sequence of code points is taken from n directly.


Regards,
  Nozomu



More information about the es-discuss mailing list