Look-behind proposal

Claude Pache claude.pache at gmail.com
Mon May 18 18:51:52 UTC 2015


Old thread of interest:

https://esdiscuss.org/topic/regexp-lookbehind <https://esdiscuss.org/topic/regexp-lookbehind>

I've also personally thought quite thoroughly on how to add look-behind assertions in ES RegExps.

Your approach imposes that the exact length of a string matched by the sub-pattern `( ? < = Disjunction )` is statically determinable This is, indeed, the original semantics of Perl 5. 

Another more powerful approach (which I prefer), is the one followed by .NET: the sub-pattern `( ? < = Disjunction )` is tentatively matched _backwards_. The advantage is that it does not impose any restriction on the subpattern, while remaining as efficient as possible (in theory, about as efficient as an equivalent look-ahead assertion). The main drawback is that it may be somewhat confusing for some types of complicated subpatterns, especially for those that contain backreferences. (Also it is more mind-bending to spec.)

That said, I think there are easier missing features to be added, that are supported by most other RegExp flavours. I'm thinking of:

* the `s`, or `.dotall` flag: the dot `.` matches every character, including newlines;
* true support of Unicode, namely: escape sequences such as `\Lu` for uppercase letter, or `\X` for grapheme cluster.

—Claude


> Le 18 mai 2015 à 15:50, Nozomu Katō <noz.ka at akenotsuki.com> a écrit :
> 
> I submit a proposal for adding the look-behind assertions to RegExp.
> 
> ----
> Abstract
> 
> Basically, the evaluation for the look-behind assertion is performed by
> "rewind the target sequence and do the same thing as the look-ahead
> assertion":
> 
>  1. When compiling a regular expression pattern, count the total number
>     of PatternCharacter and . and \ AtomEscape and CharacterClass in
>     Disjunction per look-behind assertion. Each Atom may be followed by
>     {n}, but may not be followed by the other quantifiers.
>  2. When a look-behind assertion evaluated, rewind the target sequence
>     by that number and evaluate as if the look-ahead assertion from
>     that point.
> 
> --
> Static Semantics: Early Errors (21.2.1.1 in ECMAScript Specifition 6)
> 
>  Assertion :: ( ? < = Disjunction )
>  Assertion :: ( ? < ! Disjunction )
> 
>  It is a Syntax Error if Disjunction contains Quantifier ::
>  QuantifierPrefix except QuantifierPrefix :: { DecimalDigits }.
> 
>  When Disjunction is separated by the | regular expression, if the
>  number of the sequence of code points which the left Alternative
>  matches and the number of the sequence of code points which the right
>  Disjunction matches are different, then it is a Syntax Error.
> 
> 
> Assertion (21.2.2.6 in ECMAScript Specifition 6)
> 
> The production Assertion :: ( ? < = Disjunction ) evaluates as follows:
> 
> 1. Evaluate Disjunction to obtain a Matcher m.
> 2. Return an internal Matcher closure that takes two arguments, a State
>   x and a Continuation c, and performs the following steps:
>  1. Let n be the exact number of the sequence of code points which
>     Disjunction matches [NOTE].
>  2. Let xe be x's endIndex.
>  3. If xe < n, return failure.
>  4. Let xcap be x's captures List.
>  5. Let y be the State (xe-n, xcap).
>  6. Let d be a Continuation that always returns its State argument as a
>     successful MatchResult.
>  7. Call m(y, d) and let r be its result.
>  8. If r is failure, return failure.
>  9. Let z be r's State.
>  10. Let zcap be z's captures List.
>  11. Let a be the State (xe, zcap).
>  12. Call c(a) and return its result.
> 
> The production Assertion :: ( ? < ! Disjunction ) evaluates as follows:
> 
> 1. Evaluate Disjunction to obtain a Matcher m.
> 2. Return an internal Matcher closure that takes two arguments, a State
>   x and a Continuation c, and performs the following steps:
>  1. Let n be the exact number of the sequence of code points which
>     Disjunction matches [NOTE].
>  2. Let xe be x's endIndex.
>  3. If xe < n, then call c(x) and return its result.
>  4. Let xcap be x's captures List.
>  5. Let y be the State (xe-n, xcap).
>  6. Let d be a Continuation that always returns its State argument as a
>     successful MatchResult.
>  7. Call m(y, d) and let r be its result.
>  8. If r is not failure, return failure.
>  9. Call c(x) and return its result.
> 
> NOTE: n is computed by counting and summing up the following Atoms in
> the Disjunction:
>  a. Atom :: PatternCharacter
>  b. Atom :: .
>  c. Atom :: \ AtomEscape
>  d. Atom :: CharacterClass
> 
> If an Atom is followed by a "QuantifierPrefix :: { DecimalDigits }",
> then that Atom is counted as the number of DecimalDigits instead of one.
> 
> If the Disjunction contains an "Atom :: ( Disjunction )" or "Atom :: ( ?
> : Disjunction )" that is followed by a "QuantifierPrefix :: {
> DecimalDigits }", then the number of code points which that
> inner-Disjunction matches is multiplied by the number of DecimalDigits.
> 
> ----
> 
> If it is not necessarily important to strictly follow the syntax of Perl
> 5, then adopting the following syntax may make implemetation simple:
> 
>  (?<{n}= Disjunction)
>  (?<{n}! Disjunction)
> 
> where n in {} is the number of code points by which the target sequence
> is rewound when a look-behind assertion is evaluated. In the case of
> this syntax, it is not required to count Atoms in Disjunction and the
> number to rewind the sequence of code points is taken from n directly.
> 
> 
> Regards,
>  Nozomu
> 
> _______________________________________________
> es-discuss mailing list
> es-discuss at mozilla.org
> https://mail.mozilla.org/listinfo/es-discuss

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.mozilla.org/pipermail/es-discuss/attachments/20150518/a792b5e6/attachment.html>


More information about the es-discuss mailing list