Look-behind proposal
Claude Pache
claude.pache at gmail.com
Mon May 18 18:51:52 UTC 2015
Old thread of interest:
https://esdiscuss.org/topic/regexp-lookbehind <https://esdiscuss.org/topic/regexp-lookbehind>
I've also personally thought quite thoroughly on how to add look-behind assertions in ES RegExps.
Your approach imposes that the exact length of a string matched by the sub-pattern `( ? < = Disjunction )` is statically determinable This is, indeed, the original semantics of Perl 5.
Another more powerful approach (which I prefer), is the one followed by .NET: the sub-pattern `( ? < = Disjunction )` is tentatively matched _backwards_. The advantage is that it does not impose any restriction on the subpattern, while remaining as efficient as possible (in theory, about as efficient as an equivalent look-ahead assertion). The main drawback is that it may be somewhat confusing for some types of complicated subpatterns, especially for those that contain backreferences. (Also it is more mind-bending to spec.)
That said, I think there are easier missing features to be added, that are supported by most other RegExp flavours. I'm thinking of:
* the `s`, or `.dotall` flag: the dot `.` matches every character, including newlines;
* true support of Unicode, namely: escape sequences such as `\Lu` for uppercase letter, or `\X` for grapheme cluster.
—Claude
> Le 18 mai 2015 à 15:50, Nozomu Katō <noz.ka at akenotsuki.com> a écrit :
>
> I submit a proposal for adding the look-behind assertions to RegExp.
>
> ----
> Abstract
>
> Basically, the evaluation for the look-behind assertion is performed by
> "rewind the target sequence and do the same thing as the look-ahead
> assertion":
>
> 1. When compiling a regular expression pattern, count the total number
> of PatternCharacter and . and \ AtomEscape and CharacterClass in
> Disjunction per look-behind assertion. Each Atom may be followed by
> {n}, but may not be followed by the other quantifiers.
> 2. When a look-behind assertion evaluated, rewind the target sequence
> by that number and evaluate as if the look-ahead assertion from
> that point.
>
> --
> Static Semantics: Early Errors (21.2.1.1 in ECMAScript Specifition 6)
>
> Assertion :: ( ? < = Disjunction )
> Assertion :: ( ? < ! Disjunction )
>
> It is a Syntax Error if Disjunction contains Quantifier ::
> QuantifierPrefix except QuantifierPrefix :: { DecimalDigits }.
>
> When Disjunction is separated by the | regular expression, if the
> number of the sequence of code points which the left Alternative
> matches and the number of the sequence of code points which the right
> Disjunction matches are different, then it is a Syntax Error.
>
>
> Assertion (21.2.2.6 in ECMAScript Specifition 6)
>
> The production Assertion :: ( ? < = Disjunction ) evaluates as follows:
>
> 1. Evaluate Disjunction to obtain a Matcher m.
> 2. Return an internal Matcher closure that takes two arguments, a State
> x and a Continuation c, and performs the following steps:
> 1. Let n be the exact number of the sequence of code points which
> Disjunction matches [NOTE].
> 2. Let xe be x's endIndex.
> 3. If xe < n, return failure.
> 4. Let xcap be x's captures List.
> 5. Let y be the State (xe-n, xcap).
> 6. Let d be a Continuation that always returns its State argument as a
> successful MatchResult.
> 7. Call m(y, d) and let r be its result.
> 8. If r is failure, return failure.
> 9. Let z be r's State.
> 10. Let zcap be z's captures List.
> 11. Let a be the State (xe, zcap).
> 12. Call c(a) and return its result.
>
> The production Assertion :: ( ? < ! Disjunction ) evaluates as follows:
>
> 1. Evaluate Disjunction to obtain a Matcher m.
> 2. Return an internal Matcher closure that takes two arguments, a State
> x and a Continuation c, and performs the following steps:
> 1. Let n be the exact number of the sequence of code points which
> Disjunction matches [NOTE].
> 2. Let xe be x's endIndex.
> 3. If xe < n, then call c(x) and return its result.
> 4. Let xcap be x's captures List.
> 5. Let y be the State (xe-n, xcap).
> 6. Let d be a Continuation that always returns its State argument as a
> successful MatchResult.
> 7. Call m(y, d) and let r be its result.
> 8. If r is not failure, return failure.
> 9. Call c(x) and return its result.
>
> NOTE: n is computed by counting and summing up the following Atoms in
> the Disjunction:
> a. Atom :: PatternCharacter
> b. Atom :: .
> c. Atom :: \ AtomEscape
> d. Atom :: CharacterClass
>
> If an Atom is followed by a "QuantifierPrefix :: { DecimalDigits }",
> then that Atom is counted as the number of DecimalDigits instead of one.
>
> If the Disjunction contains an "Atom :: ( Disjunction )" or "Atom :: ( ?
> : Disjunction )" that is followed by a "QuantifierPrefix :: {
> DecimalDigits }", then the number of code points which that
> inner-Disjunction matches is multiplied by the number of DecimalDigits.
>
> ----
>
> If it is not necessarily important to strictly follow the syntax of Perl
> 5, then adopting the following syntax may make implemetation simple:
>
> (?<{n}= Disjunction)
> (?<{n}! Disjunction)
>
> where n in {} is the number of code points by which the target sequence
> is rewound when a look-behind assertion is evaluated. In the case of
> this syntax, it is not required to count Atoms in Disjunction and the
> number to rewind the sequence of code points is taken from n directly.
>
>
> Regards,
> Nozomu
>
> _______________________________________________
> es-discuss mailing list
> es-discuss at mozilla.org
> https://mail.mozilla.org/listinfo/es-discuss
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.mozilla.org/pipermail/es-discuss/attachments/20150518/a792b5e6/attachment.html>
More information about the es-discuss
mailing list