Suggested RegExp Improvements

Lasse Reichstein reichsteinatwork at gmail.com
Tue Nov 16 04:15:04 PST 2010


[Unterminated statement detected, fixing ...]

On Tue, 16 Nov 2010 13:12:36 +0100, Lasse Reichstein  
<reichsteinatwork at gmail.com> wrote:

> On Mon, 15 Nov 2010 16:23:13 +0100, Marc Harter <wavded at gmail.com> wrote:
>
> [look-behind allowing variable length body]
>>
>> This was not my intention.  I am proposing zero-width lookbehind, which
>> would not allow for the case you specified above.
>
> The grammar allows it. In ECMAScript it would be:
>   "foobarbaz".match(/a(?<=(ob|bab)?)/)
> which would match the first "a".
> Had it been written
>   "foobarbaz".match(/a(?<=(ob|bab)?.)/)

... then it would match "a" and capture "ob", assuming semantics symmetric
to look-ahead.
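
A minimal sketch of the two matches above, assuming an engine that implements look-behind with backwards matching (as later standardized in ES2018, which postdates this thread). The variable names are illustrative only.

```javascript
// Assumption: the engine supports (?<=...) look-behind (e.g. a recent V8).
const input = "foobarbaz";

// Neither "ob" nor "bab" ends right after the first "a", so the optional
// group matches empty and the capture does not participate.
const m1 = input.match(/a(?<=(ob|bab)?)/);
console.log(m1.index, m1[0], m1[1]); // index 4, match "a", capture undefined

// With the trailing ".", the look-behind body (read backwards) first consumes
// the "a" itself, and "ob" then fits immediately before it.
const m2 = input.match(/a(?<=(ob|bab)?.)/);
console.log(m2.index, m2[0], m2[1]); // index 4, match "a", capture "ob"
```

Both patterns match the same "a"; only the capture differs, which is why the body's variable length matters.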

>
>> I will update the
>> proposal.  It is my understanding that lookahead as implemented in
>> ECMAScript also is zero-width and not variable.  This is also how Perl
>> has implemented lookbehind.
>
> The look-ahead in ECMAScript has a Disjunction as content, which basically
> means that it can contain *any* RegExp (including quantified statements and
> other lookaheads).
> This works fine because the semantics of the disjunction is the same as any
> other disjunction in a RegExp: it's matched forwards from a position in the
> input.
>
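
To illustrate the point about look-ahead bodies, here is a small sketch of a look-ahead whose body contains an alternation, a quantifier, and a nested negative look-ahead. The pattern and sample strings are made up for the example.

```javascript
// The look-ahead body is a full Disjunction: alternation, "+" quantifier,
// and a nested negative look-ahead are all allowed inside it.
const fooBeforeBars = /foo(?=(?:bar|baz)+(?!qux))/;

const hit = fooBeforeBars.exec("foobarbaz");  // "barbaz" follows, no "qux" after it
const miss = fooBeforeBars.exec("foobarqux"); // "bar" is followed by "qux"

console.log(hit && hit[0]); // "foo": the look-ahead consumes nothing
console.log(miss);          // null
```
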
> Your proposal also uses a Disjunction as body, but it's not specified how to
> evaluate that body so that it *ends* at the position of the assertion.
> Executing a RegExp "backwards" isn't trivial. Well, mostly it is, by
> symmetry, but it's not part of the spec.
>
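
One way to picture the requirement that the body must *end* at the assertion position is the forward-matching emulation sketched below. It is not backwards execution and not how any particular engine works: `matchesEndingAt` is a hypothetical helper, and it ignores details such as flags, captures, and look-aheads inside the body that would need to see text past the position.

```javascript
// Hypothetical helper: is there a match of `body` that ends exactly at `pos`?
// It anchors the body to the end of the prefix input[0..pos) and lets the
// ordinary forward matcher search for a suitable start position.
function matchesEndingAt(body, input, pos) {
  const anchored = new RegExp("(?:" + body + ")$");
  return anchored.test(input.slice(0, pos));
}

const s = "foobarbaz";
console.log(matchesEndingAt("ob|bab", s, 4));     // true: "foob" ends with "ob"
console.log(matchesEndingAt("ob|bab", s, 5));     // false: "fooba" ends with neither
console.log(matchesEndingAt("(ob|bab)?.", s, 5)); // true: "oba" ends at position 5
```

Because this searches start positions from the left, the captures it would produce can differ from those of a matcher that genuinely runs the body right to left.
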
> The positive look-behind should probably be allowed to contain captures
> that are still participating after the assertion succeeds (mirroring the
> semantics of the positive look-ahead).
>
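
A short sketch of what "captures still participating" means. The look-ahead half is existing ECMAScript behaviour; the look-behind half assumes an engine with ES2018-style look-behind, which did not exist when this was written.

```javascript
// Positive look-ahead: the group inside the assertion keeps its value
// after the assertion succeeds, even though nothing was consumed.
const ahead = /(?=(\d+))\w/.exec("123abc");
console.log(ahead[0], ahead[1]); // match "1", capture "123"

// Assumed symmetric behaviour for look-behind (ES2018 semantics):
// the capture made while looking backwards is visible afterwards.
const behind = /(?<=(\d+))a/.exec("123abc");
console.log(behind[0], behind[1]); // match "a", capture "123"
```
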
> I believe PCRE allows variable length (but structurally simple)
> look-behinds, where the structure ensures that it doesn't have to do
> backtracking while checking them, even though Perl itself does not [1].
> Whether that's a desired property or not is a different question (I would
> actually prefer a full backwards-executed regexp to an artificial
> restriction, but that's mainly ideology :).
>
> /L
> [1] http://www.regular-expressions.info/lookaround.html
>
>>
>> http://perldoc.perl.org/perlre.html#Extended-Patterns
>>
>> Updated Proposal:
>> https://docs.google.com/document/pub?id=1EUHvr1SC72g6OPo5fJjelVESpd4nI0D5NQpF3oUO5UM
>>
>>
>>> Is there an example of a language that supports the full regexp power
>>> in lookbehinds so we can look at their experiences with implementing
>>> it?
>>
>>
>> As far as I know Perl is the de facto standard.
>>
>>
>>>
>>>
>>> 2010/11/15 Marc Harter <wavded at gmail.com>:
>>> > Brendan et al.,
>>> >
>>> > I have created a proposal for look-behind provided at this link:
>>> >
>>> > https://docs.google.com/document/pub?id=1EUHvr1SC72g6OPo5fJjelVESpd4nI0D5NQpF3oUO5UM
>>> >
>>> > I hope it is a format that will be helpful for discussion with TC39.
>>> > Admittedly, I have never written one of these before so am completely
>>> > open to any feedback or ways to improve the document from yourself or
>>> > anyone else on this list.
>>> >
>>> > Marc
>>> >
>>> > On Sat, 2010-11-13 at 09:32 -0600, Marc Harter wrote:
>>> >
>>> > I would be game to write up a proposal for this.  When would you need
>>> > this by to discuss w/ TC39?
>>> >
>>> > Thanks for your consideration,
>>> > Marc
>>> >
>>> > On Nov 12, 2010, at 5:04 PM, Brendan Eich <brendan at mozilla.com> wrote:
>>> >
>>> >> On Nov 12, 2010, at 2:52 PM, Marc Harter wrote:
>>> >>
>>> >>> After considering all the breadth this discussion could take, maybe
>>> >>> it would be wise to just focus on one issue at a time.  For me, the
>>> >>> biggest missing feature is lookbehind.  It's common to most languages
>>> >>> implementing the Perl RegExp syntax, and it is very useful when
>>> >>> looking for patterns that follow or don't follow a particular pattern.
>>> >>> I guess I'm confused why lookahead made it in but not lookbehind.
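
As a sketch of the "follow or don't follow a particular pattern" use case, assuming an engine with look-behind support (ES2018 or later, so not available at the time of this thread); the sample text is invented:

```javascript
// Assumption: the engine supports (?<=...) and (?<!...) look-behind.
const text = "cost $42 and 17 items";

// Numbers that directly follow a dollar sign.
const afterDollar = text.match(/(?<=\$)\d+/g);

// Numbers that do not follow a dollar sign. The \d in the class keeps the
// match from starting in the middle of "42".
const notAfterDollar = text.match(/(?<![$\d])\d+/g);

console.log(afterDollar);    // the single match "42"
console.log(notAfterDollar); // the single match "17"
```
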
>>> >>
>>> >> This was 1998, Netscape 4 work I did in '97 was based on Perl 4(!), but
>>> >> we proposed to ECMA TC39 TG1 (the JS group -- things were different
>>> >> then, including capitalization) something based on Perl 5. We didn't
>>> >> get everything, and we had to rationalize some obvious quirks.
>>> >>
>>> >> I don't remember lookbehind (which emerged in Perl 5.005 in July '98)
>>> >> being left out on purpose. Waldemar may recall more, I'd handed him the
>>> >> JS keys inside netscape.com to go do mozilla.org.
>>> >>
>>> >> If you are game to write a proposal or mini-spec (in the style of ES5
>>> >> even), let me know. I'll chat with other TC39'ers next week about this.
>>> >>
>>> >> /be
>>> >>
>>> >>
>>> >>> What do people
>>> >>> think about including this feature?
>>> >>>
>>> >>> Marc
>>> >>>
>>> >>> On Fri, 2010-11-12 at 16:20 -0600, Marc Harter wrote:
>>> >>>> I will start out with a disclaimer.  I have not read both ECMAScript
>>> >>>> specifications for 3 and now 5, so I admit that I am not an expert in
>>> >>>> the spec itself, but as a user of JavaScript, I would like to get some
>>> >>>> expert discussion over this topic as proposed enhancements to the
>>> >>>> RegExp engine for Harmony.
>>> >>>>
>>> >>>> I will start with a list of lacking features in JS as compared to Perl
>>> >>>> provided by (http://www.regular-expressions.info/javascript.html):
>>> >>>>
>>> >>>>     * No \A or \Z anchors to match the start or end of the string.
>>> >>>>       Use a caret or dollar instead.
>>> >>>>     * Lookbehind is not supported at all. Lookahead is fully
>>> >>>>       supported.
>>> >>>>     * No atomic grouping or possessive quantifiers
>>> >>>>     * No Unicode support, except for matching single characters with
>>> >>>>       \uFFFF
>>> >>>>     * No named capturing groups. Use numbered capturing groups
>>> >>>>       instead.
>>> >>>>     * No mode modifiers to set matching options within the regular
>>> >>>>       expression.
>>> >>>>     * No conditionals.
>>> >>>>     * No regular expression comments. Describe your regular
>>> >>>>       expression with JavaScript // comments instead, outside the
>>> >>>>       regular expression string.
>>> >>>>
>>> >>>> I don't know if all of these "need" to be in the language but there
>>> >>>> have been some that I have personally wanted to use:
>>> >>>>
>>> >>>>     * Lookbehind!  ECMAScript fully supports lookahead, why not
>>> >>>>       lookbehind?  Seems like a big hole to me.
>>> >>>>     * Named capturing groups and comments (e.g.
>>> >>>>       http://xregexp.com/syntax/).  Mostly I argue for this because
>>> >>>>       it makes RegExp matches more self-documenting.  Regular
>>> >>>>       Expressions are already cryptic as it is.
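
A sketch of the self-documenting effect of named capturing groups, using the `(?<name>...)` syntax that ECMAScript adopted later (ES2018); this is not the XRegExp syntax linked above, and the date pattern is only an example.

```javascript
// Assumption: the engine supports ES2018 named capture groups.
const dateRe = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/;
const dateMatch = dateRe.exec("posted 2010-11-16");

// Groups are readable by name instead of by position.
console.log(dateMatch.groups.year);  // "2010"
console.log(dateMatch.groups.month); // "11"
console.log(dateMatch.groups.day);   // "16"

// The numbered form still works and refers to the same text.
console.log(dateMatch[1] === dateMatch.groups.year); // true
```
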
>>> >>>>
>>> >>>> I do like some of the new flags proposed in
>>> >>>> (http://xregexp.com/flags/). I haven't used them personally, but maybe
>>> >>>> that is also something for discussion.
>>> >>>>
>>> >>>> Marc Harter
>>> >>>
>>> >>> _______________________________________________
>>> >>> es-discuss mailing list
>>> >>> es-discuss at mozilla.org
>>> >>> https://mail.mozilla.org/listinfo/es-discuss
>>> >>
>>> >
>>> >
>>> >
>
>


-- 
Lasse Reichstein - reichsteinatwork at gmail.com

