Suggested RegExp Improvements

Mike Samuel mikesamuel at gmail.com
Tue Nov 16 09:25:35 PST 2010


2010/11/16 Erik Corry <erik.corry at gmail.com>:
> 2010/11/15 Marc Harter <wavded at gmail.com>:
>> On Mon, 2010-11-15 at 14:06 +0100, Erik Corry wrote:
>>
>>> Your proposal seems to allow variable length lookbehind.  This isn't
>>> allowed in perl as far as I know.  I just tried the following:
>>
>>>  perl -e '"foobarbaz" =~ /a(?<=(ob|bab))/;'
>>
>>> which gives an error on perl5.  I think if we are going to allow
>>> variable length lookbehind we should first find out why they don't
>>> have it in perl.  I think the implementation is a little tricky if you
>>> want to support the full regexp language in lookbehinds.
>>
>> This was not my intention.  I am proposing zero-width lookbehind, which
>> would not allow for the case you specified above.  I will update the
>
> The issue is not with the number of characters consumed by the
> assertion.  This is indeed zero.  The issue is with the width of the
> text matched by the disjunction inside the brackets.  This is not any
> disjunction, but rather a restricted part of the regexp language that
> can only match a particular number of characters.
>
> It seems the .Net regexp library is able to handle arbitrary content
> in a lookbehind.  It is almost the only one.
>
> See http://www.regular-expressions.info/lookaround.html#lookbehind for
> more details.
>
> We could add this feature to JS.  As far as I can work out it
> presupposes the ability to reverse an arbitrary regexp and run it
> backwards (stepping back and backtracking forwards).  I don't think we
> should add it accidentally though, and perhaps the proposer should be
> the first to implement it.

Don't you already have to do that to efficiently handle a regexp that
ends at the end of the input (in JS, a non-multiline $, or \z in
java.util.regex parlance)?
If you have the whole input string available in memory, and are trying
to figure out whether a lookbehind (?<=x) matches at position p, can't
you just test /(?:x)$/ against the prefix of the input of length p?
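
A minimal sketch of that idea, to make the question concrete.  The
helper name lookbehindMatchesAt is made up for illustration, and this
is the naive version: the engine still scans the prefix from the left,
so it shows only the semantics, not an efficient implementation.  It
also ignores flags and any captures inside x.

```javascript
// Hypothetical helper (not part of any proposal): reports whether the
// lookbehind (?<=x) would succeed at position p of input.
// `source` is the pattern text of x.
function lookbehindMatchesAt(source, input, p) {
  // Anchor x at the end of the prefix. No 'm' flag is used, so $ only
  // matches at the very end of the string being tested.
  var anchored = new RegExp('(?:' + source + ')$');
  // The prefix of length p is everything before position p.
  return anchored.test(input.slice(0, p));
}

// Variable-length alternatives are fine here: "foob" ends in "ob".
var atFour = lookbehindMatchesAt('ob|foob', 'foobarbaz', 4);

// Erik's example: after the "a" at index 4 the position is 5, and the
// prefix "fooba" ends in neither "ob" nor "bab".
var atFive = lookbehindMatchesAt('ob|bab', 'foobarbaz', 5);
```

Here atFour is true and atFive is false.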


>> proposal.  It is my understanding that lookahead as implemented in
>> ECMAScript also is zero-width and not variable.  This is also how Perl has
>> implemented lookbehind.
>>
>> http://perldoc.perl.org/perlre.html#Extended-Patterns
>>
>> Updated Proposal:
>> https://docs.google.com/document/pub?id=1EUHvr1SC72g6OPo5fJjelVESpd4nI0D5NQpF3oUO5UM
>
> The issue is not that the regexp doesn't match in perl. The issue is
> that it is not compiled at all.
>
>>
>> Is there an example of a language that supports the full regexp power
>> in lookbehinds so we can look at their experiences with implementing
>> it?
>>
>> As far as I know Perl is the de facto standard.
>>
>>
>>
>> 2010/11/15 Marc Harter <wavded at gmail.com>:
>>> Brendan et al.,
>>>
>>> I have created a proposal for look-behind provided at this link:
>>>
>>>
>>> https://docs.google.com/document/pub?id=1EUHvr1SC72g6OPo5fJjelVESpd4nI0D5NQpF3oUO5UM
>>>
>>> I hope it is a format that will be helpful for discussion with TC39.
>>> Admittedly, I have never written one of these before, so I am completely
>>> open to any feedback or ways to improve the document from yourself or
>>> anyone else on this list.
>>>
>>> Marc
>>>
>>> On Sat, 2010-11-13 at 09:32 -0600, Marc Harter wrote:
>>>
>>> I would be game to write up a proposal for this.  When would you need
>>> this by to discuss w/ TC39?
>>>
>>> Thanks for your consideration,
>>> Marc
>>>
>>> On Nov 12, 2010, at 5:04 PM, Brendan Eich <brendan at mozilla.com> wrote:
>>>
>>>> On Nov 12, 2010, at 2:52 PM, Marc Harter wrote:
>>>>
>>>>> After considering all the breadth this discussion could take maybe it
>>>>> would be wise to just focus on one issue at a time.  For me, the biggest
>>>>> missing feature is lookbehind.  It's common to most languages
>>>>> implementing the Perl-RegExp-syntax, and it is very useful when looking for
>>>>> patterns that follow or don't follow a particular pattern.  I guess I'm
>>>>> confused why lookahead made it in but not lookbehind.
>>>>
>>>> This was 1998. The Netscape 4 work I did in '97 was based on Perl 4(!),
>>>> but we proposed to ECMA TC39 TG1 (the JS group -- things were different
>>>> then, including capitalization) something based on Perl 5. We didn't get
>>>> everything, and we had to rationalize some obvious quirks.
>>>>
>>>> I don't remember lookbehind (which emerged in Perl 5.005 in July '98)
>>>> being left out on purpose. Waldemar may recall more; I'd handed him the
>>>> JS keys inside netscape.com to go do mozilla.org.
>>>>
>>>> If you are game to write a proposal or mini-spec (in the style of ES5
>>>> even), let me know. I'll chat with other TC39'ers next week about this.
>>>>
>>>> /be
>>>>
>>>>
>>>>> What do people
>>>>> think about including this feature?
>>>>>
>>>>> Marc
>>>>>
>>>>> On Fri, 2010-11-12 at 16:20 -0600, Marc Harter wrote:
>>>>>> I will start out with a disclaimer.  I have not read both ECMAScript
>>>>>> specifications for 3 and now 5, so I admit that I am not an expert in
>>>>>> the spec itself, but as a user of JavaScript, I would like to get some
>>>>>> expert discussion over this topic as proposed enhancements to the
>>>>>> RegExp engine for Harmony.
>>>>>>
>>>>>> I will start with a list of lacking features in JS as compared to Perl
>>>>>> provided by (http://www.regular-expressions.info/javascript.html):
>>>>>>
>>>>>>     * No \A or \Z anchors to match the start or end of the string.
>>>>>>       Use a caret or dollar instead.
>>>>>>     * Lookbehind is not supported at all. Lookahead is fully
>>>>>>       supported.
>>>>>>     * No atomic grouping or possessive quantifiers
>>>>>>     * No Unicode support, except for matching single characters with
>>>>>>       \uFFFF
>>>>>>     * No named capturing groups. Use numbered capturing groups
>>>>>>       instead.
>>>>>>     * No mode modifiers to set matching options within the regular
>>>>>>       expression.
>>>>>>     * No conditionals.
>>>>>>     * No regular expression comments. Describe your regular
>>>>>>       expression with JavaScript // comments instead, outside the
>>>>>>       regular expression string.
>>>>>>
>>>>>> I don't know if all of these "need" to be in the language but there
>>>>>> have been some that I have personally wanted to use:
>>>>>>
>>>>>>     * Lookbehind!  ECMAScript fully supports lookahead, why not
>>>>>>       lookbehind?  Seems like a big hole to me.
>>>>>>     * Named capturing groups and comments (e.g.
>>>>>>       http://xregexp.com/syntax/).  Mostly I argue for this because
>>>>>>       it makes RegExp matches more self-documenting.  Regular
>>>>>>       Expressions are already cryptic as it is.
>>>>>>
>>>>>> I do like some of the new flags proposed in
>>>>>> (http://xregexp.com/flags/), though I haven't used them personally;
>>>>>> maybe that is also something for discussion.
>>>>>>
>>>>>> Marc Harter
>>>>>
>>>>> _______________________________________________
>>>>> es-discuss mailing list
>>>>> es-discuss at mozilla.org
>>>>> https://mail.mozilla.org/listinfo/es-discuss
>>>>
