Suggested RegExp Improvements

Erik Corry erik.corry at gmail.com
Tue Nov 16 13:54:43 PST 2010


2010/11/16 Mike Samuel <mikesamuel at gmail.com>:
> 2010/11/16 Erik Corry <erik.corry at gmail.com>:
>> 2010/11/15 Marc Harter <wavded at gmail.com>:
>>> On Mon, 2010-11-15 at 14:06 +0100, Erik Corry wrote:
>>>
>>>> Your proposal seems to allow variable length lookbehind.  This isn't
>>>> allowed in perl as far as I know.  I just tried the following:
>>>
>>>>  perl -e '"foobarbaz" =~ /a(?<=(ob|bab))/;'
>>>
>>>> which gives an error on perl5.  I think if we are going to allow
>>>> variable length lookbehind we should first find out why they don't
>>>> have it in perl.  I think the implementation is a little tricky if you
>>>> want to support the full regexp language in lookbehinds.
>>>
>>> This was not my intention.  I am proposing zero-width lookbehind, which
>>> would not allow for the case you specified above.  I will update the
>>
>> The issue is not with the number of characters consumed by the
>> assertion.  This is indeed zero.  The issue is with the width of the
>> text matched by the disjunction inside the brackets.  This is not any
>> disjunction, but rather a restricted part of the regexp language that
>> can only match a particular number of characters.
>>
>> It seems the .Net regexp library is able to handle arbitrary content
>> in a lookbehind.  It is almost the only one.
>>
>> See http://www.regular-expressions.info/lookaround.html#lookbehind for
>> more details.
>>
>> We could add this feature to JS.  As far as I can work out it
>> presupposes the ability to reverse an arbitrary regexp and run it
>> backwards (stepping back and backtracking forwards).  I don't think we
>> should add it accidentally though, and perhaps the proposer should be
>> the first to implement it.
>
> Don't you already have to do that to efficiently handle a regexp that
> ends at the end of the input (in JS, a non multiline $, or \z in
> java.util.regex parlance)?

V8 doesn't have a general form of that optimization.  Do the others?

> If you have the whole input string available in memory, and are trying
> to figure out whether a lookbehind (?<=x) matches at position p, can't
> you just test /(?:x)$/ against the prefix of the input of length p?
>
>
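A minimal sketch of the prefix test suggested above, in JavaScript.  The
helper name lookbehindMatchesAt and its parameters are hypothetical, chosen
only for illustration.  It shows what the test computes, and makes no claim
about how efficiently an engine would run the anchored pattern.

```javascript
// Hypothetical helper: decide whether the lookbehind (?<=x) would succeed
// at position p by testing /(?:x)$/ against the first p characters.
function lookbehindMatchesAt(input, p, source) {
  // The non-capturing group keeps a top-level "|" inside the anchored part.
  // No "m" flag is used, so "$" only matches at the very end of the prefix.
  const anchored = new RegExp("(?:" + source + ")$");
  return anchored.test(input.slice(0, p));
}

// The prefix of length 4 of "foobarbaz" is "foob", which ends with "ob".
console.log(lookbehindMatchesAt("foobarbaz", 4, "ob|bab")); // true
// The prefix of length 3 is "foo", which ends with neither alternative.
console.log(lookbehindMatchesAt("foobarbaz", 3, "ob|bab")); // false
```

Note that the alternatives "ob" and "bab" have different lengths, so this
sketch handles the variable-width case by leaning on the existing forward
matcher.  A backtracking engine would presumably try each start position in
the prefix, which is the cost the end-anchored optimization is meant to avoid.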
>>> proposal.  It is my understanding that lookahead as implemented in
>>> ECMAScript also is zero-width and not variable.  This is also how Perl has
>>> implemented lookbehind.
>>>
>>> http://perldoc.perl.org/perlre.html#Extended-Patterns
>>>
>>> Updated Proposal:
>>> https://docs.google.com/document/pub?id=1EUHvr1SC72g6OPo5fJjelVESpd4nI0D5NQpF3oUO5UM
>>
>> The issue is not that the regexp doesn't match in perl. The issue is
>> that it is not compiled at all.
>>
>>>
>>> Is there an example of a language that supports the full regexp power
>>> in lookbehinds so we can look at their experiences with implementing
>>> it?
>>>
>>> As far as I know Perl is the de facto standard.
>>>
>>>
>>>
>>> 2010/11/15 Marc Harter <wavded at gmail.com>:
>>>> Brendan et al.,
>>>>
>>>> I have created a proposal for look-behind provided at this link:
>>>>
>>>>
>>>> https://docs.google.com/document/pub?id=1EUHvr1SC72g6OPo5fJjelVESpd4nI0D5NQpF3oUO5UM
>>>>
>>>> I hope it is a format that will be helpful for discussion with TC39.
>>>> Admittedly, I have never written one of these before so am completely open
>>>> to any feedback or ways to improve the document from yourself or anyone
>>>> else on this list.
>>>>
>>>> Marc
>>>>
>>>> On Sat, 2010-11-13 at 09:32 -0600, Marc Harter wrote:
>>>>
>>>> I would be game to write up a proposal for this.  When would you need
>>>> this by to discuss w/ TC39?
>>>>
>>>> Thanks for your consideration,
>>>> Marc
>>>>
>>>> On Nov 12, 2010, at 5:04 PM, Brendan Eich <brendan at mozilla.com> wrote:
>>>>
>>>>> On Nov 12, 2010, at 2:52 PM, Marc Harter wrote:
>>>>>
>>>>>> After considering all the breadth this discussion could take maybe it
>>>>>> would be wise to just focus on one issue at a time.  For me, the biggest
>>>>>> missing feature is lookbehind.  It's common to most languages
>>>>>> implementing the Perl-RegExp-syntax, it is very useful when looking for
>>>>>> patterns that follow or don't follow a particular pattern.  I guess I'm
>>>>>> confused why lookahead made it in but not lookbehind.
>>>>>
>>>>> This was 1998, Netscape 4 work I did in '97 was based on Perl 4(!), but
>>>>> we proposed to ECMA TC39 TG1 (the JS group -- things were different then,
>>>>> including capitalization) something based on Perl 5. We didn't get
>>>>> everything, and we had to rationalize some obvious quirks.
>>>>>
>>>>> I don't remember lookbehind (which emerged in Perl 5.005 in July '98)
>>>>> being left out on purpose. Waldemar may recall more, I'd handed him the
>>>>> JS keys inside netscape.com to go do mozilla.org.
>>>>>
>>>>> If you are game to write a proposal or mini-spec (in the style of ES5
>>>>> even), let me know. I'll chat with other TC39'ers next week about this.
>>>>>
>>>>> /be
>>>>>
>>>>>
>>>>>> What do people
>>>>>> think about including this feature?
>>>>>>
>>>>>> Marc
>>>>>>
>>>>>> On Fri, 2010-11-12 at 16:20 -0600, Marc Harter wrote:
>>>>>>> I will start out with a disclaimer.  I have not read both ECMAScript
>>>>>>> specifications for 3 and now 5, so I admit that I am not an expert in
>>>>>>> the spec itself but as a user of JavaScript, I would like to get some
>>>>>>> expert discussion over this topic as proposed enhancements to the
>>>>>>> RegExp engine for Harmony.
>>>>>>>
>>>>>>> I will start with a list of lacking features in JS as compared to Perl
>>>>>>> provided by (http://www.regular-expressions.info/javascript.html):
>>>>>>>
>>>>>>>     * No \A or \Z anchors to match the start or end of the string.
>>>>>>>       Use a caret or dollar instead.
>>>>>>>     * Lookbehind is not supported at all. Lookahead is fully
>>>>>>>       supported.
>>>>>>>     * No atomic grouping or possessive quantifiers
>>>>>>>     * No Unicode support, except for matching single characters with
>>>>>>>       \uFFFF
>>>>>>>     * No named capturing groups. Use numbered capturing groups
>>>>>>>       instead.
>>>>>>>     * No mode modifiers to set matching options within the regular
>>>>>>>       expression.
>>>>>>>     * No conditionals.
>>>>>>>     * No regular expression comments. Describe your regular
>>>>>>>       expression with JavaScript // comments instead, outside the
>>>>>>>       regular expression string.
>>>>>>>
>>>>>>> I don't know if all of these "need" to be in the language but there
>>>>>>> have been some that I have personally wanted to use:
>>>>>>>
>>>>>>>     * Lookbehind!  ECMAScript fully supports lookahead, why not
>>>>>>>       lookbehind?  Seems like a big hole to me.
>>>>>>>     * Named capturing groups and comments (e.g.
>>>>>>>       http://xregexp.com/syntax/).  Mostly I argue for this because
>>>>>>>       it makes RegExp matches more self-documenting.  Regular
>>>>>>>       Expressions are already cryptic as it is.
>>>>>>>
>>>>>>> I do like some of the new flags proposed in
>>>>>>> (http://xregexp.com/flags/), though I haven't used them personally;
>>>>>>> maybe that is also something for discussion.
>>>>>>>
>>>>>>> Marc Harter
>>>>>>
>>>>>> _______________________________________________
>>>>>> es-discuss mailing list
>>>>>> es-discuss at mozilla.org
>>>>>> https://mail.mozilla.org/listinfo/es-discuss
>>>>>
>>>>

