Proposal for exact matching and matching at a position in RegExp

Erik Corry erik.corry at gmail.com
Fri Feb 12 10:12:51 PST 2010


2010/2/12 Andy Chu <andy at chubot.org>:
> On Thu, Feb 11, 2010 at 10:24 PM, Steve L. <steves_list at hotmail.com> wrote:
>> Outside of es-discuss, Brendan Eich asked for my thoughts on the merits of
>> \G vs. /y (intrinsically and in light of backward compatibility). I sent the
>> following reply, which he thought would be useful to forward to the list....
>>
>> I have no preference between /y and \G. When I first saw /y proposed for
>> ES4, I felt it needlessly reinvented the wheel given that \G had already
>> been implemented pretty widely. On the other hand, the fact that \G reaches
>> out of the search pattern to read a property of a regex or string feels a
>> bit too much like magic to me, and implementing it as a flag (/y) seems less
>> weird. An argument in favor of \G is that it's more versatile than /y since
>> it can be used anywhere in a regex pattern (e.g., at the start of an
>> alternation option), not just as the leading element.
>
> Agree that \G breaks some logical barrier.  I like to have a mental
> model of the implementation internals, and \G breaks that a bit.

\G is more flexible and it is rather similar to ^ conceptually.

That mental model happens to be out of sync with how regexps are
implemented.  The implicit .*? at the start of a regexp is actually
the fastest way to implement an unanchored search, since it uses the
fast internal search mechanisms of the regexp engine rather than an
external loop that repeatedly asks "does it match here?".
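
To make the contrast concrete, here is a rough sketch in JavaScript.
It assumes an engine that supports the sticky flag (/y) with the
Mozilla semantics discussed in this thread.  The names searchWithLoop
and searchBuiltin are made up for illustration, and the sketch only
shows the two strategies at the language level; it says nothing about
how any particular engine is built internally.

```javascript
// Hypothetical helper: an external loop that asks "does it match here?"
// at every position, using a sticky (/y) copy of the pattern.
function searchWithLoop(source, text) {
  const anchored = new RegExp(source, "y");
  for (let i = 0; i <= text.length; i++) {
    anchored.lastIndex = i;          // try to match exactly at position i
    const m = anchored.exec(text);
    if (m !== null) return m.index;  // first position where it matches
  }
  return -1;                         // no match anywhere
}

// Hypothetical helper: let the engine scan for the match itself,
// which is what the implicit leading .*? amounts to.
function searchBuiltin(source, text) {
  return text.search(new RegExp(source));
}
```

Both helpers should report the same position for the same pattern and
input; the difference is only in who drives the scan.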

Certainly, if the /y variant is adopted, V8 will implement it as if
it were specified with \G.  I.e., there would be two different regexps
behind the scenes, one with and one without /y.  This is similar to
what would happen if you could specify /i at match time instead of
compile time.
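
As a rough illustration of what "two different regexps" means when
seen from the language level (not a description of V8 internals), the
sketch below compiles the same pattern with and without the sticky
flag.  It assumes /y behaves as in Mozilla's implementation: the match
must begin exactly at lastIndex.

```javascript
const text = "foo bar";
const plain = /bar/g;   // searches forward starting from lastIndex
const sticky = /bar/y;  // must match exactly at lastIndex

plain.lastIndex = 0;
sticky.lastIndex = 0;
const plainHit = plain.exec(text);   // scans ahead and finds "bar"
const stickyHit = sticky.exec(text); // fails: "bar" does not start at 0

sticky.lastIndex = 4;
const stickyAt4 = sticky.exec(text); // succeeds: "bar" starts at 4
```

The pattern text is identical in both cases; only the anchoring
behaviour differs, which is why an engine may prefer to compile them
separately.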

> If compatibility with Mozilla is not an issue, I actually prefer
> Python's approach of .search() vs. .match().  It's not a part of the
> regex; it's not a property of the regex; it's how you *apply* the
> regex to a string.  Just like you can apply the same regex with
> .split() or .exec() or .replace().  They're orthogonal issues in my
> mind.
>
> Though as mentioned, gracefully upgrading with ES3-5 is an issue, so I
> could only think of .exec() and .execLeft() for a left-anchored match.
>
> One thing I didn't bring up is that Python actually has an "endpos"
> argument.  You do regex.search(s, 10, 20), and it will stop at
> position 20.  I couldn't think of a real use case for this.  But if
> anyone can think of one, that might be a consideration and sway things
> in favor of separate methods.
>
> Andy
>
>
>>
>> Note that \G works a bit differently across implementations. In some cases
>> it matches the start position of the current match (PCRE, Ruby), and
>> elsewhere it matches the end position of the previous match (Perl, Java,
>> .NET). Of course, this distinction only matters after a zero-length match
>> (since that increments the start position of the next search).
>>
>> Perl has extra functionality around \G that makes it more useful.
>> Specifically, the fact that the location associated with \G is an attribute
>> of target strings (pos()) means that multiple regexes with \G can match
>> against a string in turn and they'll each pick up where the others left off.
>> Combine this with Perl's /c modifier (which prevents failed matches from
>> resetting the \G location) and you can run multiple regexes with \G and /c
>> against a string and advance only when there's a match. Here's a crappy
>> example:
>>
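>> while ($html !~ /\G$/gc) {
>>   if ($html =~ /\G[^<&]+/gc) {          # plain text
>>       ...
>>   } elsif ($html =~ /\G<(\w+)[^>]*>/gc) { # tag; * so that <b> matches too
>>       ...
>>   } elsif ($html =~ /\G&#?\w+;/gc) {    # character reference
>>       ...
>>   } else {
>>       last;  # nothing matches here; stop instead of looping forever
>>   }
>> }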
>>
>> Sorry for the tangent, but I thought it might be helpful to describe how \G
>> is used elsewhere.
>>
>> Steven Levithan
>> http://blog.stevenlevithan.com
>>
>> --------------------------------------------------
>> From: "Steve L." <steves_list at hotmail.com>
>> Sent: Wednesday, February 10, 2010 10:46 AM
>> To: "Andy Chu" <andy at chubot.org>; "es-discuss" <es-discuss at mozilla.org>
>> Subject: Re: Proposal for exact matching and matching at a position in
>> RegExp
>>
>>>
>>>>> http://andychu.net/ecmascript/RegExp-Enhancements-2.html
>>>>>
>>>>> Basically the proposal is to add parameters which can override the
>>>>> internal state of the RegExp.
>>>>
>>>> Does anyone have any comments on this?
>>>>
>>>> Can I put it in a place where it will be considered for the next
>>>> ECMAScript?  The overall idea seems relatively uncontroversial since
>>>> it was already implemented by Mozilla (for the exact same reason).  I
>>>> have proposed a specific API enhancement too.
>>>
>>> I do not believe it was implemented for "the exact same reason." It seems
>>> you are merely looking for a way to match exactly at a given character
>>> position, and you correctly note that /y is not an elegant solution for
>>> this problem. However, although /y can be used to solve this problem, my
>>> understanding is that it was designed to work similarly to the \G regex
>>> token from Perl/PCRE/Java/.NET/etc. while tying in nicely with the
>>> lastIndex property. An important feature of /y (and \G from other regex
>>> flavors) is that, with global regexes (compiled with /g), each successive
>>> match must start where the last match ended. This is a very useful feature
>>> for writing some types of simple parsers, etc.  And in the process of
>>> smartly solving this problem, you get an inelegant solution to your
>>> problem as a side effect, free of charge.
>>>
>>> Steven Levithan
>>> http://blog.stevenlevithan.com
>>>
>>>
>>> _______________________________________________
>>> es-discuss mailing list
>>> es-discuss at mozilla.org
>>> https://mail.mozilla.org/listinfo/es-discuss
>>>
>>
