Regexp APIs and capturing group positions

Steven Levithan steves_list at hotmail.com
Thu Jul 12 16:00:29 PDT 2012


Definite +1 on adding some way to determine capturing group match start positions.

Mark Macdonald wrote:

> This makes it hard to write something like a regex coach, which takes an arbitrary regular expression and input string, and outputs a highlighted version of the input string showing where the capturing groups matched.

It makes it impossible to do accurately. There are a few crude approaches I’ve pursued in the past that make something like this work, but only with handcrafted regexes and only in some cases. E.g., you can use `str.replace()` to insert markers before and after backreferences, and then check for the position of your markers after the fact. But that certainly won’t work with an arbitrary regex fed in via a regex tester. Captured subpatterns might not even appear within the text of the match, due to lookahead.
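To make the lookahead point concrete, here is a minimal sketch (the variable names are only illustrative) in which the captured text lies entirely outside the overall match:

```javascript
// A capturing group inside a lookahead captures text that the
// overall match does not consume.
var lookaheadMatch = /foo(?=(bar))/.exec('foobar');

var wholeMatch = lookaheadMatch[0]; // 'foo'
var groupOne = lookaheadMatch[1];   // 'bar'

// The overall match ends at index 3, which is exactly where the
// captured 'bar' begins, so searching for 'bar' inside 'foo' fails.
var matchEnd = lookaheadMatch.index + wholeMatch.length;
var foundInsideMatch = wholeMatch.indexOf(groupOne) !== -1; // false
```

Any technique that looks for group text only within `match[0]` cannot locate this capture.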

There are several JavaScript regex testers that try to report backreference positions, but they are incredibly easy to fool. See, e.g., http://leaverou.github.com/regexplained/ (plus https://github.com/LeaVerou/regexplained/issues/7 ) and http://www.gethifi.com/tools/regex .
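As an illustration of how easily a position guess can go wrong, here is a sketch of one naive heuristic. The helper `guessGroupStart` is hypothetical, and I am not claiming that the testers above work this way; it is just one obvious approach that fails on a very simple input:

```javascript
// Hypothetical helper: guess a group's start by searching for the
// group's text inside the overall match. Not accurate in general.
function guessGroupStart(match, n) {
  if (match[n] === undefined) {
    return -1; // group did not participate
  }
  return match.index + match[0].indexOf(match[n]);
}

// The greedy .* consumes 'aX', so group 1 really matched the final
// 'a' at index 2, but the heuristic finds the first 'a' at index 0.
var fooledMatch = /.*(a)/.exec('aXa');
var guessedStart = guessGroupStart(fooledMatch, 1);
var actualStart = 2; // worked out by hand for this input
```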

As for the proposed implementation, I have a few concerns:

1. The main issue I see is that the proposal doesn’t provide a clean way to support named backreferences, should a future version of ES add named capturing groups. Future ES might want to share the proposed `captures` array or object for providing named backreferences, as well as their match positions. http://xregexp.com/syntax/named_capture_comparison/ shows where named backreferences are stored in various regex flavors (usually accessible via a method named `group()` or `groups()`, although XRegExp stores named backreference properties directly on the result array).

2. Keep in mind that, since `str.match(nonglobalregex)` is an alias of `regex.exec(str)`, anything added to `regex.exec()` should also be added to the nonglobal `str.match()` overload.

3. IMO, the name `captures` is misleading, given the specific proposal, since it seems to suggest that it stores the backreferences themselves, rather than their start positions.

4. I dislike the idea of excluding backreference zero (i.e., the entire match) from any result array.

5. The proposal does not mention what should happen when trying to access the start position of a nonparticipating capturing group. Presumably, the value should be `null` or `undefined`.
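Points 2 and 5 can be seen with the existing APIs. A small sketch (variable names are illustrative): a nonglobal `str.match()` returns the same kind of result as `regex.exec()`, and a nonparticipating group already shows up as `undefined` in the result array, which is one reason `undefined` would be a natural value for its start position.

```javascript
var alternation = /(a)|(b)/; // nonglobal; only one group can participate

var viaExec = alternation.exec('b');
var viaMatch = 'b'.match(alternation);

// Group 1 did not participate in the match, group 2 did.
var groupOneValue = viaExec[1]; // undefined
var groupTwoValue = viaExec[2]; // 'b'

// Both calls report the same match position and the same entries.
var sameIndex = viaExec.index === viaMatch.index;
var sameLength = viaExec.length === viaMatch.length;
```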

Thanks for mentioning the prior art of `java.util.regex.Matcher.start()` and Python's `re.MatchObject.start()`. They offer an alternative design that might be better, assuming that adding a `start()` method to the array returned by `exec()` is an acceptable solution. If future ES adopts named capture, it would be easy to change such a method to accept strings in addition to integers. (For whatever reason [probably just omission], Java 7’s `Matcher.start()` doesn’t accept strings, even though Java 7 supports named capture.)

Another potential way to do this might be to change the string primitives in the `exec()` result array to `String` objects that can store properties. Then you could do something like `/.(.)/.exec('foo')[1].start === 1`.
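A rough sketch of what that would mean in practice, built by hand since no engine does this: the `start` property here is purely hypothetical. It also shows the cost, since a `String` object does not behave like a primitive under `typeof` or strict equality.

```javascript
// Hand-built stand-in for what exec() might return under this idea.
var capturedText = new String('o'); // a String object, not a primitive
capturedText.start = 1;             // hypothetical property

var storedStart = capturedText.start;            // 1
var typeOfCapture = typeof capturedText;         // 'object', not 'string'
var looselyEqual = capturedText == 'o';          // true
var strictlyEqual = capturedText === 'o';        // false
var backToPrimitive = capturedText.valueOf();    // 'o'
```

Existing code that checks `typeof match[1] === 'string'` or compares backreferences with `===` would be affected by such a change.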

--Steven Levithan


From: Mark Macdonald 
Sent: Thursday, July 12, 2012 2:26 PM
To: es-discuss at mozilla.org 
Subject: Regexp APIs and capturing group positions

In ES 5.1, the regular expression APIs do not expose the index at which a capturing group matched. The RegExp.prototype.exec(string) function returns an Array giving (among other things) the text matched by capturing groups, but does not give the positions of the captured text within the input string.

For example, consider this code using the current regex APIs:


var match = /(fox).*(dog)/.exec("The quick brown fox jumps over the lazy dog");
match[1]; // "fox"
match[2]; // "dog"


We want to get this:
"fox" at index 16
"dog" at index 40

But there is no way to obtain the indices 16, 40 from the match object (or any other API I'm aware of). This makes it hard to write something like a regex coach, which takes an arbitrary regular expression and input string, and outputs a highlighted version of the input string showing where the capturing groups matched.

Proposal: When RegExp.prototype.exec(string) returns a non-null value, the returned object shall have a property named "captures", which is an Array. The value of captures[n] is the index at which the nth capturing group's match begins. As usual, groups are numbered from 1. The captures array does not have a "0" property (it would always be equal to the "index" property of the match object, and thus redundant).

Proposed code:


var match = /(fox).*(dog)/.exec("The quick brown fox jumps over the lazy dog");
match.captures[1]; // 16
match.captures[2]; // 40


This (combined with the group text from the match object) gives you enough information to enumerate the captured regions of the input string.
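A sketch of that enumeration, under the assumption that the start positions are available. Since "captures" is only proposed, the positions below are filled in by hand, and the helper name capturedRegions is hypothetical:

```javascript
// Hypothetical helper: combine group text with group start positions
// to produce [start, end) regions of the input string.
function capturedRegions(match, captures) {
  var regions = [];
  for (var n = 1; n < match.length; n++) {
    // Skip groups that did not participate or have no known position.
    if (match[n] === undefined || captures[n] === undefined) {
      continue;
    }
    regions.push({
      group: n,
      start: captures[n],
      end: captures[n] + match[n].length
    });
  }
  return regions;
}

var input = "The quick brown fox jumps over the lazy dog";
var foxDogMatch = /(fox).*(dog)/.exec(input);

// Stand-in for the proposed match.captures; no "0" entry, as proposed.
var handFilledCaptures = [];
handFilledCaptures[1] = 16;
handFilledCaptures[2] = 40;

var regions = capturedRegions(foxDogMatch, handFilledCaptures);
// regions[0] spans 16..19 ("fox"), regions[1] spans 40..43 ("dog")
```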

Prior art: Java's java.util.regex.Matcher.start() [1], Python's re.MatchObject.start() [2].

Comments, suggestions?

Mark

[1] http://docs.oracle.com/javase/1.5.0/docs/api/java/util/regex/Matcher.html#start%28int%29
[2] http://docs.python.org/library/re.html#match-objects
