Regex: How should backreferences contained in the capturing match they reference to work?

liorean liorean at gmail.com
Fri Sep 14 12:39:15 PDT 2007


> On 9/13/07, liorean <liorean at gmail.com> wrote:
> > That the spec doesn't match expectations and that there are behaviours
> > that do make sense to replace it with, coupled with the fact there
> > seems to be no obvious compatibility problem with changing it
> > (otherwise JScript and JavaScriptCore surely would have been changed
> > to match the ES3 behaviour), makes no compelling reason?

On 13/09/2007, Lars T Hansen <lth at acm.org> wrote:
> I hope I'm not being overly flip when I say that it is the spec that
> circumscribes the set of expectations you are allowed to have.  And
> the spec is entirely clear here, ie what it says is not in question,
> even if it's not always easy to find out what it says.  A somewhat
> determined developer who needs to rely on the behavior can discover
> what behavior is expected (though he obviously can't trust the
> implementations to get it right).

Expectations are never formed from the spec, and that is especially
true of ECMAScript, since the spec is quite hard for laymen to read.
Expectations are formed from a logical application of the perceived
concepts underlying the language features. Regex is a special case
here as well, since most regex articles and tutorials are not
particularly language-specific; they are typically specific only to
the generalised flavour (the Perl flavour or the POSIX flavour,
respectively). In fact, for a long time the only JavaScript-specific
regex article that a quick search could find was the one I wrote
for eVolt in 2002.


> It's your contention that developers will be surprised by the current
> behavior.  What can I say? Edge cases will always surprise somebody.
> I don't feel surprised that captures are thrown away when a repeat
> matcher repeats, even if you do.

Maybe you don't. But almost all regex implementations that implement
capturing submatches, other than those specifically implementing the
ES3 algorithm, agree with me.
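
To make the disputed case concrete, here is a small sketch of what the
ES3 algorithm prescribes when a repeat matcher repeats. The first
pattern is the example the ES3 spec itself uses in its notes on
repetition; the second is a reduced case of my own. The results in the
comments are what a spec-conforming engine should give. As noted in
this thread, some engines deviate, so treat them as the spec's answer
and not as a survey of implementations.

```javascript
// Captures inside a quantified group are reset to undefined at the
// start of every iteration, so only the last iteration's captures survive.
const specExample = /(z)((a+)?(b+)?(c))*/.exec("zaacbbbcac");
// ES3 result: ["zaacbbbcac", "z", "ac", "a", undefined, "c"]
// The (b+) capture "bbb" from the second iteration is thrown away,
// because (b+) does not participate in the third iteration.
console.log(specExample);

// Reduced case: (a) matches in iteration one, then the group repeats
// and takes the "b" branch, which resets capture 1.
const reduced = /(?:(a)|b)+/.exec("ab");
// ES3 result: ["ab", undefined]
console.log(reduced);
```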

> Warnings are pointless on the web, and the language has no warnings
> now.  If we're going to do something about backreferences inside their
> own captures, it would have to be to outlaw them and require a syntax
> error.

Warnings are pretty much useless today, yes. But you can always hope
for better debugging and development tools.

> I doubt we're inclined to make any incompatible changes to the
> matching algorithm, and (again) I don't really see the point.

That's why I stressed that JScript and JavaScriptCore, and according
to Brendan above also Tamarin, fail to implement this according to
ES3. It's not an incompatible change - it can't be if the two most
common engines are incompatible with the spec. (I gather that'd be
JScript and Tamarin.)  That pretty much guarantees that there's no
code relying on the spec behaviour.




On 14/09/2007, Lars T Hansen <lth at acm.org> wrote:
> I oppose this on merely Perl compatibility grounds -- ES regexes have
> since evolved through other influences -- but sure, we'll give it a
> fair hearing.

I'm glad to hear that it'll get a fair hearing.
Just to throw some more logs on the fire, however: these all agree
that backreferences to failed submatches fail to match anything,
including the empty string:

JGsoft, .NET, Java, Perl, PCRE, Python, Ruby, Tcl ARE, POSIX BRE

(according to <uri:http://www.regular-expressions.info/refflavors.html>;
I haven't actually tested all of those)

You can of course argue that "failed submatches fail to match" is not
the same as "not-yet-captured submatches fail to match" - that would
require some testing, and the only one that I have accessible
personally right now is Perl.
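
For reference, a sketch of the three situations as the ES3 algorithm
treats them: a failed submatch, a not-yet-captured submatch (forward
reference), and a backreference inside the capture it refers to. The
patterns are minimal cases of my own, not taken from any test suite.
In all three, the ES3 rule is that a backreference to an undefined
capture matches the empty string, so the overall match succeeds.

```javascript
// Failed submatch: (a)? does not participate, so capture 1 is
// undefined and \1 matches the empty string.
const failed = /(a)?\1b/.exec("b");
// ES3 result: ["b", undefined]
console.log(failed);

// Not-yet-captured submatch: \1 is evaluated before group 1 has
// been matched at all.
const forward = /\1(a)/.exec("a");
// ES3 result: ["a", "a"]
console.log(forward);

// Backreference inside its own group: capture 1 is only set when
// the group closes, so \1 still sees undefined.
const nested = /(a\1)/.exec("aa");
// ES3 result: ["a", "a"] (the second "a" is never consumed)
console.log(nested);
```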




On 14/09/2007, Garrett Smith <dhtmlkitchen at gmail.com> wrote:
> Your post on the ES4 list would make for an excellent tutorial on RegExps.

I've been asked to do RegExp tutorials before...

<uri:http://www.evolt.org/article/Regular_Expressions_in_JavaScript/17/36435/>

> I think a lot of web devs, including myself, don't have such deep
> knowledge of RegExps. The differences in script engines show that it's
> more than web devs who misunderstand the spec. Your examples,
> including the ones on your blog, indicate that RegExps in js have some
> counter-intuitive behavior.

Yes. I've been meaning to flesh that out in another blog entry.
Probably will have to wait until next week however.

> I think it would be excellent material for an article on
> developer.mozilla.org. Something to consider, at least.

Ah. The problem of choosing where to place your articles... in fact,
I'd probably prefer to place any future articles on dev.opera.com (I
think Opera is seriously underappreciated by developers, so anything
to help...). Or on eVolt, since I like that place.







I think I can sum up the change I consider appropriate as follows:
- a backreference to an undefined capture should fail to match instead
  of matching the empty string
- captures should only be set to undefined in two cases: when the
  regex matching is started, and inside a negative lookahead
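
As a rough illustration of the two cases in which a capture would
still be undefined, here is a sketch with patterns of my own. The
results in the comments are the current ES3 behaviour. What the last
pattern would do under the change above is my reading of the first
point, not something any engine has been confirmed to do.

```javascript
// Case 1: the group never participates, so its capture keeps the
// undefined value it was given when matching started.
const never = /(a)|b/.exec("b");
// ES3 result: ["b", undefined]
console.log(never);

// Case 2: (a) matches inside the negative lookahead, but "b" then
// fails, so the lookahead succeeds and its captures are discarded.
const neg = /(?!(a)b)a/.exec("ac");
// ES3 result: ["a", undefined]
console.log(neg);

// A backreference to such a capture currently matches the empty string.
const negRef = /(?!(a)b)\1a/.exec("ac");
// ES3 result: ["a", undefined]
// Assumption: with the proposed change, \1 would fail here and
// exec would return null.
console.log(negRef);
```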
-- 
David "liorean" Andersson



More information about the Es4-discuss mailing list