RegExp pet peeves (was: should calling RegExp constructor as function without arguments throw?)

Steven L. steves_list at hotmail.com
Wed Jan 14 17:05:48 PST 2009


Lasse R.H. Nielsen wrote:

> The only difference between an Atom and an Assertion is that the former
> can have a quantifier attached. There is absolutely no reason to put a
> quantifier on a look-ahead, and look-aheads are zero-width matches just
> like all assertions, so they would fit much better as assertions.
> Changing the grammar to make look-aheads actual assertions wouldn't even
> require implementations to change. It would just change quantified
> look-aheads from being standard to being an extension

There's no benefit to the end user from making this non-backwards-compatible change. In fact, if ES were to change backreferences to non-participating groups so that they fail rather than match the empty string (which I think would be very positive, and which has been discussed here previously), (?=(x)?) would become a potentially useful construct, and (?=(x))? would be a valid alternative syntax.
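
For reference, a minimal sketch of the behavior in question, assuming a standard ES engine (the variable names are purely illustrative): a backreference to a group that did not participate in the match succeeds by matching the empty string.

```javascript
// Current ES behavior: \1 refers to a group that did not participate,
// so it matches the empty string instead of failing.
const plain = /(x)?\1y/.exec("y");

// The same applies when the optional group sits inside a look-ahead.
const inLookahead = /(?=(x)?)\1y/.exec("y");

// When the group does participate, \1 must repeat what it captured.
const participated = /(?=(x)?)\1y/.exec("xy");

console.log(plain[0], plain[1]);               // prints: y undefined
console.log(participated[0], participated[1]); // prints: xy x
```

Under the alternative semantics mentioned above (fail instead of match empty), the first two calls would presumably return null, and only the third would still match.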

Lasse R.H. Nielsen wrote:

> The problem with back-references is that the requirement prevents
> a one-pass parser, because you need to scan the entire regexp to
> know whether a decimal escape is valid. Well, actually it wouldn't 
> be a problem if you didn't want to be compatible with all the 
> current implementations that treat invalid decimal escapes as 
> octal escapes - so you need to know whether a given decimal sequence
> is a valid back-reference in order to parse it as octal if it isn't
> valid.

I should look over the precise spec rules regarding this since I don't remember if/when \1 would/could be treated as an escaped literal "1" (and other related issues that I'm currently hazy on), but as you noted, the spec does not call for it to match octal character index 1. Forward references such as (\2two|(one))+ would actually be useful (matching "oneonetwo", and I believe this is what happens with .NET, Java, Perl, PCRE, and Ruby regexes) if ES didn't call for resetting the backreference value when repeating the group (an ES regex pet peeve of my own).
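
A small sketch of what the reset rule does to the forward-reference pattern above, assuming a spec-conforming ES engine (the variable names are illustrative only). Because the captures inside a quantified group are cleared at the start of each iteration, \2 is always unset when it is evaluated and so matches the empty string.

```javascript
// Group 2 is defined after the \2 that refers to it (a forward reference).
const forwardRef = /(\2two|(one))+/;

// ES resets the captures inside a quantified group on every iteration,
// so \2 is always unset (and therefore empty) when it is evaluated.
const onlyTwo = forwardRef.exec("two");
const full = forwardRef.exec("oneonetwo");

console.log(onlyTwo[0]); // prints: two
console.log(full[0]);    // prints: oneonetwo
console.log(full[2]);    // prints: undefined
```

So ES does match "oneonetwo", but only because \2 contributes nothing: "two" on its own matches as well, and group 2 ends up unset. In a flavor that keeps the captured value across iterations, \2 would be expected to carry "one" into the next iteration instead.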

I'm not proposing any ES changes in this email. I'm just saying that if ES regex changes are in order, I'd be proposing different ones from Lasse.
-------------------------
Steven Levithan
Baghdad, Iraq
http://stevenlevithan.com



From: brendan at mozilla.com
To: atwork at infimum.dk
Subject: Re: RegExp pet peeves (was: should calling RegExp constructor as function without arguments throw?)
Date: Wed, 14 Jan 2009 14:56:56 -0800
CC: es-discuss at mozilla.org

This is really a separate thread -- please change the subject accordingly.
See also past messages here, which linked to
http://web-graphics.com/2007/11/26/ecmascript-3-regular-expressions-a-specification-that-doesnt-make-sense/
http://blog.stevenlevithan.com/archives/npcg-javascript
If you want access, I will add you to http://bugs.ecmascript.org/ so you can file tickets on your peeves.
/be
On Jan 14, 2009, at 2:01 PM, Lasse R.H. Nielsen wrote:

On Wed, 14 Jan 2009 14:13:13 +0100, Hallvord R. M. Steen <hallvord at opera.com> wrote:

Apologies if this has already been covered, I tried
googling but found only tangentially related stuff about "/regexp/()"
syntax.

There are a few parts of the regexp syntax that wouldn't mind a look-over.

My two primary peeves are that look-aheads are Atoms, not Assertions,
and that back-references to captures occurring later in the source are
valid.

The only difference between an Atom and an Assertion is that the former
can have a quantifier attached. There is absolutely no reason to put a
quantifier on a look-ahead, and look-aheads are zero-width matches just
like all assertions, so they would fit much better as assertions.
Changing the grammar to make look-aheads actual assertions wouldn't even
require implementations to change. It would just change quantified
look-aheads from being standard to being an extension, like so many
other things in regexps already are. (The feature was only added to 
JSC recently - I'm guessing nobody had needed it).
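
A quick sketch of a quantified look-ahead, assuming a present-day web-compatible engine (the variable names are illustrative, and the Unicode-mode check relies on the "u" flag, which postdates this thread):

```javascript
// Non-Unicode mode: a quantifier on a look-ahead is accepted.
// Here the look-ahead is simply matched zero times.
const quantified = /(?=a)?b/.test("b");

// Unicode mode is stricter: the same pattern is rejected when constructed.
let rejectedInUnicodeMode = false;
try {
  new RegExp("(?=a)?b", "u");
} catch (e) {
  rejectedInUnicodeMode = e instanceof SyntaxError;
}

console.log(quantified, rejectedInUnicodeMode); // prints: true true
```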

The problem with back-references is that the requirement prevents
a one-pass parser, because you need to scan the entire regexp to
know whether a decimal escape is valid. Well, actually it wouldn't 
be a problem if you didn't want to be compatible with all the 
current implementations that treat invalid decimal escapes as 
octal escapes - so you need to know whether a given decimal sequence
is a valid back-reference in order to parse it as octal if it isn't
valid.
At least IE6 actually limits the valid back-references to the
captures that were started previous to the back-reference in the
source. That's a reasonable approach from a parsing perspective
(I'd be happy if that was what was required), but really you only 
need to be able to reference captures that can be completed at the 
point where they occur, i.e., where both the start and end parentheses 
of the capture being referenced occur prior to the back-reference in 
the source.
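
To make the parsing problem concrete, here is a minimal sketch, assuming an engine with the usual web-compatible octal fallback (the variable names are illustrative). The same \1 means different things depending on whether a capture exists anywhere in the pattern, including after the escape.

```javascript
// With one capture in the pattern, \1 is a back-reference.
const backRef = /^(a)\1$/.test("aa");

// With no capture at all, web-compatible engines read \1 as an octal
// escape for the character with code 1.
const octal = /^\1$/.test("\x01");

// The capture may come later in the source, so the parser cannot decide
// when it reaches \1: here it is a (forward) reference that matches the
// empty string, not an octal escape.
const forwardEmpty = /^\1(a)$/.test("a");
const forwardNotOctal = /^\1(a)$/.test("\x01a");

console.log(backRef, octal, forwardEmpty, forwardNotOctal);
// prints: true true true false
```

The last two lines are the case that forces a look at the whole pattern: only after seeing the later "(a)" can \1 be classified as a reference rather than an octal escape.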


/L 
-- 
Lasse R.H. Nielsen
Speaking only for myself ... if even that.
'Faith without judgement merely degrades the spirit divine'

