RegExp syntax suggestion: allow CharacterClassEscape in CharacterRange

Gavin Barraclough barraclough at
Tue Dec 28 17:56:47 PST 2010

Hi Lasse,

Thank you for the comprehensive list!

In all these cases we should perhaps first attempt to determine whether these could just be considered spec compliance issues that ought to be fixed in implementations.  If retaining compatibility with existing code is going to require continued support for syntax beyond that currently covered by the spec, then it would seem beneficial to try to add this.  I'd be interested in discussing some proposals to address the issues you've listed.

On Dec 9, 2010, at 12:33 PM, Lasse Reichstein wrote:
> // Invalid ControlEscape/IdentityEscape character treated as literal.
>  /\z/;  // Invalid escape, same as /z/

> To match the current behavior, IdentityEscape shouldn't exclude all of IdentifierPart,
> but only the characters that already mean something else.

Agreed - I expect this change will be necessary to retain compatibility with existing code.  String literals define all source characters not used in escape sequences - bar decimal digits - to be valid identity escapes (NonEscapeCharacter), so it would only seem consistent to do the same for RegExps.  A minimal but slightly messy fix would be:

IdentityEscape :: SourceCharacter but not DecimalDigit nor ControlEscape nor CharacterClassEscape nor any of:
	b B c x u

ClassEscape ::

Adding semantics such that ClassEscape :: B generates the character set { B }.  A slightly cleaner solution may be to restructure this bit of BNF to give ClassEscapes and AtomEscapes separate copies of IdentityEscape such that special handling for /[\B]/ is not necessary:

AtomEscape ::

AtomIdentityEscape :: SourceCharacter but not DecimalDigit nor ControlEscape nor CharacterClassEscape nor any of:
	b B c x u

CharacterEscape ::
	c ControlLetter

ClassEscape ::

ClassIdentityEscape :: SourceCharacter but not DecimalDigit nor ControlEscape nor CharacterClassEscape nor any of:
	b c x u

Even if we did not need to liberalize IdentityEscape to this extent, the current definition in the spec seems a little unworkable, particularly since it appears to prohibit /\$/.  Defining IdentityEscape in terms of IdentifierPart also appears to have the consequence of requiring users to remember which set of punctuation falls within the UnicodeConnectorPunctuation set to correctly determine which punctuation marks they can validly escape this way!  I would suggest a still restrictive, but more practical, definition for IdentityEscape would be:

IdentityEscape ::
	SourceCharacter but not ControlLetter or DecimalDigit

> // Incomplete/Invalid ControlEscape treated as either "\\c" or "c"
>  /\c/;  // same as /c/ or /\\c/
>  /\c2/;  // same as /c2/ or /\\c2/
> // Incomplete HexEscapeSequence escape treated as either "\\x" or "x".
>  /\x/;  // incomplete x-escape
>  /\x1/;  // incomplete x-escape
>  /\x1z/;  // incomplete x-escape
> // Incomplete UnicodeEscapeSequence escape treated as either "\\u" or "u".
>  /\u/;  // incomplete u-escape
>  /\uz/;  // incomplete u-escape
>  /\u1/;  // incomplete u-escape
>  /\u1z/;  // incomplete u-escape
>  /\u12/;  // incomplete u-escape
>  /\u12z/;  // incomplete u-escape
>  /\u123/;  // incomplete u-escape
>  /\u123z/;  // incomplete u-escape

> Most of the RegExps treat a malformed (start of a multi-character) escape sequence
> as a simple identity escape or octal escape, and extends identity escapes to all characters
> that doesn't already have another meaning (ControlEscape, CharacterClassEscape or
> one of c, x, u, or b, and B outside a CharacterClass).

Agreed - though I think it would be preferable to try generating the syntax error in these cases, if this is feasible.

Given that these invalid sequences are currently being handled with some inconsistency across browsers, this might a good hint that we can treat this as a compliance bug that we can just fix in implementations.  Hex and unicode escapes are also present in string literals, and malformed escapes in string literals should also generate a syntax error (browser behaviour appears inconsistent here), and it would seem preferable to retain consistency between string & RegExp literals.

If we do need to be permissive in these cases, then I guess this requires up to four characters of lookahead.  I believe invalid escapes - if not a syntax error - should be handled as per IdentityEscapes (i.e. the backslash is not retained - /\c/ as /c/, not /\\c/).

> Allowing /\c2/ to match "c2", but requiring /\CB/ to match "\x02" seems like it would
> be better explained in prose than in the BNF.

Well, if we're going to allow /\c2/ to match "c2" then it seems we should also allow /\c/ to match "c", and I think that will require changes to the BNF.  Also, a prose explanation may be a little trickier than it appears at first blush, when you take quantifiers into account (if one trivially were to write semantics that would generate a nested pair of matchers for /\c2/ (to first match for { c }, then {2}), then I think this would result in /\c2*/ being handled as /(?:c2)*/ when the quantifier is applied - I'm sure there would be ways to avoid this, but would probably make a prose approach more complex).  And this might not actually be too bad to deal with in BNF anyway - I believe this ought to work?:

CharacterEscape ::
	c [lookahead ∉ ControlLetter]
	c ControlLetter

Then we would just need to add semantics to define the result of c [lookahead ∉ ControlLetter] to be the character set { c }.

> // Bad quantifier range:
>  /x{z/;  // same as /x\{z/
>  /x{1z/;  // same as /x\{1z/
>  /x{1,z/;  // same as /x\{1,z/
>  /x{1,2z/;  // same as /x\{1,2z/
>  /x{10000,20000z/;  // same as /x\{10000,20000z/
> // Notice: It needs arbitrary lookahead to determine the invalidity,
> // except Mozilla that limits the numbers.

Permitting open braces that don't form valid range quantifiers (per current browser implementations) leads to a couple of odd quirks:
- it seems inconsistent, given that syntax errors are generated for unclosed brackets and parentheses.
- it means that a correctly formed quantifier without an atom will throw a syntax error, but a malformed one will not.  E.g. /{0}/ throws a syntax error, /{0{/ does not.

If we can't change implementations to adopt spec behavior here, then one possible simple solution (assuming it is web compatible) might be to add the following pattern to the 'Atom' rule:

Atom ::
	\ AtomEscape
	( Disjunction )
	( ? : Disjunction )
	{ [lookahead ∉ DecimalDigit]

This would allow an unclosed brace in many situations, without requiring arbitrary lookahead, and would not be ambiguous with range quantifiers (since these always require the open brace to be followed by a decimal digit).
The drawback I see with this approach would be that it might appear arbitrary to accept /{a/ but reject /{1/.  This said, the spec already is more prone to generate syntax errors around decimal digits in literals, for example escaped decimal digits being a syntax error in string literals and RegExp character classes, so perhaps this would be okay.

> // Zero-initialized Octal escapes.
>  /\012/;    // same as /\x0a/

I've only tested this one on a couple of browsers, but those tested permit any number of zeros, followed by any octal sequence with a decimal value up to 255.  Here's a possible set of grammar rules that I think might be okay:

DecimalEscape ::
	Zeros OptionalLatin1OctalValue
	NonZeroDigit DecimalDigits 

Zeros ::
	Zeros 0

OptionalLatin1OctalValue ::
	[lookahead ∉ OctalDigit]
	OneToThree [lookahead ∉ OctalDigit]
	FourToSeven [lookahead ∉ OctalDigit]
	OneToThree OctalDigit [lookahead ∉ OctalDigit]
	FourToSeven OctalDigit
	OneToThree OctalDigit OctalDigit

OneToThree :: one of
	1 2 3

FourToSeven :: one of
	4 5 6 7

(Alternatively this syntax could be tightened up a little to restrict octal values to a single leading zero, if implementations can do so too).

> // Nonexisting back-references treated as octal escapes:
>  /\5/;  // same as /\x05/

I really hope we can remove this 'feature' from implementations.  I'm going to hold off for now on attempting to propose grammar.  This one does look a little complex to write BNF for, I guess we can attampt a prose solution, though again this will run into problems in the case of quantifiers.

(OOI there is some inconsistency in how this is handled in browsers - for example, is /\777/ equivalent to /\x3f/ or /\x3f7/).

> // Invalid PatternCharacter accepted unescaped
>  /]/;
>  /{/;
>  /}/;

'{' is addressed above, fixing the spec to permit ']' and '}' would be easy, and would mean modifying this rule:

PatternCharacter :: SourceCharacter but not any of:
	^ $ \ . * + ? ( ) [ ] { } |

However it would introduce something of an inconsistency if we permit unmatched ']' and '}" but not ')'.

* One possibility would be to relax both the spec and implementations, and allow all types of close brackets - ']' and '}" and ')' (though I think permitting ')' might require a little ugly duplication in the BNF).
* Depending on the resolution to bad quantifier ranges, above, another option might be to permit an unmatched close brace, but start generating a syntax error for unmatched close brackets (and retain the error for unmatched closed parentheses) - i.e. if we do permit unmatched open brace, then perhaps we should also be permitting unmatched close brace.

> // Bad escapes also inside CharacterClass.
>  /[\z]/;
>  /[\c]/;
>  /[\c2]/;
>  /[\x]/;
>  /[\x1]/;
>  /[\x1z]/;
>  /[\u]/;
>  /[\uz]/;
>  /[\u1]/;
>  /[\u1z]/;
>  /[\u12]/;
>  /[\u12z]/;
>  /[\u123]/;
>  /[\u123z]/;
>  /[\012]/;
>  /[\5]/;
> // And in addition:
>  /[\B]/;
>  /()()[\2]/;  // Valid backreference should be invalid.

In the case of hex and unicode escapes, I believe handling of ClassEscapes matches that for AtomEscapes, and this should remain the case.

I'm not sure about other browsers, but I am aware that Safari and Firefox both extend the syntax of control letters in character classes to include DecimalDigits and '_', so /[\c2]/ is accepted and is regarded as a control letter, unlike /\c2/ (which neither generates a syntax error, nor is treated as a control letter) - I believe that is both browsers /\c2/ is equivalent to /c2/ but /[\c2]/ is equivalent to /[\x12]/ (or /\x12/).  Hopefully this extension can simply be removed – having incompatible syntax here between ClassEscapes and AtomEscapes seems unhelpful.

Since most browsers seem to permit decimal characters as an IdentityEscape, though these are permitted in string literals, so it would be a shame to add support if not strictly necessary.  Additionally implementations permit octal escapes in ClassEscapes - we should keep this in sync with whatever decision we make for octal in AtomEscapes.  Assuming some of the additional grammar proposed above (the BNF for octal, and splitting IdentityEscape for AtomEscape and ClassEscape) the following may be a workable solution if we need to support these features:

ClassEscape ::
	Zeros OptionalLatin1OctalValue

ClassIdentityEscape :: SourceCharacter but not ControlEscape nor CharacterClassEscape nor any of:
	0 b c x u

I'd be keen to hear your thoughts on all this.  I'll continue investigating what parts of the extended syntax are truly necessary from a compatibility perspective, and expect I should start moving this conversation onto a wiki page at some point.


p.s. One unrelated missing syntax error I've noticed is on quantified parenthetical assertions – browsers seem to permit /(?=)*/, which should be a syntax error.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <>

More information about the es-discuss mailing list