RegExp syntax suggestion: allow CharacterClassEscape in CharacterRange
Gavin Barraclough
barraclough at apple.com
Wed Dec 8 12:43:06 PST 2010
According to the ES5 spec, a regular expression such as /[\w-_]/ should generate a syntax error. Unfortunately, a significant quantity of existing code appears to break if this behavior is implemented (I have been experimenting with bringing WebKit's RegExp implementation into closer conformance to the spec), and other implementations commonly appear to ignore this error.
The parsing of this expression matches a single NonemptyClassRanges of the form "ClassAtom - ClassAtom", where the first ClassAtom is a CharacterClassEscape and the second a SourceCharacter. Per section 15.10.2.15 of the spec this calls CharacterRange, resulting in this syntax error:
1. If A does not contain exactly one character or B does not contain exactly one character then throw a SyntaxError exception.
I'd like to propose a minimal change to hopefully allow implementations to come into line with the spec, without breaking the web. I'd suggest changing the first step of CharacterRange to instead read:
1. If A does not contain exactly one character or B does not contain exactly one character then create a CharSet AB containing the union of the CharSets A and B, and return the union of CharSet AB and the CharSet containing the one character -.
This is roughly equivalent to implicitly escaping the hyphen in any invalid range*, so /[\w-_]/ is treated as /[\w\-_]/. I believe this change would have a low impact on the spec, that it should be feasible for implementors to easily adopt this behavior, and that this should commonly be compatible with existing code that is currently not spec compliant.
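The proposed step can be sketched as follows. This is only an illustration, not text from the spec or from any engine: CharSets are modelled as JavaScript Sets of single-character strings, and the names characterRange and wordChars are hypothetical. The remaining steps follow the existing CharacterRange algorithm in 15.10.2.15.

```javascript
// Assumption of this sketch: a CharSet is a Set of single-character strings.

// CharacterRange from 15.10.2.15, with the proposed replacement for step 1.
function characterRange(A, B) {
  if (A.size !== 1 || B.size !== 1) {
    // Proposed step 1: instead of throwing a SyntaxError, return the
    // union of A, B and the one-character CharSet containing "-".
    return new Set([...A, ...B, "-"]);
  }
  // Existing steps: take the single character of each CharSet and
  // build the inclusive range between their code unit values.
  const [a] = A;
  const [b] = B;
  const i = a.charCodeAt(0);
  const j = b.charCodeAt(0);
  if (i > j) {
    throw new SyntaxError("Range out of order in character class");
  }
  const result = new Set();
  for (let code = i; code <= j; code++) {
    result.add(String.fromCharCode(code));
  }
  return result;
}

// \w modelled as a CharSet: a-z, A-Z, 0-9 and "_" (63 characters).
const wordChars = new Set(
  "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_"
);

// /[\w-_]/ : A is \w, B is the single character "_".
const cls = characterRange(wordChars, new Set(["_"]));
console.log(cls.has("-"), cls.has("a"), cls.has("_"), cls.has("+"));
// prints: true true true false
```

Under this model /[\w-_]/ produces the same CharSet as /[\w\-_]/, while a well-formed range such as a-c still goes through the unchanged steps.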
many thanks,
Gavin
[ * However, this is not exactly equivalent to treating the hyphen in an invalid range as having been escaped. Consider /[\d-a-z]/. Escaping the hyphen in the invalid range would give the expression /[\d\-a-z]/, in which case a-z would be matched as a CharacterRange. This would arguably be a more intuitive interpretation of the expression, but changing the language to match this would require a more intrusive change to the grammar, which I'm assuming would not be desirable. ]
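The difference described in the footnote can be illustrated with two unambiguous patterns. This is a sketch under one reading of the grammar: with the proposed step, \d-a yields the digits, "a" and "-", and the trailing "-z" then contributes "-" and "z", which is written explicitly below as /[\da\-z]/. The variable names are hypothetical, and the ambiguous pattern /[\d-a-z]/ itself is deliberately not used, since its handling is engine-dependent.

```javascript
// Reading under the proposed step 1: digits, "a", "-" and "z".
const proposedReading = /^[\da\-z]$/;
// Reading with the hyphen treated as escaped: "a-z" stays a CharacterRange.
const escapedReading = /^[\d\-a-z]$/;

for (const ch of ["5", "-", "a", "m", "z"]) {
  console.log(ch, proposedReading.test(ch), escapedReading.test(ch));
}
// prints:
// 5 true true
// - true true
// a true true
// m false true
// z true true
```

Only characters strictly between "a" and "z", such as "m", distinguish the two interpretations.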