Inline regexps and caching

Richard Cornford Richard at litotes.demon.co.uk
Sat Jan 24 22:56:22 PST 2009


Brendan Eich wrote:
> On Jan 24, 2009, at 5:42 PM, Richard Cornford wrote:
>> Ugly, and an example of using a hammer to crack a nut.
>
> I do this all the time, works great ;-).

And I never do it, and that works great too.

> Seriously, there's more afoot than can be patched by
> resetting lastIndex.

My intention was to suggest that the 'prettiest' solution was to delete 
the superfluous 'g' from the end of the regular expression literal.  But 
resetting the - lastIndex - prior to using the regular expression object 
would also eliminate the undesirable behaviour in Laurens Holst's code, 
and has the merit of directly addressing the characteristic of regular 
expressions that results in the issue.
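
A minimal sketch of both remedies follows. The pattern and input are 
made up for illustration and are not taken from Laurens Holst's code:

```javascript
// Illustrative only: the pattern and input string are invented.
var s = "abc";
var re = /b/g;                 // one object, reused for every call

var first = re.exec(s);        // match at index 1; lastIndex is now 2
var second = re.exec(s);       // search starts at index 2, so no match

// Remedy 1: reset lastIndex before each use.
re.lastIndex = 0;
var third = re.exec(s);        // matches
re.lastIndex = 0;
var fourth = re.exec(s);       // matches again

// Remedy 2: drop the 'g' flag; exec then always starts at index 0.
var plain = /b/;
var fifth = plain.exec(s);     // matches
var sixth = plain.exec(s);     // matches again
```

Neither remedy creates a new object per use.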

I admit, though, that the - new RegExp - thing is a bit of a bugbear for 
me. There are two reasons for that. The first is that I have encountered 
orders of magnitude more issues arising from people failing to cope with 
the 'double escaping' needed in string literal arguments for the RegExp 
constructor than issues following from the handling of - lastIndex -. So, 
for example, if you want to match a dot or a backslash in a regular 
expression they will need to be escaped by preceding them with a 
backslash, but in the string literal that backslash needs to be escaped 
with a second backslash if the RegExp constructor is going to see it (and 
in the case of matching a backslash, that one also needs to be escaped in 
the string literal). People just seem to make a lot of mistakes when being 
required to do that, and those mistakes don't seem to be easy to spot as 
the resulting regular expressions still 'work', even to the extent of 
sometimes making some 'correct'/expected matches.
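
A short sketch of how such a mistake can go unnoticed. The patterns here 
are invented for the example:

```javascript
// Illustrative only: these patterns are invented for the example.
var dotLiteral  = /\./;               // matches a literal dot
var dotCorrect  = new RegExp("\\.");  // the string holds \. (same pattern)
var dotMistaken = new RegExp("\.");   // the string holds only . (any character)

var literalOnDot   = dotLiteral.test("a.b");   // dot present
var correctOnDot   = dotCorrect.test("a.b");   // dot present
var mistakenOnDot  = dotMistaken.test("a.b");  // looks right by accident
var correctNoDot   = dotCorrect.test("ab");    // no dot in the input
var mistakenNoDot  = dotMistaken.test("ab");   // still matches: the bug

// Matching a backslash: two in the literal, four in the string literal.
var slashLiteral = /\\/;
var slashCorrect = new RegExp("\\\\");
var slashFound   = slashCorrect.test("a\\b");  // input contains one backslash
```

The mistaken version passes any test whose input happens to contain a 
dot, which is why it tends to survive casual checking.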

The second reason is that the construct is often proposed without 
explanation and so can be received as a mystical incantation to be chanted 
in the face of every regular expression regardless of whether it is 
achieving anything useful in context. And so you encounter things like:-

...
 format: function(s) {
  return $.tablesorter.formatFloat(s.replace(new RegExp(/%/g),""));
 },
...
(from a jQuery table sorting plug-in)

- and end up wondering what on earth the author thought that - new 
RegExp - was supposed to achieve.
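
For comparison, a sketch of the same replacement with and without the 
wrapper. The function names are hypothetical, and the point assumes the 
specified behaviour of - replace - with a global regular expression, which 
starts from index zero regardless of any previous - lastIndex - value:

```javascript
// Hypothetical function names; only the replace call mirrors the plug-in.
function stripWrapped(s) {
  return s.replace(new RegExp(/%/g), "");  // copies the regexp on every call
}

function stripPlain(s) {
  return s.replace(/%/g, "");              // same result, no extra object
}

// A global regexp with a stale lastIndex still replaces from the start.
var shared = /%/g;
shared.lastIndex = 2;
var fromStale = "1%2%".replace(shared, "");
```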

>> The issue is provoked by the fact that for a regular
>> expression with the global flag set the - exec - method
>> employs the regular expression object's - lastIndex - property,
>> leaving it set to the end index of the last match made. Knowing
>> that suggests that a simple 'solution' would be to explicitly
>> set the regular expression object's - lastIndex - property to
>> zero before using it. That must be cheaper than creating a new
>> regular expression object just for the side effect of then
>> having one with a zero - lastIndex - property.
>
> The more general problem is shared mutable literal-expressed
> singletons. In no other case (object or array initialiser,
> function expressions, primitive literals) does evaluation
> return the singleton created as if at parse time. Mutation
> hurts, sharing should be explicit.

All of that is true, and making sure the next language version eliminates 
that is a good idea. But that does not help people who have to address 
current ES 3 implementations.

> To match the other kinds of literals and avoid bugs such as
>
> https://bugzilla.mozilla.org/show_bug.cgi?id=474412

Now that is an issue that relates to the identity of regular expression 
objects, and so can only be addressed by creating distinct objects with - 
new RegExp -.
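
A minimal sketch of what distinct identity buys. The helper name is 
hypothetical and this is not a reproduction of the bug report above:

```javascript
// Hypothetical helper, not taken from the bug report.
function makeScanner() {
  return new RegExp("a", "g");   // a distinct object on every call
}

var one = makeScanner();
var two = makeScanner();

one.exec("aaa");                 // advances one.lastIndex to 1

var distinct  = one !== two;              // separate objects
var untouched = two.lastIndex === 0;      // state of 'two' is unaffected
```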

> Efficiency concerns are secondary but can be addressed by
> lightweight cloning of a shared-immutable compiler-created
> regexp.

"Can be addressed by ...", 'will be addressed by ...' and 'MUST be 
addressed by ...' are all very different things. It is not in the remit of 
the new specification to be requiring specific optimisations in future 
implementations.

>> In addition, knowing the mechanism also directs attention
>> towards the global flag; does the regular expression being
>> used need to have the global flag set in the first place?
>> If the flag is not set then subsequent - exec - uses will
>> always start at the zero index. The example regular
>> expression used above only appears to be interested in
>> making a single match so probably there was never a need
>> to have the flag set.
>
> This is an optimization challenge for implementors, not a
> reason to specify a shared singleton with mutable state
> (lastIndex is mutable and set to 0 even without the 'g' flag).

I am not saying that there should be a shared singleton. In the situation 
as we have it now there are implementations that create regular expression 
literals while parsing, and others that create them when the expression is 
evaluated. So it is not possible to rely on the former or expect the 
latter. The result is a minefield that needs to be cleaned up. But in the 
meanwhile bulldozing all regular expression uses with - new RegExp - seems 
an extreme alternative to recognising the few that can blow up in your 
face and defusing them individually.

Richard Cornford. 


