Reacting more strongly to low-memory situations in Firefox 25

Benjamin Smedberg benjamin at smedbergs.us
Mon Nov 25 17:02:50 UTC 2013


In crashkill we have been tracking crashes that occur in low-memory 
situations for a while. However, we are seeing a troubling uptick of 
issues in Firefox 23 and then 25. I believe that some people may not be 
able to use Firefox because of these bugs, and I think that we should be 
reacting more strongly to diagnose and solve these issues and get any 
fixes that already exist sent up the trains.

Followup to dev-platform, please.

= Data and Background =

See, as some anecdotal evidence:

Bug 930797 is a user who just upgraded to Firefox 25 and is seeing these 
a lot.
Bug 937290 is another user who just upgraded to Firefox 25 and is seeing 
a bunch of crashes, some of which are empty-dump and some of which are 
all over the place (maybe OOM crashes).
See also a recent thread "How to track down why Firefox is crashing so 
much." in firefox-dev, where two additional users are reporting 
consistent issues (one Mac, one Windows).

Note that in many cases, the user hasn't actually run out of memory: 
they have plenty of physical memory and page file available. In most 
cases they also have enough available VM space! Often, however, this VM 
space is fragmented to the point where normal allocations (64k jemalloc 
heap blocks, or several-megabyte graphics or network buffers) cannot be 
made. Because of work done during the recent tree closure, we now have 
this measurement in about:memory (on Windows) as vsize-max-contiguous. 
It is also being computed for Windows crashes on crash-stats for clients 
that are new enough (win7+).
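
To make the difference between total free VM and contiguous free VM 
concrete, here is a minimal sketch in Python. The function name, the 
region layout, and the 4 GiB address space are all made up for 
illustration; this is not how about:memory computes 
vsize-max-contiguous.

```python
def free_gaps(reserved, space_size):
    """Return the sizes of the free gaps between reserved (start, size) regions."""
    gaps = []
    cursor = 0  # end of the highest reserved byte seen so far
    for start, size in sorted(reserved):
        if start > cursor:
            gaps.append(start - cursor)
        cursor = max(cursor, start + size)
    if cursor < space_size:
        gaps.append(space_size - cursor)
    return gaps


MiB = 1024 * 1024
# Hypothetical layout: a 1 MiB reservation every 16 MiB across a 4 GiB space.
space = 4096 * MiB
reserved = [(offset, 1 * MiB) for offset in range(0, space, 16 * MiB)]

gaps = free_gaps(reserved, space)
total_free = sum(gaps)        # all free address space added together
max_contiguous = max(gaps)    # the largest single block that could be handed out
```

In this invented layout 3840 MiB of address space is free in total, yet 
no single allocation larger than 15 MiB can succeed.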

Unfortunately, when we are out of memory, crash reports often come back 
as empty minidumps (because the crash reporter has to allocate memory 
and/or VM space to create minidumps). We believe that most of the 
empty-minidump crashes present on crash-stats are in fact also 
out-of-memory crashes.

I've been creating reports about OOM crashes using crash-stats and found 
some startling data:
Looking just at the Windows crashes from last Friday (22-Nov):
* probably not OOM: 91565
* probably OOM: 57841
* unknown (not enough data because they are running an old version of 
Windows that doesn't report VM information in crash reports): 150874

The criteria for "probably OOM" are:
* Has an OOMAllocationSize annotation, meaning jemalloc aborted in an 
infallible allocation
* Has "ABORT: OOM" in the app notes, meaning XPCOM aborted in infallible 
string/hashtable/array code
* Has <50MB of contiguous free VM space
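
As a rough sketch, the criteria above could be applied to a single crash 
report like this. It assumes that meeting any one criterion is enough, 
that a report is a plain dict, and that the app notes and the largest 
free VM block live under the keys "app_notes" and 
"largest_free_vm_block"; those two key names are hypothetical. The 
annotation is spelled OOMAllocationSize here, as it is later in this 
message. The real logic is in the oom-classifier.py script linked below 
and may differ.

```python
LOW_CONTIGUOUS_VM = 50 * 1024 * 1024  # the 50MB threshold from the list above


def classify(report):
    """Classify one crash report dict as 'OOM', 'not OOM', or 'unknown'."""
    # jemalloc aborted in an infallible allocation
    if "OOMAllocationSize" in report:
        return "OOM"
    # XPCOM aborted in infallible string/hashtable/array code
    if "ABORT: OOM" in report.get("app_notes", ""):
        return "OOM"
    # Hypothetical key; older Windows versions do not report VM information
    largest_free = report.get("largest_free_vm_block")
    if largest_free is None:
        return "unknown"
    if int(largest_free) < LOW_CONTIGUOUS_VM:
        return "OOM"
    return "not OOM"
```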

This data seems to indicate that almost 40% of our Firefox crashes on 
Windows, among those with enough data to classify, are due to OOM 
conditions.
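
The figure of almost 40% appears to be the OOM share of the crashes that 
could be classified at all. A quick check of the arithmetic, using only 
the counts quoted above (the variable names are mine):

```python
not_oom = 91565
oom = 57841
unknown = 150874

# Share among crashes that carried enough data to classify
share_of_classified = oom / (oom + not_oom)
# Lower bound: share among all Windows crashes that day, unknowns included
share_of_all = oom / (oom + not_oom + unknown)

print(f"{share_of_classified:.1%} of classifiable crashes")
print(f"{share_of_all:.1%} of all Windows crashes that day")
```

Extending the first figure to every crash assumes the unknown group 
splits the same way as the classified group, which this data cannot 
confirm.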

Because one of the long-term possibilities discussed for solving this 
issue is releasing a 64-bit version of Firefox, I additionally broke 
down the "OOM" crashes into users running a 32-bit version of Windows 
and users running a 64-bit version of Windows:

OOM,win64,15744
OOM,win32,42097

I did this by checking the "TotalVirtualMemory" annotation in the crash 
report: if it reports 4G of TotalVirtualMemory, then the user is running 
a 64-bit Windows, and if it reports either 2G or 3G, the user is running 
a 32-bit Windows. Most of these crashes come from 32-bit Windows, so I 
do not expect that shipping Firefox for win64 will help users who are 
already experiencing memory issues, although it may well help new users 
and users who are running memory-intensive applications such as games.
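
A minimal sketch of that check, assuming the TotalVirtualMemory 
annotation is a byte count and that rounding to the nearest GiB is an 
acceptable way to bucket it. The function name and the rounding are my 
assumptions, not necessarily what the analysis script does.

```python
GiB = 1024 ** 3


def windows_bitness(total_virtual_memory):
    """Guess the Windows bitness from a TotalVirtualMemory value in bytes."""
    gib = round(int(total_virtual_memory) / GiB)
    if gib == 4:
        return "win64"
    if gib in (2, 3):
        return "win32"
    return "unknown"


# Counts quoted above
oom_by_os = {"win64": 15744, "win32": 42097}
win32_share = oom_by_os["win32"] / sum(oom_by_os.values())
print(f"{win32_share:.1%} of probable-OOM crashes are on 32-bit Windows")
```

Rounding rather than comparing exactly means a value slightly below a 
whole number of GiB still lands in the intended bucket.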

Scripts for this analysis at 
https://github.com/mozilla/jydoop/blob/master/scripts/oom-classifier.py 
if you want to see what it's doing.

= Next Steps =

As far as I can tell, there are several basic problems that we should be 
tackling. For now, I'm going to brainstorm some ideas and hope that 
people will react or take on some of these items.

== Measurement ==

* Move minidump collection out of the Firefox process. This is something 
we've been talking about for a while but apparently never filed, so it's 
now filed as https://bugzilla.mozilla.org/show_bug.cgi?id=942873
* Develop a tool/instructions for users to profile the VM allocations in 
their Firefox process. We know that many of the existing VM problems are 
graphics-related, but we're not sure exactly who is making the 
allocations, and whether they are leaks, cached textures, or other 
things, and whether it's Firefox code, Windows code, or driver code 
causing problems. I know dmajor is working on some xperf logging for 
this, and we should probably try to expand that out into something that 
we can ask end users who are experiencing problems to run.
* The about:memory patches which add contiguous-vm measurement should 
probably be uplifted to Fx26, along with any other measurement tools 
that would be valuable diagnostics.

== VM fragmentation ==

Bug 941837 identified a bad VM allocation pattern in our JS code which 
was causing 1MB VM fragmentation. Getting this patch uplifted seems 
important. But I know that several other things landed as a part of 
fixing the recent tree closure: has anyone identified whether any of the 
other patches here could be affecting release users and should be uplifted?

== Graphics Solutions ==

The issues reported in bug 930797 at least appear to be related to HTML5 
<video> rendering. The STR aren't precise, but it seems that we should 
try and understand and fix the issue reported by that user. Disabling 
hardware acceleration does not appear to help.

Bas has a bunch of information in bug 859955 about degenerate behavior 
of graphics drivers: they often map textures into the Firefox process, 
and sometimes cache the latest N textures (N=200 in one test) no matter 
what the texture size is. I have a feeling that we need to do something 
here, but it's not clear what. Perhaps it's driver-specific workarounds, 
or blacklisting old driver versions, or working with driver vendors to 
have better behavior.

== Dealing with OOM crash sites ==

Currently we still have a fair number of call sites that crash with 
infallible allocation or after allocation failure where the allocations 
are potentially large or huge. In general, infallible allocation should 
only be used for fixed-size quantities (C++ classes). Any arrays where 
the count is controlled by content, or large buffers for graphics or 
networking data should be allocated using fallible allocators, 
null-checked, and the system should propagate failure.

I am working on generating some reports on existing crashes where 
OOMAllocationSize is variable, and also crash signatures that correlate 
highly with OOM conditions. We should fix these sites.
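
One possible shape for the first of those reports, as a sketch only: 
group crashes by signature and keep the signatures whose 
OOMAllocationSize values vary widely. The dict layout, the "signature" 
key, and the 1MB spread threshold are assumptions for illustration, not 
the actual report.

```python
from collections import defaultdict


def variable_size_signatures(crashes, min_spread=1024 * 1024):
    """Return {signature: (min_size, max_size)} where the sizes vary widely."""
    sizes = defaultdict(list)
    for crash in crashes:
        # Only crashes that carry the annotation are relevant here
        if "OOMAllocationSize" in crash:
            sizes[crash["signature"]].append(int(crash["OOMAllocationSize"]))
    return {
        sig: (min(vals), max(vals))
        for sig, vals in sizes.items()
        if max(vals) - min(vals) >= min_spread
    }


# Made-up sample data; these signatures are not real crash signatures.
sample = [
    {"signature": "hypothetical::GrowBuffer", "OOMAllocationSize": "4096"},
    {"signature": "hypothetical::GrowBuffer", "OOMAllocationSize": "67108864"},
    {"signature": "hypothetical::NewNode", "OOMAllocationSize": "56"},
    {"signature": "hypothetical::NewNode", "OOMAllocationSize": "56"},
    {"signature": "hypothetical::Other"},
]
report = variable_size_signatures(sample)
```

A signature whose failing allocation is always the same small size is 
probably a fixed-size object, while one with a wide spread is a 
candidate for a fallible allocation and a null check.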

This is only a stopgap measure, because we see plenty of crashes where 
OOMAllocationSize is very small (56 bytes), but it will help keep the 
browser alive for longer and also foil some trivial DoS attacks.

== Regression ranges ==

Some of the issues appear to be recently introduced in Firefox 25. We 
need to jump on regression ranges ASAP. I could really use help working 
with users such as those identified at the top of this message to see if 
there are regression ranges in nightly builds that cause more issues.

== Last-ditch UI ==

When contiguous VM starts getting low, we should probably warn the user 
and ask them to restart Firefox soon or risk crashing. I know that this 
sucks, but a warning before you crash at least gives you a chance to 
save things. I have filed this as 
https://bugzilla.mozilla.org/show_bug.cgi?id=942892

--BDS



