A tale of OOMs and thumbnails: the abstract
rnewman at mozilla.com
Mon Nov 4 13:39:21 PST 2013
There are a few abstract observations that resulted from the concrete problems in my last message. They essentially boil down to: how can we protect ourselves from our own code? Or, phrased differently: how can I avoid introducing bugs in the future?
I am very keen to hear ideas on how to shift the odds in our favor. If we can gather some recommendations that we agree on, I'll write them up (and add them to the reviewer guidelines).
Firstly, we ended up with a bunch of OOMs on mid-to-high-end devices because we made some valid assumptions on smaller devices.
On an older RAZR device, with 248-pixel-wide thumbnails, jumping to the next power of two is fine. Arguably on *any* device, you don't want to take a screenshot smaller than 256 or 512px.
Even if it blows up one power too far, it's not 2.3MB. And the larger the screenshot, the prettier it is when scaled down. That decision was valid, right up until we crossed the 512px column width threshold.
I also heard opinions in IRC that this size didn't matter, because of course that was just the capture size, which any device can do for just a moment, right? That assumption wasn't borne out by the code: we store the full-size image (PNG-compressed) in the DB and extract it on launch, and we hold a reference to the full-size bitmap rather than scaling it down and throwing away the larger version.
We make decisions like this all the time: buffer sizes, batch sizes, preferred image dimensions, maximum cache thresholds. As far as I know, we don't have an institutionalized mechanism for forcing ourselves to be aware of these decisions. Imagine if we had a test that verified that our thumbnails were never more than a sanity threshold -- in 2010 that sanity threshold would have been maybe 500KB, and each time that test failed (e.g., when we first hit xhdpi devices), with a nice descriptive error message, we'd be forced to reassess whether we should raise the limit, change our approach, or do something else.
When you write or review code that contains an explicit or systemic limit, think about whether you want to add a test, or a runtime check, to call attention to a broken assumption. If nothing else, this will force you to think about the implicit assumptions you're making. Perhaps the right outcome is a telemetry probe that reports cache fullness, or cache resize events. Perhaps it's a test that just checks the assumption. At the very least it ought to be a comment.
(And I suppose the limit applies in the other direction, too: on 2015's 16GB handsets, is today's 512KB cache size a good tradeoff?)
Secondly, it doesn't seem like we're paying as much attention as we should to memory usage… else we'd have spotted this before on our daily-driver phones. I strongly encourage people to run Fennec with `monitor` attached. You can watch incremental memory allocations, and take heap snapshots and analyze them. Hell, I learn a lot just by watching the full adb log scroll by.
(See <https://github.com/rnewman/android-quiet-logs> if you want to do something noisier but more valuable than `fgrep Gecko`.)
I won't go so far as to say that this should be a pre-commit hook, or even a mandatory review step, but I know I'm going to keep this in mind for my next review or patch.
I filed a meta bug off which to hang bugs that result from memory analysis. Next time you're testing by hand, and you have a non-empty profile, try grabbing a memory snapshot and familiarizing yourself with the interface. Maybe you find that some feature is holding on to a reference and leaking memory… or maybe you watch the incremental allocator and wonder why it cost two megs to open a menu. File a bug!
Thirdly, our code architecture is difficult to understand (e.g., with favicon and thumbnail logic spanning tabs, listeners, the favicon system itself, the DB, bits of Gecko, …). That leads to incremental changes that orphan code: it's cargo-culted forward between revisions. It was enormously valuable to me to have pointers to bugs explaining *why* we did one thing and not another. Fortunately I have an obstreperous and questioning personality!
It also leaves a lot of opportunity for memory leaks -- a foo holds on to a bar, the bar holds on to a map, the map contains tasks, the tasks hold on to … . When you need six buffers open to figure out who owns what, you're in trouble.
I'd love for us to spend a little of our copious free time pushing towards simplicity -- pushing related logic together, reducing coupling, avoiding the temptation to just fling another runnable onto a background thread and mutate another instance's member variables. It'll pay off in the long run.
More information about the mobile-firefox-dev