Quantum Flow Engineering Newsletter #2

Ehsan Akhgari ehsan.akhgari at gmail.com
Thu Mar 16 15:07:19 UTC 2017

Hi everyone,

This past week was another busy week chasing down performance issues in
Firefox.  We managed to knock out a few issues, get closer to closing out a
couple of really high impact ones, and make good progress on getting
performance data from telemetry, which will hopefully allow us to
prioritize our efforts systematically and focus first on the issues that
hurt our users the most in the wild.

Another nice aspect that we are starting to get some traction on is scaling
up the engineering side of the project.  Jean Gong has started to help out
with the project management side of things, and we have started to triage
the list of bugs that we have, with the goal of identifying our highest
priority bugs to ensure that they all have assignees and are being worked
on and won’t fall through the cracks.  We appreciate your help if someone
approaches you asking for help with fixing, code reviews, or answering a
question about one of these bugs!

There is a work week for Quantum Flow on the week of March 27 here in
Toronto.  We’re preparing to meet face to face for the second time for this
project.  One of the things that I’m trying to have ready in time for this
work week is telemetry data about where Firefox is performing really badly
in the wild so that we can focus there first.  Right now we have Background
Hang Reports data that can collect a backtrace of hung threads in two
modes: if the thread is hung for more than 128ms, a backtrace using Gecko
Profiler pseudo stacks is captured, and if a thread is hung for more than 8
seconds, a backtrace using the full native stack is captured.  The pseudo
stack backtrace doesn’t include a lot of information: it only consists of
the manual PROFILER_LABEL annotations that we have added to the source
code.  I have already skimmed over the former
set of data and it’s really hard to gather much meaningful information from
this data.  The native stack traces would be much more useful, but while 8
seconds of a thread being hung is really bad, that’s more of a hang
scenario than a badly performing browser, so we’re trying to reduce this
threshold in bug 1346415 to gather better data here.  I hope to have some
more information to share about this next week.
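
To make the gap between the two thresholds concrete, here is a minimal
sketch of the decision described above.  The 128ms and 8 second values come
from the text; the function name and the returned labels are hypothetical
and are not part of the Background Hang Reports code.  It reports the most
detailed kind of backtrace a hang of a given length would get.

```python
from typing import Optional

# Thresholds quoted in the newsletter text.
PSEUDO_STACK_THRESHOLD_MS = 128
NATIVE_STACK_THRESHOLD_MS = 8_000


def capture_mode(hang_ms: float) -> Optional[str]:
    """Return the most detailed backtrace kind for a hang of this length.

    Hypothetical helper for illustration only; not a Gecko API.
    """
    if hang_ms > NATIVE_STACK_THRESHOLD_MS:
        return "native"   # full native stack
    if hang_ms > PSEUDO_STACK_THRESHOLD_MS:
        return "pseudo"   # only the manual profiler labels
    return None           # too short to be reported at all
```

Under this sketch a 2 second hang, which is already very noticeable to a
user, still only yields a pseudo stack.  That is the range a lower native
stack threshold would cover.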

Now, time for the performance story of this week, page navigations!  As web
browser makers, we talk about page load times a lot, and we all have heard
of what usually gets talked about in this context many times.  I’m going to
talk about what usually doesn’t get talked about though: what can happen in
real life when you navigate from page A to B.  Firstly, with multiple
content processes, we may need to start a new content process for the
navigation.  Right now when a content process starts up, it sends a number
of synchronous IPC messages to the parent process in order to initialize
various components (although we have removed all but the last few
remaining ones).  This is especially bad since at this time the parent
process is typically busy doing other work.  For example, since the kind of
navigation that results in a process switch typically happens in a new
tab/window, the parent process is typically busy opening a new tab/window,
and because of that, in really bad cases I have seen these synchronous
messages add up to over a second during which the content process is just
paused, doing no work whatsoever.  This can slow down navigations
significantly.  There is also a synchronous IPC message on the path of all
navigations (bug 1337064), so we run this risk on every navigation.  We
also do some synchronous IPCs if the navigation results in an error page
under some situations, which is of less concern since those are less common
(well, one would hope at least.)
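
The effect of a busy parent process on a chain of synchronous messages can
be sketched with a small deterministic model.  This is only an
illustration: the function name, the timings and the simplifying
assumptions (the content process sends its messages back to back, and the
parent handles them one at a time once it is free) are mine, not
measurements from Firefox.

```python
from typing import Sequence


def sync_stall_ms(parent_busy_until_ms: float,
                  handling_costs_ms: Sequence[float]) -> float:
    """Total time the content process spends blocked on sync replies.

    Hypothetical model: the content process starts at time 0 and sends each
    sync message as soon as the previous reply arrives.
    """
    now = 0.0                              # content process clock
    parent_free_at = parent_busy_until_ms  # when the parent can next respond
    blocked = 0.0
    for cost in handling_costs_ms:
        start = max(now, parent_free_at)   # wait for the parent to be free
        reply_at = start + cost
        blocked += reply_at - now          # content does nothing meanwhile
        now = reply_at
        parent_free_at = reply_at
    return blocked


# Three startup messages that each take 50ms for the parent to handle.
idle_parent = sync_stall_ms(0, [50, 50, 50])    # 150.0
busy_parent = sync_stall_ms(900, [50, 50, 50])  # 1050.0
```

In this model the messages themselves cost the same in both cases.  The
extra 900ms comes entirely from the parent being busy with something else
when the first message arrives, which is something the content process
cannot predict or control.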

Fixing each one of these doesn’t mean that navigations suddenly become
faster, of course.  The logic works more against us than in our favour: not
fixing them means that we will always run the risk of page navigations
being slow in Firefox due to unpredictable factors.  What’s really worrying
is that in general it’s really hard to know what performance cliffs like
these are going to be on the path of any critical user interaction, and
these issues have a way of creeping in over time.  This is why a while ago
we decided to disallow the addition of new synchronous IPC messages by
default (bug 1336919) to avoid programmers adding more issues of this
nature to the code base.  We may still decide to add a few more of these
messages here and there, but only after really careful consideration and
measurement.  Like most other things in engineering, this requires careful
thought and balancing, but it’s good to have default practices that don’t
result in potentially disastrous performance cliffs.  Next week, I’m going
to give you another example of one of these cliffs, showing how, through an
unintended consequence, code that was trying to avoid doing main-thread I/O
ended up blocking not one, but three threads to do that very I/O!
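
A "disallowed by default" policy for synchronous messages can be pictured
as an allowlist check that runs over the declared messages and flags any
sync message nobody has explicitly approved.  The sketch below is a generic
illustration of that idea under my own assumptions; the class, function and
message names are hypothetical, and it does not describe how bug 1336919 is
actually implemented.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Message:
    name: str      # e.g. "PExample::GetThing" (made-up name)
    is_sync: bool


def unapproved_sync_messages(messages: Iterable[Message],
                             allowlist: Set[str]) -> List[str]:
    """Return the names of sync messages that are not on the allowlist."""
    return sorted(m.name for m in messages
                  if m.is_sync and m.name not in allowlist)


declared = [
    Message("PExample::GetThing", is_sync=True),    # reviewed and approved
    Message("PExample::NotifyThing", is_sync=False),
    Message("PExample::NewSyncCall", is_sync=True), # newly added, unapproved
]
approved = {"PExample::GetThing"}

# A build or lint step could fail whenever this list is non-empty.
violations = unapproved_sync_messages(declared, approved)
```

The point of such a default is that adding a new sync message becomes a
deliberate, reviewed decision rather than something that slips in
unnoticed.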

Now, on to the credits section.  I’d like to take a moment to recognize the
work of the following individuals who have helped with various aspects of
the Quantum Flow project.  Thank you very much for your help this past
week!  (Apologies to those who I’m probably forgetting to name here.)

* Kan-Ru Chen’s patches for bug 1194751 (moving PScreenManager off of sync
IPC) are still under review.
* Amy Chung submitted a first iteration patch for bug 1331680 (moving
document.cookie off of sync IPC) for feedback.
* Kearwood (kip) Gilbert has been helping with removing various sync IPC
messages used in the WebVR implementation (bug 1344216 dependencies).
* Boris Zbarsky made various input/textarea selection management APIs
faster in many cases (bug 1343275, bug 1332036, bug 1342197).
* Greg Tatum and Markus Stange’s improvements to the https://perf-html.io/
profiler UI significantly improve the responsiveness of the interface,
making it much easier to look at profiles.
* Nicholas Nethercote has been helping by fixing various threading, race
and deadlock issues in the Gecko Profiler backend.
* Michael Layzell has been helping with various telemetry data collection
* Mike Conley has been teaching me how to use our telemetry analysis
