<div dir="ltr">Hi everyone,<br><p>Another week, another Quantum Flow engineering newsletter! We have a lot to cover, so let me get started.</p><p>Michael Layzell is getting really close on his work on <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1346415">bug 1346415</a>
in order to collect native stacks from Background Hang Reports through
telemetry on Nightly. There are several practical concerns around this
data collection, things such as not blowing up our telemetry ping size,
and also the processing of this data on the server side, and we have
some ideas on how we can improve this in the future. Since some data is
better than no data, we're starting by having each client send
a maximum of 300 of these native stacks in each ping, and
will hopefully raise this limit in the future to be able to collect more
data. He has also been helping with writing some scripts for
post-processing this data so that we can have an automatically generated
nightly report set from these pings to triage. The triage itself, of
course, will be a manual, excruciating (read: "fun"!) process for now,
until we think of something better.</p><p>We have finished an initial round of triage of the Quantum Flow bugs. We are using a few tags, which are all described <a href="https://docs.google.com/document/d/1Ka8eNAISQodT1mS_OXapFG-_kk94GoXyo4eKH1j7EV4/edit#heading=h.g074di4nyf2m">here</a>.
The most important bug tag to pay attention to at this point is [qf:p1]
in the status whiteboard field. This tag means we believe this bug may
have a large impact on performance, and it needs to be fixed <em><strong>now</strong></em>.
We try our best to make it obvious why we believe this to be the case,
and of course not all [qf:p1] bugs are of the same level of
importance, but if you believe there is strong evidence that a [qf:p1]
bug isn't of utmost importance for performance, please feel free to
raise the issue on the bug; it's best to correct any possible triage
mistakes as soon as we can. Otherwise, we really appreciate your
assistance in addressing these bugs. Note that we are dealing with a
massive project (making the entire web browser faster for all users in
all usage scenarios) under a very strict timeline (by Firefox 57!) and
the longer we let these bugs live in Firefox, the longer they can mask
smaller and less severe performance issues, putting the entire effort at
risk.</p><p>Next week we are going to have a work week around Quantum
Flow in the Toronto office. There are many people attending from
different parts of Mozilla and it's going to be a really exciting and
super packed week. Several things excite me personally. I expect to
spend some more time profiling and delving down into technical issues. I
also expect to spend some time talking to people on various teams about
how we can facilitate getting more help from even more engineers on
fixing the bugs that we are finding. One of my goals is to make
discovering new issues to fix the bottleneck of our pipeline, and I
hope to get closer to achieving that after next week. Another exciting
thing happening next week is that we have some members from the Quantum
DOM team also attending the work week (including myself, as I'm still
involved in that project as well.) We're hopefully going to have a more
concrete plan around cooperative scheduling of JavaScript running on
web pages, which is a really important part of the overall picture of
improving the browser's performance. I don't expect to be
able to send out one of these newsletters next week though, so expect
the next one in two weeks!</p><p>Now I want to talk a bit about our
synchronous IPCs. I've talked about them before, but they deserve more
air time, as based on the data we have so far, they are one of our
biggest performance issues at this point. I have been thinking about
good ways of making the extent of the problem more obvious. We already
have <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=SyncIPC">a tracker bug</a>,
and some people have been helping with a few of these bugs (see below),
but I still think our progress on this issue could be better. So let's
open up this closet and take a look at our skeletons, shall we?</p><p>I have prepared a <a href="https://docs.google.com/spreadsheets/d/1x_BWVlnQPg0DHbsrvPFX7g89lnFGa3lAIHWD_pLa_dE/edit#gid=844442583&fvid=785100780">Sync IPC Report for 2017-03-23</a>.
It's a spreadsheet, with a chart! So cool. The first thing you'll
notice is that I'm not great at data visualization. :-) With that out
of the way, let's look at the data. We could sort this data in various
ways, but I have chosen to stick to something super simple: sorting it in
descending order of the median time of each sync IPC multiplied by the
number of times it happens in the wild. You can inspect the data yourself, but
here is a human readable summary of where we are now:</p><ul><li><a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1331680">PCookieService::Msg_GetCookieString</a>
(aka, what happens when a page calls document.cookie!) at 34%. This is
the most horrible sync IPC that we have (and document.cookie is one of the most
popular APIs on the web.) Amy Chung is actively working on fixing this,
and Josh Matthews is helping her with providing feedback on her patch.
Thanks to you both!</li><li>PContent::Msg_RpcMessage and
PBrowser::Msg_RpcMessage at 26.9%. Together, these two form a big
bucket consisting of all of the sync IPCs triggered from JS. In order
to stop flying blind here, <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1348113">bug 1348113</a>
was filed to collect specific telemetry on this bucket. I recently
found out that a page calling navigator.userAgent to do UA sniffing
(which is also super common) <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1347425">can result in sync IPC</a> that happens through JS, and this stayed hidden from us for a long time in this telemetry data...</li><li>A number of <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1194751">PScreenManager sync IPC messages</a> at 12.8%. Kan-Ru Chen has done some amazing work to fix all of them, and the patch set is really close to landing.</li><li>Then there is a bit of a longer tail, and I have looked at some of them in some detail:<ul><li>CPOW
overhead: basically PJavaScript and anything under it. Some of this
could be caused by add-ons that aren't e10s compatible yet. I need to
investigate more to get a better sense of how true this statement is!</li><li>Graphics initialization sync IPCs: <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1337062">PContent::Msg_GetGfxVars</a> and <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1337063">PContent::Msg_GetGraphicsDeviceInitData</a>.
These should be easy to fix but we've had a bit of a difficult time
getting help in fixing them. Gerald Squelart has recently stepped up
to the task, thanks Gerald! These are important for navigation
performance, as I mentioned in my previous newsletter.</li><li><a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1343728">PContent::Msg_CreateWindow</a>.
This one also has a pretty bad impact on navigation, even when we don't
need to start a new content process! I have a patch that fixes this
enough to make things work for basic browsing, but it's far from passing
tests still...</li></ul></li></ul><p>If you see an IPC message on this
list that looks familiar to you and doesn't have a bug that tracks
fixing it already, please feel free to file one. If you are familiar
with an area of the code where one of these messages is being used,
please consider fixing one or two. :-)</p><p>Now, it's time for our
performance story of the week! This time we're going to look at how not
to do off-main-thread I/O. Usually when people talk about avoiding
main thread I/O, the goal is to make it so that the main thread doesn't
end up calling a function that could end up being blocked until the
(potentially spinning) disk finishes an I/O operation. Typically this
is done in one of two ways: either using a non-blocking I/O API that
the underlying OS provides (to get the OS to call you back when the I/O
is finished), or making a background thread call the blocking function
and notify your main thread when it is done. In our implementation of
XMLHttpRequest in Gecko, in order to support the blob response type, we
need to open a temporary file to write the incoming data to. Opening
this file is an I/O operation, and we use the second strategy in order
to avoid main-thread I/O. Now, it turns out that we had <a href="http://searchfox.org/mozilla-central/rev/a5c2b278897272497e14a8481513fee34bbc7e2c/dom/file/MutableBlobStorage.cpp#123">this code</a>
which expected NS_OpenAnonymousTemporaryFile() to fail in the
sandboxed content process, since the author assumed that opening the
temporary file handle there would fail. But that wasn't what that
function was doing at all! It was doing everything in its power to
do what the caller asked it to, that is, to open an anonymous temporary
file. The way that the function <a href="http://searchfox.org/mozilla-central/rev/2d24acd7f3e087c5a506f325684487013e1f1744/xpcom/io/nsAnonymousTemporaryFile.cpp#118">did it</a>
in the content process in a background thread was to dispatch a
synchronous runnable to the main thread, blocking the calling thread (in
this case, the Gecko IO thread) and then <a href="http://searchfox.org/mozilla-central/rev/2d24acd7f3e087c5a506f325684487013e1f1744/xpcom/io/nsAnonymousTemporaryFile.cpp#100">dispatching a synchronous IPC message</a>
to the parent process. At this point, two threads would be blocked in
the content process. As if that weren't enough, the handler for the
sync IPC in the parent process would then <a href="http://searchfox.org/mozilla-central/rev/a5c2b278897272497e14a8481513fee34bbc7e2c/dom/ipc/ContentParent.cpp#4063">call the same function</a> on the parent process main thread leading to <a href="http://searchfox.org/mozilla-central/rev/2d24acd7f3e087c5a506f325684487013e1f1744/xpcom/io/nsAnonymousTemporaryFile.cpp#160">main-thread I/O</a>
on our UI thread! Of course, all of this was an unintended
interaction between different parts of the code, and
I'm glad to report that this is all now <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1347031">fixed</a> on Nightly. :-)</p><p><br></p><p>Last
but not least, time for the credits section again. I would like to
thank the following individuals for their help in making Firefox faster
this past week. As always, apologies to those who I'm forgetting to
name here.</p><ul><li>Kris Maglione did some <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1333990">heroic work</a>
to avoid reparsing our content scripts every time we run them. This
was a pretty severe performance issue that impacts a lot of add-ons that
rely on content scripts, but fixing it wasn't very easy, and honestly
when the bug was filed I wasn't very hopeful to see it fixed any time
soon given the amount of work that was involved.</li><li>Sam Foster has been attacking a <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1334642">synchronous reflow</a>
that can happen when we (de)activate a browser window. The work is
ongoing, but these types of front-end bugs, even though they may not be
much fun to work on, are very important to fix and can remove a lot of
jank that we won't be able to get rid of in any other way. Thank you
Sam!</li><li>Mike Conley landed <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1340842">some instrumentation</a> for tab closing. In case you're wondering, this means we're taking tab closing performance <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1344302">very seriously</a>.</li><li>Mike Conley also made us <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1256472">create the about:blank placeholder document for lazily restored tabs after a session restore in the content process</a>.
If that sounds boring, how about this: he improved session restore
times for users with hundreds of tabs by a lot. Users are reporting
improvements on the scale of <em><strong>minutes</strong></em> (you read that right.)</li><li>Mike de Boer has been helping with triaging some <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1330635">session restore performance bugs</a>.</li><li>Kearwood (kip) Gilbert has been continuing his work on removing the <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1346923">synchronous</a> <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1346926">IPCs</a> used in the WebVR implementation.</li><li>Michael Layzell <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1337056">removed a synchronous IPC</a>
which was used to initialize the permission manager's database. As an
additional privacy win, the content process now only knows about the
permissions belonging to the websites that you have visited, not all of
the permissions stored in your profile!</li><li>Michael Layzell also added <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1342635">telemetry for IPC message serialization/deserialization</a>
that happens on the main thread. There's some evidence that this can
be expensive, and this probe will help us find the IPC messages where
this can be problematic in the wild.</li><li>Chris Pearce made <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1347031">media cache initialization use asynchronous IPC</a>.</li><li>Jeff Muizelaar <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1346585">removed an async pan/zoom logging message</a> which was slowing us down to log information that nobody was looking at!</li><li>Olli Pettay brought the <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1339758">performance of accessing MouseEvent.offsetX/Y on simulated click events</a> on par with other engines.</li><li>Edgar Chen and Boris Zbarsky worked on <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1347634">a</a> <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1347639">few</a> <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1347640">optimizations</a> for improving our innerHTML setter performance.</li><li>Henry Chang fixed a <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1325054">severe UI jank</a> that could occur when using tracking protection (for example in private browsing windows).</li></ul><div><br></div><div>Until next time, happy hacking!<br></div><div>-- <br><div class="gmail_signature"><div dir="ltr">Ehsan<br></div></div>
</div></div>