A Crash Trend Dashboard (wrt to crashes and clients)
Benjamin Smedberg
benjamin at smedbergs.us
Wed Mar 1 15:00:17 UTC 2017
On Tue, Feb 28, 2017 at 10:50 PM, Andre Duarte <aduarte at mozilla.com> wrote:
> Hi all,
>
> we built this dashboard with the intent of getting a summary of several
> crash rates and types, with a special focus on content crash rates since
> the introduction of Electrolysis.
>
> The increase in crash rates is expected, due to the correlation of content
> crashes with e10s adoption. As seen here:
> https://metrics.mozilla.com/protected/sguha/crashgraphs/#crash-rates-e10s,
> content crashes are the main
> difference between e10s and non-e10s users, while main and plugin crashes
> are not dissimilar between the two groups (click on the buttons on the
> right-hand-side to toggle the crash types to compare between the two
> groups). Therefore, I would suggest waiting until e10s adoption stabilizes
> in order to see whether crash rates even out as well.
>
This doesn't make sense to me. One of the release criteria for e10s, which
we carefully measured before release, was that the crash rate of e10s (main
+ content - contentshutdown) was no higher than the crash rate of non-e10s
(main).
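As a rough sketch of that criterion: the e10s rate counts main plus content
crashes minus content shutdown crashes, and is compared against a
main-process-only baseline rate. The field names, the per-thousand-usage-hours
unit, and the counts below are assumptions for illustration only, not the
official definitions or real data.

```python
from dataclasses import dataclass


@dataclass
class CrashCounts:
    """Crash counts and usage for one cohort (all field names are hypothetical)."""
    main: int
    content: int
    content_shutdown: int
    usage_khours: float  # thousands of usage hours; this unit is an assumption


def e10s_rate(c: CrashCounts) -> float:
    """(main + content - contentshutdown) per thousand usage hours."""
    return (c.main + c.content - c.content_shutdown) / c.usage_khours


def main_only_rate(c: CrashCounts) -> float:
    """Main-process crashes per thousand usage hours."""
    return c.main / c.usage_khours


def meets_criterion(e10s: CrashCounts, baseline: CrashCounts) -> bool:
    """True when the e10s rate is no higher than the main-only baseline rate."""
    return e10s_rate(e10s) <= main_only_rate(baseline)


# Made-up counts: the main rate is unchanged and content crashes are added,
# so the combined e10s rate exceeds the baseline and the check fails.
baseline = CrashCounts(main=100, content=0, content_shutdown=0, usage_khours=50.0)
e10s_unchanged_main = CrashCounts(main=100, content=60, content_shutdown=20, usage_khours=50.0)

# Made-up counts: main crashes drop enough to offset the new content crashes.
e10s_reduced_main = CrashCounts(main=60, content=60, content_shutdown=20, usage_khours=50.0)

print(meets_criterion(e10s_unchanged_main, baseline))
print(meets_criterion(e10s_reduced_main, baseline))
```

With an unchanged main rate, the criterion can only hold if content crashes
are almost entirely shutdown crashes.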
If this chart is correct, it appears to say that main crash rates are
mostly unchanged with the adoption of e10s, which is very surprising. It
would also mean that the adoption of e10s is leading to an overall increase
in crash rates, because we haven't decreased the main crash rate but we have
introduced content crashes.
I'm concerned because this is showing very different results from the
official crash rates. For this to be useful for decision-making, I need you
to work with mconley and chutten to make sure we're using the same
definitions and that we've reviewed the data sources. At a minimum we need
to understand and document the discrepancies to avoid confusion. We
probably need to do the work to make sure that the shutdown crashes are
accounted for properly in the dataset you're using, or switch this to a
more appropriate dataset.
If this is intended to be something we keep long-term, we also need to
integrate it with the existing official stability dashboard at
https://telemetry.mozilla.org/crashes/ and decide who is responsible for
maintaining it. Some of these metrics are definitely intended for long-term
use, such as the first-week crash rate and the heavy-user crash rate, since
those are related to OKRs for the year.
Once we have confidence in the baseline metrics about per-user crash rates
and crash distributions, I will set up a meeting with you and the uptime
team to figure out what actions, if any, we need to take next to improve
things. Our focus is supposed to be crashes that affect first-week browsing
and crashes that happen to heavy users, especially repeated ones.
--BDS