TMO Stability Dashboard is Changing
Brendan Colloran
bcolloran at mozilla.com
Mon Mar 6 19:57:07 UTC 2017
"""I don't actually have a proper model for when we're allowed to trust
crash rates. When is the calculated crash rate actually indicative of a
release's health? Open question. And one whose answer changes over time,
with pingSender changing the speed at which we receive crucial inputs."""
"""Can you describe what other work we need to get to a high confidence?
Especially if there is analysis/statistical help you need, Saptarshi
already has a lot of context. I want to make sure that we in relatively
short order *can* answer this in a way that release-drivers can trust to
make critical ship or no-ship decisions."""
Great conversation, important issues. I'm going to repeat a slogan that
I've been saying for years, because recent progress at Mozilla indicates
that doing so has actually been effective ;-)
**Anytime we release code without doing so as a real randomized controlled
experiment, we're doing an uncontrolled experiment on our entire user
population.**
It is no accident that Randomized Controlled Trials are the gold standard
for *nailing down* causal relationships in science. Over the past 18
months or so, Mozilla has gotten way better at using RCTs to gain real
clarity about the impact of changes, but we still don't do it for
everything... the final frontier is doing RCTs for entire builds/releases.
As you all are very well aware, when you are watching things like crash
numbers or start-up times bounce around from day to day and version to
version, it's super hard to get clarity about what is causing the bounces--
there are a ton of confounding variables. At a minimum:
- there is the effect of builds/releases, which is the thing we can
control, and hence the thing whose effect we really actually care about
- there are seasonal and weekly cycles that are pretty well understood, but
which (as anyone who has looked at this stuff knows...) can still be pretty
messy, and which impact different parts of our population differently
(e.g. heavy vs light users)
- there are random shocks to the signal that are caused by some change
in the software ecosystem on the client (like the DLL injection problem
Harald mentioned) or by a change to some prominent website
- there are random shocks to the system that happen because of human
behaviors-- news cycles etc changing browsing habits on short time scales
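To make the confounding concrete, here's a throwaway simulation (all numbers made up) of a daily crash-rate signal where a real build effect is buried under the weekly cycle and random shocks listed above:

```python
import random

random.seed(0)

# Toy model of a daily crash-rate signal (crashes per 1k usage hours):
# a real +0.30 build effect buried under weekly cycles and random shocks.
def observed_rate(day, effect_starts=30, build_effect=0.3):
    baseline = 2.0
    weekly = 0.5 if day % 7 in (5, 6) else 0.0  # weekend usage-mix shift
    shock = random.gauss(0, 0.4)                # ecosystem / news-cycle noise
    build = build_effect if day >= effect_starts else 0.0
    return baseline + weekly + shock + build

series = [observed_rate(d) for d in range(60)]
before = sum(series[:30]) / 30
after = sum(series[30:]) / 30
# The naive before/after estimate wanders around the true +0.30 effect,
# and nothing in the observed series tells you how far off it is.
print(f"naive estimate of build effect: {after - before:+.2f} (true: +0.30)")
```

Rerun with different seeds and the naive estimate bounces around; without the exogenous variables you can't tell the build effect from the shocks.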
And so I say this now as a statistician who believes in the value of good
analytical/statistical work: if you *really* want to get to high confidence
about the health of a release, the *only* way to do so is to reframe
releases as RCTs. There is simply too much noise in our timeseries signals,
and we don't have access to enough of the exogenous explanatory variables
to be able to control for that noise. This is not an issue where you can
just pour more stats on it and it will go away; it's irreducible, and it's
why all of science uses RCTs whenever possible (and also why tons of
software companies do...).
Luckily, we control our whole system, so for us it _is_ possible. It would
require some thought about the details of the implementation, and it would
surely take a ton of work from release management, but if we rolled versions
out as A/B tests with randomly assigned update vs non-update, we could know
*for sure* what impact new code has on crashes, startup time, engagement,
retention, browser perf, etc etc. [Note: throttling releases is good for
smoke testing this stuff, but it's not the same as real randomization]
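The analysis side of such a rollout is pleasantly simple. A minimal sketch, with made-up per-arm numbers: randomize profiles into an update arm and a holdback arm, then compare crash rates directly with a Poisson-rate z-test.

```python
import math

# Hypothetical two-arm rollout: profiles randomly assigned to get the new
# build ("treatment") or stay on the old one ("control"). All numbers are
# made up for illustration.
arms = {
    "control":   {"crashes": 480, "usage_khours": 100.0},
    "treatment": {"crashes": 430, "usage_khours": 100.0},
}

def crash_rate(arm):
    return arm["crashes"] / arm["usage_khours"]  # crashes per 1k usage hours

def rate_diff_z(a, b):
    """z-score for the difference of two Poisson rates."""
    se = math.sqrt(a["crashes"] / a["usage_khours"] ** 2
                   + b["crashes"] / b["usage_khours"] ** 2)
    return (crash_rate(b) - crash_rate(a)) / se

z = rate_diff_z(arms["control"], arms["treatment"])
diff = crash_rate(arms["treatment"]) - crash_rate(arms["control"])
print(f"treatment - control: {diff:+.2f} crashes/khr (z = {z:.2f})")
```

Because randomization balances the seasonal and ecosystem confounders across arms, the z-score measures the build effect itself; the same before/after difference on a dated timeseries could just as easily be a weekend or a news cycle.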
Anyway, I'm sure there are a million reasons why people won't want to make
a big investment in rethinking giant portions of how release management
works, so in the meantime I'll just leave this suggestion here to percolate
in your brains. Maybe in another two years we'll be sick enough of all this
uncertainty to make that investment; when we get to that point, ping me,
joy, ilana, etc and we can work on the details ;-)
-bc
On Fri, Mar 3, 2017 at 1:52 PM, Benjamin Smedberg <benjamin at smedbergs.us>
wrote:
> Yes, I'm aware of the fundamental difficulty and that's one of the reasons
> we're prioritizing pingsender.
>
> Can you describe what other work we need to get to a high confidence?
> Especially if there is analysis/statistical help you need, Saptarshi
> already has a lot of context. I want to make sure that we in relatively
> short order *can* answer this in a way that release-drivers can trust to
> make critical ship or no-ship decisions.
>
> -BDS
>
>
> On Wed, Mar 1, 2017 at 3:18 PM, Chris Hutten-Czapski <chutten at mozilla.com>
> wrote:
>
>> "How quickly can we get from [a release] to a reliable crash rate?"
>>
>> Well, if you believe my analysis[1] (which you may, if you'd like) the
>> answer is "at least a day out, but probably best to wait at least a day
>> longer than that"
>>
>> But aside from shameless self-promotion, there's the real concern that I
>> don't actually have a proper model for when we're allowed to trust crash
>> rates. When is the calculated crash rate actually indicative of a release's
>> health? Open question. And one whose answer changes over time, with
>> pingSender changing the speed at which we receive crucial inputs.
>>
>> :chutten
>>
>> [1]: https://chuttenblog.wordpress.com/2017/02/09/data-science-is-hard-client-delays-for-crash-pings/
>>
>> On Wed, Mar 1, 2017 at 3:02 PM, Benjamin Smedberg <benjamin at smedbergs.us>
>> wrote:
>>
>>> Think of the per-build data like this:
>>>
>>> * our crash rate for FF53 b2 is too goddamn high!
>>> * We pulled a topcrash list and found a regression bug
>>> * We fixed it and uplifted it to FF53b4
>>> * Release drivers want to make sure that FF53b4 has the crash rate
>>> reduction that we expected.
>>>
>>> In this case (which is very common on all the prerelease channels),
>>> showing by date and not by build smooths out the signal we care about most,
>>> which is the difference per-build. So what release drivers care about most
>>> is: how quickly can we get from releasing beta8 (or RC1) to a reliable
>>> crash rate that says we're clear to ship this to release?
>>>
>>> --BDS
>>>
>>>
>>> On Wed, Mar 1, 2017 at 2:45 PM, Chris Hutten-Czapski <
>>> chutten at mozilla.com> wrote:
>>>
>>>> Thank you for correcting my mistake about build_id. Apparently when I
>>>> was looking for readable versions I took the lack of a bX suffix on beta
>>>> builds to mean there was no build data, but that was wrong then and wrong
>>>> now.
>>>>
>>>> So yeah. No problemo, apparently. (Well, except my misapprehension,
>>>> which should now be rectified.)
>>>>
>>>> Anyway, back to wishlisting... The request I can most easily understand
>>>> is for the existing display to instead use the current and N previous
>>>> _builds'_ crash counts and kuh to form a "channel health trend line". N may
>>>> be some fixed, small number per channel (maybe 2, 6, 14, 21), or some
>>>> function of usage (take all successive previous builds until they contain >
>>>> Y% of that activity_date's kuh).
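The trend line chutten describes could be sketched roughly like this (made-up per-build aggregates; `kuh` read as usage kilo-hours): for each activity date, pool the current and N previous builds' crash counts and usage hours, then report one rate.

```python
# Per-build aggregates for a single activity date (made-up numbers):
# (build_id, crashes, usage_khours)
builds = [
    ("20170301", 40, 10.0),
    ("20170302", 35, 12.0),
    ("20170303", 50, 11.0),
    ("20170304", 30, 13.0),
]

def pooled_rate(builds, n=2):
    """Crashes per 1k usage hours pooled over the current and n previous builds."""
    newest = sorted(builds)[-(n + 1):]  # build_ids sort chronologically
    crashes = sum(b[1] for b in newest)
    kuh = sum(b[2] for b in newest)
    return crashes / kuh

print(f"{pooled_rate(builds, n=2):.2f} crashes per 1k usage hours")
```

The usage-based variant (keep adding older builds until they cover > Y% of that date's kuh) would just swap the fixed `-(n + 1)` slice for a loop that accumulates kuh until the threshold is met.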
>>>>
>>>> This sounds fun and useful and is something I feel I understand well
>>>> enough to file an Issue for and begin working on.
>>>>
>>>> The per-build case on the other hand I feel I understand less well. Do
>>>> the pseudo-timeseries plots as seen on sql.tmo work as well on channels
>>>> with fewer builds? If instead we display the crash rate for one build as
>>>> one number (per type of crash), does it need to be plotted at all? Is
>>>> Harald's view of 52b5 actually a problem? The best display for
>>>> understanding the health of a new release is...
>>>>
>>>> :chutten
>>>>
>>>> On Wed, Mar 1, 2017 at 2:08 PM, Benjamin Smedberg <
>>>> benjamin at smedbergs.us> wrote:
>>>>
>>>>> On Wed, Mar 1, 2017 at 1:40 PM, Chris Hutten-Czapski <
>>>>> chutten at mozilla.com> wrote:
>>>>>
>>>>>> So, for each channel, having a line for each of the current and N
>>>>>> previous versions' crash rates would be helpful? (where N is small... say 2)
>>>>>>
>>>>>
>>>>> It's closer. If it's possible to have a single line that is the
>>>>> aggregate, that might smooth out adoption noise.
>>>>>
>>>>>
>>>>>>
>>>>>> crash_aggregates does its aggregation by version, not build, so if
>>>>>> crash-rates-per-build is necessary this will require quite a rewrite. (I'll
>>>>>> have to join main_summary and crash_summary on dates) If
>>>>>> crash-rates-per-version is sufficient (or is worth the effort of
>>>>>> exploring), then that can be adopted within the existing architecture.
>>>>>>
>>>>>
>>>>> This is incorrect. build_id is one of the dimensions, specifically
>>>>> because we needed it for betas with e10s. Example from stmo:
>>>>>
>>>>> activity_date: 24/08/16
>>>>> dimensions: {"build_id":"20150804030204","os_name":"Windows_NT","os_version":"6.2","country":"NP","application":"Firefox","architecture":"x86-64","build_version":"42.0a1","channel":"nightly","e10s_enabled":"False"}
>>>>> stats: {"usage_hours_squared":0.01585779320987654,"main_crashes":0,"content_shutdown_crashes":0,"usage_hours":0.15416666666666667,"content_shutdown_crashes_squared":0,"gmplugin_crashes":0,"content_crashes":0,"content_crashes_squared":0,"plugin_crashes":0,"plugin_crashes_squared":0,"ping_count":3,"gmplugin_crashes_squared":0,"main_crashes_squared":0}
>>>>> submission_date: 25/08/16
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> One thing to consider may be that crash-rates-per-build will look
>>>>>> messy on anything pre-Beta.
>>>>>>
>>>>>
>>>>> Could be, but here's an example way to present this for nightly:
http://benjamin.smedbergs.us/blog/2013-04-22/graph-of-the-day-empty-minidump-crashes-per-user/
>>>>>
>>>>> Here's also a draft of something a while back in STMO:
>>>>> https://sql.telemetry.mozilla.org/queries/192/source#309
>>>>>
>>>>> --BDS
>>>>>
>>>>>
>>>>
>>>
>>
>
> _______________________________________________
> fhr-dev mailing list
> fhr-dev at mozilla.org
> https://mail.mozilla.org/listinfo/fhr-dev
>
>