TMO Stability Dashboard is Changing
Robert Strong
rstrong at mozilla.com
Wed Mar 8 19:44:10 UTC 2017
Interesting concept, though this is the first I have heard of it, which is
a tad strange since I work on the client app update code. ;)
Since you've been advocating for this for some time, do you have any work or
write-ups on how this would be accomplished?
Thanks,
Robert
On Wed, Mar 8, 2017 at 11:40 AM, Saptarshi Guha <sguha at mozilla.com> wrote:
> I've been saying something similar for a long time now. I advocate for a UT-
> based rollout that can be controlled by variables in UT, unifying the
> experiments and release-rollout thought processes.
>
> From an email I sent elsewhere (appropriate here)
>
> I also advocate (strongly!) for doing releases based entirely on UT data. A
> release (and even an experiment) can be released to profiles meeting some
> criteria, e.g. channel: Release, geo: US, locale: en-US, memory: low-end, *and*
> sampleid1 in 1, sampleid2 in 41.
> Here sampleid1 and sampleid2 are independent versions of sample_id (e.g.
> crc32(salt(client_id)) %% 100). That way we can release to as few as
> 60,000 profiles. Moreover we can keep a file with this release history
> (what filters, what sampleids chosen etc) and keep it in the UT json
> itself.
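The independent sample-id scheme described above could be sketched roughly as follows. This is a minimal illustration, not an actual implementation: the salt strings, the choice of `zlib.crc32`, and the target buckets are all assumptions.

```python
import zlib


def sample_id(client_id: str, salt: str) -> int:
    """Deterministic bucket in [0, 100) from a salted client_id (crc32 % 100)."""
    return zlib.crc32((salt + client_id).encode("utf-8")) % 100


def in_rollout(client_id: str) -> bool:
    # Two differently salted hashes give two (approximately) independent
    # 1%-granularity partitions of the population; requiring a specific bucket
    # in each narrows the cohort to roughly 1 in 10,000 profiles.
    return (sample_id(client_id, "salt1") == 1
            and sample_id(client_id, "salt2") == 41)
```

At roughly one in ten thousand profiles, a population in the hundreds of millions yields cohorts on the order of the 60,000 figure mentioned above, and because the assignment is a pure function of client_id and the salts, the cohort definition can be recorded compactly in the release-history file.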
> The quality of the release can be viewed in two ways: effect on existing
> users (before vs after ) and test vs control.
>
> Cheers
> Saptarshi
>
>
>
>
> On Mon, Mar 6, 2017 at 11:57 AM -0800, "Brendan Colloran" <
> bcolloran at mozilla.com> wrote:
>
> """I don't actually have a proper model for when we're allowed to trust
>> crash rates. When is the calculated crash rate actually indicative of a
>> release's health? Open question. And one whose answer changes over time,
>> with pingSender changing the speed at which we receive crucial inputs."""
>>
>> """Can you describe what other work we need to get to a high confidence?
>> Especially if there is analysis/statistical help you need, Saptarshi
>> already has a lot of context. I want to make sure that we in relatively
>> short order *can* answer this in a way that release-drivers can trust to
>> make critical ship or no-ship decisions."""
>>
>> Great conversation, important issues. I'm going to repeat a slogan that
>> I've been saying for years, because recent progress at Mozilla indicates
>> that doing so has actually been effective ;-)
>>
>> **Anytime we release code without doing so as a real randomized controlled
>> experiment, we're doing an uncontrolled experiment on our entire user
>> population**
>>
>> It is not a mistake that Randomized Controlled Trials are the gold
>> standard for *nailing down* causal relationships in Science. Over the past
>> I dunno 18 months-ish, Mozilla has gotten way better at using RCTs to gain
>> real clarity about the impact of changes, but we still don't do it for
>> everything... the final frontier is doing RCTs for entire builds/releases.
>>
>> As you all are very well aware, when you are watching things like crash
>> numbers or start-up times bounce around from day to day and version to
>> version, it's super hard to get clarity about what is causing the bounces--
>> there are a ton of confounding variables. At a minimum:
>> - there is the effect of builds/releases, which is the thing we can
>> control, and hence the thing whose effect we really actually care about
>> - there are seasonal and weekly cycles that are pretty well understood,
>> but which (as anyone who has looked at this stuff knows...) can still be
>> pretty messy, and which impact different parts of our population differently
>> (e.g. heavy vs light users)
>> - there are random shocks to the signal that are caused by some
>> change in the software ecosystem on the client (like the DLL injection
>> problem Harald mentioned) or by a change to some prominent website
>> - there are random shocks to the system that happen because of human
>> behaviors-- news cycles etc changing browsing habits on short time scales
>>
>> And so I say this now as a statistician who believes in the value of good
>> analytical/statistical work: if you *really* want to get to high confidence
>> about the health of a release, the *only* way to do so is to reframe
>> releases as RCTs. There is simply too much noise in our timeseries signals,
>> and we don't have access to enough of the exogenous explanatory variables
>> to be able to control for that noise. This is not an issue where you can
>> just pour more stats on it and it will go away; it's irreducible, and it's
>> why all of Science uses RCTs whenever possible (and also why tons of
>> software companies do...).
>>
>> Luckily, we control our whole system, so for us it _is_ possible. It
>> would require some thought about the details of the implementation, and it
>> would surely take a ton of work from release management, but if we rolled
>> versions out as A/B tests with randomly assigned update vs non-update, we
>> could know *for sure* what impact new code has on crashes, startup time,
>> engagement, retention, browser perf, etc etc. [Note: throttling releases is
>> good for smoke testing this stuff, but it's not the same as real
>> randomization]
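The randomized update assignment described above could be sketched like this. The experiment name, the hash choice, and the 50/50 split are illustrative assumptions; the point is only that a salted hash of client_id gives a stable, stateless randomization.

```python
import hashlib


def update_arm(client_id: str, experiment: str = "fx53-update-rct") -> str:
    """Sketch: deterministically assign a profile to the 'update' or
    'hold-back' arm of a release RCT. Hashing client_id with a
    per-experiment salt yields a stable 50/50 split with no server-side
    state, so the same profile always lands in the same arm."""
    digest = hashlib.sha256(f"{experiment}:{client_id}".encode()).digest()
    return "update" if digest[0] % 2 == 0 else "hold-back"
```

Because assignment is deterministic per experiment, re-running the analysis later reproduces exactly the same treatment and control groups, which is what makes before/after and test-vs-control comparisons trustworthy.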
>>
>> Anyway, I'm sure there are a million reasons why people won't want to
>> make a big investment in rethinking giant portions of how release
>> management works, so in the meantime I'll just leave this suggestion here
>> to percolate in your brains. Maybe in another two years we'll be sick
>> enough of all this uncertainty to make that investment; when we get to that
>> point, ping me, joy, ilana, etc and we can work on the details ;-)
>>
>> -bc
>>
>>
>> On Fri, Mar 3, 2017 at 1:52 PM, Benjamin Smedberg <benjamin at smedbergs.us>
>> wrote:
>>
>>> Yes, I'm aware of the fundamental difficulty and that's one of the
>>> reasons we're prioritizing pingsender.
>>>
>>> Can you describe what other work we need to get to a high confidence?
>>> Especially if there is analysis/statistical help you need, Saptarshi
>>> already has a lot of context. I want to make sure that we in relatively
>>> short order *can* answer this in a way that release-drivers can trust to
>>> make critical ship or no-ship decisions.
>>>
>>> -BDS
>>>
>>>
>>> On Wed, Mar 1, 2017 at 3:18 PM, Chris Hutten-Czapski <
>>> chutten at mozilla.com> wrote:
>>>
>>>> "How quickly can we get from [a release] to a reliable crash rate?"
>>>>
>>>> Well, if you believe my analysis[1] (which you may, if you'd like) the
>>>> answer is "at least a day out, but probably best to wait at least a day
>>>> longer than that"
>>>>
>>>> But aside from shameless self-promotion, there's the real concern that
>>>> I don't actually have a proper model for when we're allowed to trust crash
>>>> rates. When is the calculated crash rate actually indicative of a release's
>>>> health? Open question. And one whose answer changes over time, with
>>>> pingSender changing the speed at which we receive crucial inputs.
>>>>
>>>> :chutten
>>>>
>>>> [1]: https://chuttenblog.wordpress.com/2017/02/09/data-science-is-hard-client-delays-for-crash-pings/
>>>>
>>>> On Wed, Mar 1, 2017 at 3:02 PM, Benjamin Smedberg <
>>>> benjamin at smedbergs.us> wrote:
>>>>
>>>>> Think of the per-build data like this:
>>>>>
>>>>> * our crash rate for FF53 b2 is too goddamn high!
>>>>> * We pulled a topcrash list and found a regression bug
>>>>> * We fixed it and uplifted it to FF53b4
>>>>> * Release drivers want to make sure that FF53b4 has the crash rate
>>>>> reduction that we expected.
>>>>>
>>>>> In this case (which is very common on all the prerelease channels),
>>>>> showing by date and not by build smooths out the signal we care about most,
>>>>> which is the difference per-build. So what release drivers care about most
>>>>> is: how quickly can we get from releasing beta8 (or RC1) to a reliable
>>>>> crash rate that says we're clear to ship this to release?
>>>>>
>>>>> --BDS
>>>>>
>>>>>
>>>>> On Wed, Mar 1, 2017 at 2:45 PM, Chris Hutten-Czapski <
>>>>> chutten at mozilla.com> wrote:
>>>>>
>>>>>> Thank you for correcting my mistake about build_id. Apparently when I
>>>>>> was looking for readable versions I took the lack of a bX suffix on beta
>>>>>> builds to mean there was no build data, but that was wrong then and wrong
>>>>>> now.
>>>>>>
>>>>>> So yeah. No problemo, apparently. (Well, except my misapprehension,
>>>>>> which should now be rectified.)
>>>>>>
>>>>>> Anyway, back to wishlisting... The request I can most easily
>>>>>> understand is for the existing display to instead use the current and N
>>>>>> previous _builds'_ crash counts and kuh (kilo-usage-hours) to form a "channel health trend
>>>>>> line". N may be some fixed, small number per channel (maybe 2, 6, 14, 21),
>>>>>> or some function of usage (take all successive previous builds until they
>>>>>> contain > Y% of that activity_date's kuh).
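The usage-based variant of choosing N could be sketched as follows. The 80% threshold, the field layout, and the function name are assumptions for illustration only.

```python
def builds_for_trend(builds, total_kuh, threshold_pct=80.0):
    """Select the current build plus successive previous builds until their
    combined usage covers threshold_pct of the activity date's total kuh
    (kilo-usage-hours). `builds` is newest-first: (build_id, kuh) pairs."""
    selected, covered = [], 0.0
    for build_id, kuh in builds:
        selected.append(build_id)
        covered += kuh
        # Stop as soon as the selected builds account for enough usage.
        if covered / total_kuh * 100.0 > threshold_pct:
            break
    return selected


# Example: the three newest builds cover 95% of usage, so the oldest is dropped.
builds = [("20170301", 50.0), ("20170228", 30.0),
          ("20170227", 15.0), ("20170226", 5.0)]
print(builds_for_trend(builds, total_kuh=100.0))
```

This adapts N per channel automatically: release (few builds, heavy usage each) selects a short window, while nightly (many small builds) selects a longer one.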
>>>>>>
>>>>>> This sounds fun and useful and is something I feel I understand well
>>>>>> enough to file an Issue for and begin working on.
>>>>>>
>>>>>> The per-build case on the other hand I feel I understand less well.
>>>>>> Do the pseudo-timeseries plots as seen on sql.tmo work as well on channels
>>>>>> with fewer builds? If instead we display the crash rate for one build as
>>>>>> one number (per type of crash), does it need to be plotted at all? Is
>>>>>> Harald's view of 52b5 actually a problem? The best display for
>>>>>> understanding the health of a new release is...
>>>>>>
>>>>>> :chutten
>>>>>>
>>>>>> On Wed, Mar 1, 2017 at 2:08 PM, Benjamin Smedberg <
>>>>>> benjamin at smedbergs.us> wrote:
>>>>>>
>>>>>>> On Wed, Mar 1, 2017 at 1:40 PM, Chris Hutten-Czapski <
>>>>>>> chutten at mozilla.com> wrote:
>>>>>>>
>>>>>>>> So, for each channel, having a line for each of the current and N
>>>>>>>> previous versions' crash rates would be helpful? (where N is small... say 2)
>>>>>>>>
>>>>>>>
>>>>>>> It's closer. If it's possible to have a single line that is the
>>>>>>> aggregate, that might smooth out adoption noise.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> crash_aggregates does its aggregation by version, not build, so if
>>>>>>>> crash-rates-per-build is necessary this will require quite a rewrite. (I'll
>>>>>>>> have to join main_summary and crash_summary on dates) If
>>>>>>>> crash-rates-per-version is sufficient (or is worth the effort of
>>>>>>>> exploring), then that can be adopted within the existing architecture.
>>>>>>>>
>>>>>>>
>>>>>>> This is incorrect. build_id is one of the dimensions, specifically
>>>>>>> because we needed it for betas with e10s. Example from stmo:
>>>>>>>
>>>>>>> activity_date: 24/08/16
>>>>>>> dimensions: {"build_id":"20150804030204","os_name":"Windows_NT","os_version":"6.2","country":"NP","application":"Firefox","architecture":"x86-64","build_version":"42.0a1","channel":"nightly","e10s_enabled":"False"}
>>>>>>> stats: {"usage_hours_squared":0.01585779320987654,"main_crashes":0,"content_shutdown_crashes":0,"usage_hours":0.15416666666666667,"content_shutdown_crashes_squared":0,"gmplugin_crashes":0,"content_crashes":0,"content_crashes_squared":0,"plugin_crashes":0,"plugin_crashes_squared":0,"ping_count":3,"gmplugin_crashes_squared":0,"main_crashes_squared":0}
>>>>>>> submission_date: 25/08/16
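For illustration, a per-build crash rate could be derived from rows like the one above roughly as follows. The per-1,000-usage-hours convention and the direct field access are assumptions based on the example row, not a description of the actual pipeline.

```python
# Sketch: compute a main-process crash rate (crashes per 1,000 usage hours)
# from one crash_aggregates stats blob. Field names follow the example row.
example_stats = {
    "main_crashes": 0,
    "usage_hours": 0.15416666666666667,
    "ping_count": 3,
}


def crashes_per_khour(stats):
    # Guard against rows with no recorded usage before dividing.
    hours = stats["usage_hours"]
    return 0.0 if hours == 0 else stats["main_crashes"] / hours * 1000.0


print(crashes_per_khour(example_stats))  # 0.0: no main crashes in this row
```

In practice one would sum crashes and usage hours across all rows sharing a build_id before dividing, rather than computing a rate per row; the `*_squared` fields in the blob exist to support variance estimates on those aggregates.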
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> One thing to consider may be that crash-rates-per-build will look
>>>>>>>> messy on anything pre-Beta.
>>>>>>>>
>>>>>>>
>>>>>>> Could be, but here's an example way to present this for nightly:
>>>>>>> http://benjamin.smedbergs.us/blog/2013-04-22/graph-of-the-day-empty-minidump-crashes-per-user/
>>>>>>>
>>>>>>> Here's also a draft of something a while back in STMO:
>>>>>>> https://sql.telemetry.mozilla.org/queries/192/source#309
>>>>>>>
>>>>>>> --BDS
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>> _______________________________________________
>>> fhr-dev mailing list
>>> fhr-dev at mozilla.org
>>> https://mail.mozilla.org/listinfo/fhr-dev
>>>
>>>
>>