TMO Stability Dashboard is Changing

Benjamin Smedberg benjamin at smedbergs.us
Fri Mar 3 21:52:26 UTC 2017


Yes, I'm aware of the fundamental difficulty and that's one of the reasons
we're prioritizing pingsender.

Can you describe what other work we need to get to a high confidence?
Especially if there is analysis/statistical help you need, Saptarshi
already has a lot of context. I want to make sure that, in relatively
short order, we *can* answer this in a way that release-drivers can trust to
make critical ship or no-ship decisions.

-BDS


On Wed, Mar 1, 2017 at 3:18 PM, Chris Hutten-Czapski <chutten at mozilla.com>
wrote:

> "How quickly can we get from [a release] to a reliable crash rate?"
>
> Well, if you believe my analysis[1] (which you may, if you'd like) the
> answer is "at least a day out, but probably best to wait at least a day
> longer than that"
>
> But aside from shameless self-promotion, there's the real concern that I
> don't actually have a proper model for when we're allowed to trust crash
> rates. When is the calculated crash rate actually indicative of a release's
> health? Open question. And one whose answer changes over time, with
> pingSender changing the speed at which we receive crucial inputs.
>
> :chutten
>
> [1]: https://chuttenblog.wordpress.com/2017/02/09/data-science-is-hard-client-delays-for-crash-pings/
>
> On Wed, Mar 1, 2017 at 3:02 PM, Benjamin Smedberg <benjamin at smedbergs.us>
> wrote:
>
>> Think of the per-build data like this:
>>
>> * our crash rate for FF53 b2 is too goddamn high!
>> * We pulled a topcrash list and found a regression bug
>> * We fixed it and uplifted it to FF53b4
>> * Release drivers want to make sure that FF53b4 has the crash rate
>> reduction that we expected.
>>
>> In this case (which is very common on all the prerelease channels),
>> showing by date and not by build smooths out the signal we care about most,
>> which is the difference per-build. So what release drivers care about most
>> is: how quickly can we get from releasing beta8 (or RC1) to a reliable
>> crash rate that says we're clear to ship this to release?
>>
>> --BDS
>>
>>
>> On Wed, Mar 1, 2017 at 2:45 PM, Chris Hutten-Czapski
>> <chutten at mozilla.com> wrote:
>>
>>> Thank you for correcting my mistake about build_id. Apparently when I
>>> was looking for readable versions I took the lack of a bX suffix on beta
>>> builds to mean there was no build data, but that was wrong then and wrong
>>> now.
>>>
>>> So yeah. No problemo, apparently. (Well, except my misapprehension,
>>> which should now be rectified.)
>>>
>>> Anyway, back to wishlisting... The request I can most easily understand
>>> is for the existing display to instead use the current and N previous
>>> _builds'_ crash counts and kuh to form a "channel health trend line". N may
>>> be some fixed, small number per channel (maybe 2, 6, 14, 21), or some
>>> function of usage (take all successive previous builds until they contain >
>>> Y% of that activity_date's kuh).
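
A minimal sketch of the two ways of choosing N described above, assuming each row is a hypothetical (build_id, crashes, kuh) tuple for one channel on one activity_date, with kuh read here as thousands of usage hours. The function names are illustrative and are not part of any existing dashboard code.

```python
from typing import List, Optional, Tuple

# Hypothetical row shape: (build_id, crashes, kuh) for one channel and date.
BuildRow = Tuple[str, int, float]


def newest_first(rows: List[BuildRow]) -> List[BuildRow]:
    # Build IDs are fixed-width timestamps, so string order is build order.
    return sorted(rows, key=lambda r: r[0], reverse=True)


def select_fixed(rows: List[BuildRow], n: int) -> List[BuildRow]:
    """Current build plus the n previous builds."""
    return newest_first(rows)[: n + 1]


def select_by_usage(rows: List[BuildRow], y: float) -> List[BuildRow]:
    """Take successive builds, newest first, until they hold more than
    fraction y of that date's total kuh."""
    ordered = newest_first(rows)
    total_kuh = sum(r[2] for r in ordered)
    chosen: List[BuildRow] = []
    running = 0.0
    for row in ordered:
        chosen.append(row)
        running += row[2]
        if running > y * total_kuh:
            break
    return chosen


def aggregate_rate(rows: List[BuildRow]) -> Optional[float]:
    """Crashes per kuh across the selected builds, or None without usage."""
    crashes = sum(r[1] for r in rows)
    kuh = sum(r[2] for r in rows)
    return crashes / kuh if kuh > 0 else None
```

Summing crashes and kuh before dividing weights each build by its usage, so a build that only a few clients have picked up yet does not swing the trend line the way an average of per-build rates would.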
>>>
>>> This sounds fun and useful and is something I feel I understand well
>>> enough to file an Issue for and begin working on.
>>>
>>> The per-build case on the other hand I feel I understand less well. Do
>>> the pseudo-timeseries plots as seen on sql.tmo work as well on channels
>>> with fewer builds? If instead we display the crash rate for one build as
>>> one number (per type of crash), does it need to be plotted at all? Is
>>> Harald's view of 52b5 actually a problem? The best display for
>>> understanding the health of a new release is...
>>>
>>> :chutten
>>>
>>> On Wed, Mar 1, 2017 at 2:08 PM, Benjamin Smedberg
>>> <benjamin at smedbergs.us> wrote:
>>>
>>>> On Wed, Mar 1, 2017 at 1:40 PM, Chris Hutten-Czapski <
>>>> chutten at mozilla.com> wrote:
>>>>
>>>>> So, for each channel, having a line for each of the current and N
>>>>> previous versions' crash rates would be helpful? (where N is small... say 2)
>>>>>
>>>>
>>>> It's closer. If it's possible to have a single line that is the
>>>> aggregate, that might smooth out adoption noise.
>>>>
>>>>
>>>>>
>>>>> crash_aggregates does its aggregation by version, not build, so if
>>>>> crash-rates-per-build is necessary this will require quite a rewrite. (I'll
>>>>> have to join main_summary and crash_summary on dates) If
>>>>> crash-rates-per-version is sufficient (or is worth the effort of
>>>>> exploring), then that can be adopted within the existing architecture.
>>>>>
>>>>
>>>> This is incorrect. build_id is one of the dimensions, specifically
>>>> because we needed it for betas with e10s. Example from stmo:
>>>>
>>>> activity_date:   24/08/16
>>>> dimensions:
>>>>   {"build_id":"20150804030204","os_name":"Windows_NT",
>>>>    "os_version":"6.2","country":"NP","application":"Firefox",
>>>>    "architecture":"x86-64","build_version":"42.0a1",
>>>>    "channel":"nightly","e10s_enabled":"False"}
>>>> stats:
>>>>   {"usage_hours_squared":0.01585779320987654,"main_crashes":0,
>>>>    "content_shutdown_crashes":0,"usage_hours":0.15416666666666667,
>>>>    "content_shutdown_crashes_squared":0,"gmplugin_crashes":0,
>>>>    "content_crashes":0,"content_crashes_squared":0,
>>>>    "plugin_crashes":0,"plugin_crashes_squared":0,"ping_count":3,
>>>>    "gmplugin_crashes_squared":0,"main_crashes_squared":0}
>>>> submission_date: 25/08/16
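
As a rough illustration of why having build_id among the dimensions is enough for per-build rates, the sketch below groups rows shaped like the example above by build_id and divides crashes by thousands of usage hours. The function name, the (dimensions, stats) pair layout, and the choice of main_crashes as the default numerator are assumptions made for illustration, not the dashboard's actual code.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple


def crash_rate_by_build(
    rows: Iterable[Tuple[str, str]],
    crash_key: str = "main_crashes",
) -> Dict[str, Optional[float]]:
    """Crashes per thousand usage hours, keyed by build_id.

    Each row is a (dimensions_json, stats_json) pair of JSON strings.
    """
    crashes: Dict[str, float] = defaultdict(float)
    hours: Dict[str, float] = defaultdict(float)
    for dimensions_json, stats_json in rows:
        build_id = json.loads(dimensions_json)["build_id"]
        stats = json.loads(stats_json)
        crashes[build_id] += stats[crash_key]
        hours[build_id] += stats["usage_hours"]
    # A build with no recorded usage has no meaningful rate.
    return {
        build_id: (crashes[build_id] / (hours[build_id] / 1000.0)
                   if hours[build_id] > 0 else None)
        for build_id in hours
    }
```

Passing a different crash_key, such as "content_crashes", gives the same per-build view for another crash type.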
>>>>
>>>>
>>>>
>>>>
>>>>>
>>>>> One thing to consider may be that crash-rates-per-build will look
>>>>> messy on anything pre-Beta.
>>>>>
>>>>
>>>> Could be, but here's an example way to present this for nightly:
>>>> http://benjamin.smedbergs.us/blog/2013-04-22/graph-of-the-day-empty-minidump-crashes-per-user/
>>>>
>>>> Here's also a draft of something a while back in STMO:
>>>> https://sql.telemetry.mozilla.org/queries/192/source#309
>>>>
>>>> --BDS
>>>>
>>>>
>>>
>>
>