TMO Stability Dashboard is Changing

Chris Hutten-Czapski chutten at mozilla.com
Wed Mar 1 20:18:41 UTC 2017


"How quickly can we get from [a release] to a reliable crash rate?"

Well, if you believe my analysis[1] (which you may, if you'd like), the
answer is "at least a day out, but probably best to wait at least a day
longer than that."

But aside from shameless self-promotion, there's the real concern that I
don't actually have a proper model for when we're allowed to trust crash
rates. When is the calculated crash rate actually indicative of a release's
health? Open question. And one whose answer changes over time, with
pingSender changing the speed at which we receive crucial inputs.

:chutten

[1]:
https://chuttenblog.wordpress.com/2017/02/09/data-science-is-hard-client-delays-for-crash-pings/

On Wed, Mar 1, 2017 at 3:02 PM, Benjamin Smedberg <benjamin at smedbergs.us>
wrote:

> Think of the per-build data like this:
>
> * Our crash rate for FF53b2 is too goddamn high!
> * We pulled a topcrash list and found a regression bug
> * We fixed it and uplifted it to FF53b4
> * Release drivers want to make sure that FF53b4 has the crash rate
> reduction that we expected.
>
> In this case (which is very common on all the prerelease channels),
> showing by date and not by build smooths out the signal we care about most,
> which is the difference per-build. So what release drivers care about most
> is: how quickly can we get from releasing beta8 (or RC1) to a reliable
> crash rate that says we're clear to ship this to release?
>
> --BDS
>
>
> On Wed, Mar 1, 2017 at 2:45 PM, Chris Hutten-Czapski <chutten at mozilla.com>
> wrote:
>
>> Thank you for correcting my mistake about build_id. Apparently when I was
>> looking for readable versions I took the lack of a bX suffix on beta builds
>> to mean there was no build data, but that was wrong then and wrong now.
>>
>> So yeah. No problemo, apparently. (Well, except my misapprehension, which
>> should now be rectified.)
>>
>> Anyway, back to wishlisting... The request I can most easily understand
>> is for the existing display to instead use the current and N previous
>> _builds'_ crash counts and kuh to form a "channel health trend line". N may
>> be some fixed, small number per channel (maybe 2, 6, 14, 21), or some
>> function of usage (take all successive previous builds until they contain
>> more than Y% of that activity_date's kuh).
>>
>> This sounds fun and useful and is something I feel I understand well
>> enough to file an Issue for and begin working on.
>>
>> The per-build case on the other hand I feel I understand less well. Do
>> the pseudo-timeseries plots as seen on sql.tmo work as well on channels
>> with fewer builds? If instead we display the crash rate for one build as
>> one number (per type of crash), does it need to be plotted at all? Is
>> Harald's view of 52b5 actually a problem? The best display for
>> understanding the health of a new release is...
>>
>> :chutten
>>
>> On Wed, Mar 1, 2017 at 2:08 PM, Benjamin Smedberg <benjamin at smedbergs.us>
>> wrote:
>>
>>> On Wed, Mar 1, 2017 at 1:40 PM, Chris Hutten-Czapski <
>>> chutten at mozilla.com> wrote:
>>>
>>>> So, for each channel, having a line for each of the current and N
>>>> previous versions' crash rates would be helpful? (where N is small... say 2)
>>>>
>>>
>>> It's closer. If it's possible to have a single line that is the
>>> aggregate, that might smooth out adoption noise.
>>>
>>>
>>>>
>>>> crash_aggregates does its aggregation by version, not build, so if
>>>> crash-rates-per-build is necessary this will require quite a rewrite. (I'll
>>>> have to join main_summary and crash_summary on dates) If
>>>> crash-rates-per-version is sufficient (or is worth the effort of
>>>> exploring), then that can be adopted within the existing architecture.
>>>>
>>>
>>> This is incorrect. build_id is one of the dimensions, specifically
>>> because we needed it for betas with e10s. Example from stmo:
>>>
>>> activity_date: 24/08/16
>>> dimensions: {"build_id":"20150804030204","os_name":"Windows_NT",
>>>   "os_version":"6.2","country":"NP","application":"Firefox",
>>>   "architecture":"x86-64","build_version":"42.0a1",
>>>   "channel":"nightly","e10s_enabled":"False"}
>>> stats: {"usage_hours_squared":0.01585779320987654,"main_crashes":0,
>>>   "content_shutdown_crashes":0,"usage_hours":0.15416666666666667,
>>>   "content_shutdown_crashes_squared":0,"gmplugin_crashes":0,
>>>   "content_crashes":0,"content_crashes_squared":0,"plugin_crashes":0,
>>>   "plugin_crashes_squared":0,"ping_count":3,
>>>   "gmplugin_crashes_squared":0,"main_crashes_squared":0}
>>> submission_date: 25/08/16
>>>
>>>
>>>
>>>
>>>>
>>>> One thing to consider may be that crash-rates-per-build will look messy
>>>> on anything pre-Beta.
>>>>
>>>
>>> Could be, but here's an example way to present this for nightly:
>>> http://benjamin.smedbergs.us/blog/2013-04-22/graph-of-the-day-empty-minidump-crashes-per-user/
>>>
>>> Here's also a draft of something a while back in STMO:
>>> https://sql.telemetry.mozilla.org/queries/192/source#309
>>>
>>> --BDS
>>>
>>>
>>
>
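
The "channel health trend line" idea from the thread (one aggregate rate over
the current and N previous builds, with N either fixed or chosen by usage
share) might be sketched roughly as below. This is only an illustration, not
the dashboard's actual code. It assumes rows already filtered to one channel
and one activity_date, uses only main_crashes and usage_hours from the stats
shown in the stmo example, treats "kuh" as thousands of usage hours, and
assumes build IDs are timestamp strings that sort chronologically. The sample
rows and all function names are made up for the example.

```python
from collections import defaultdict

# Hypothetical sample rows (values invented for illustration). Each row is
# one slice of a crash_aggregates-like table for a single activity_date.
ROWS = [
    {"build_id": "20170220000000", "main_crashes": 40, "usage_hours": 10000.0},
    {"build_id": "20170223000000", "main_crashes": 10, "usage_hours": 4000.0},
    {"build_id": "20170223000000", "main_crashes": 20, "usage_hours": 6000.0},
    {"build_id": "20170227000000", "main_crashes": 5, "usage_hours": 5000.0},
]


def per_build_totals(rows):
    """Sum crashes and usage hours for each build_id."""
    totals = defaultdict(lambda: {"crashes": 0, "usage_hours": 0.0})
    for row in rows:
        entry = totals[row["build_id"]]
        entry["crashes"] += row["main_crashes"]
        entry["usage_hours"] += row["usage_hours"]
    return dict(totals)


def crash_rate(crashes, usage_hours):
    """Crashes per thousand usage hours, or None if there was no usage."""
    if usage_hours <= 0:
        return None
    return crashes / (usage_hours / 1000.0)


def pick_builds_fixed(totals, n_previous):
    """Newest build plus the n_previous builds before it, newest first."""
    return sorted(totals, reverse=True)[: n_previous + 1]


def pick_builds_by_usage(totals, min_share):
    """Take builds newest-first until they hold more than min_share of usage."""
    all_hours = sum(t["usage_hours"] for t in totals.values())
    picked, covered = [], 0.0
    for build_id in sorted(totals, reverse=True):
        picked.append(build_id)
        covered += totals[build_id]["usage_hours"]
        if all_hours > 0 and covered / all_hours > min_share:
            break
    return picked


def aggregate_rate(totals, build_ids):
    """One crash rate across several builds: pooled crashes over pooled kuh."""
    crashes = sum(totals[b]["crashes"] for b in build_ids)
    hours = sum(totals[b]["usage_hours"] for b in build_ids)
    return crash_rate(crashes, hours)


totals = per_build_totals(ROWS)

# Per-build view: one number per build.
for build_id in sorted(totals):
    t = totals[build_id]
    print(build_id, crash_rate(t["crashes"], t["usage_hours"]))

# Trend-line point for this activity_date, two ways of choosing the builds.
print(aggregate_rate(totals, pick_builds_fixed(totals, 2)))
print(aggregate_rate(totals, pick_builds_by_usage(totals, 0.5)))
```

Pooling crashes and usage hours before dividing (rather than averaging the
per-build rates) weights each build by how much it was actually used, which
is one way to get the "smooth out adoption noise" effect mentioned above.
Repeating the calculation for each activity_date would give the points of
the trend line.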


More information about the fhr-dev mailing list