TMO Stability Dashboard is Changing

Marco Castelluccio mcastelluccio at mozilla.com
Wed Mar 8 19:54:04 UTC 2017


This sounds interesting, but is it feasible? Can we hold users back and 
prevent them from updating to a possibly more secure version?

How many days would it take to have results from such a test? We 
currently roll out updates to 25% just for a few days.

I can see us doing this for Aurora/Beta, but I'm less sure about Release.

- Marco.


On 08/03/17 20:40, Saptarshi Guha wrote:
> I've been saying something similar for a long time now. I advocate 
> for a UT-based rollout that can be controlled based on variables in UT 
> and that unifies the experiments and release rollout thought processes.
>
>  From an email I sent elsewhere (appropriate here)
>
> I also advocate (strongly!) doing releases based entirely on UT data. 
> A release (and even an experiment) can be released to profiles meeting 
> some criteria, e.g. Release, geo: US, locale: en-US, memory: low-end 
> *and* sampleid1 in 1, sampleid2 in 41.
> Here sampleid1 and sampleid2 are independent versions of sample_id (e.g. 
> crc32(salt(client_id)) %% 100). That way we can release to as small 
> as 60,000 profiles. Moreover we can keep a file with this release 
> history (what filters, what sampleids chosen etc) and keep it in the 
> UT json itself.
> The quality of the release can be viewed in two ways: effect on 
> existing users (before vs after) and test vs control.
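[A minimal sketch of the sampleid scheme described above; the salt strings and helper names here are illustrative assumptions, with the mail's `%%` written as Python's `%`:]

```python
import zlib

def sample_ids(client_id, salts=("salt-a", "salt-b"), buckets=100):
    """Independent sample buckets: crc32 of the salted client_id, mod 100.
    Each distinct salt yields an independent partition of the population."""
    return tuple(zlib.crc32((salt + client_id).encode("utf-8")) % buckets
                 for salt in salts)

def eligible(client_id):
    # e.g. the "sampleid1 in 1, sampleid2 in 41" filter from the mail above
    sampleid1, sampleid2 = sample_ids(client_id)
    return sampleid1 == 1 and sampleid2 == 41
```

[Requiring two independent 1% buckets selects roughly 1 in 10,000 profiles, which is how arbitrarily small cohorts fall out of the scheme.]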
>
> Cheers
> Saptarshi
>
>
>
>
> On Mon, Mar 6, 2017 at 11:57 AM -0800, "Brendan Colloran" 
> <bcolloran at mozilla.com> wrote:
>
>     """I don't actually have a proper model for when we're allowed to
>     trust crash rates. When is the calculated crash rate actually
>     indicative of a release's health? Open question. And one whose
>     answer changes over time, with pingSender changing the speed at
>     which we receive crucial inputs."""
>
>     """Can you describe what other work we need to get to a high
>     confidence? Especially if there is analysis/statistical help you
>     need, Saptarshi already has a lot of context. I want to make sure
>     that we in relatively short order *can* answer this in a way that
>     release-drivers can trust to make critical ship or no-ship
>     decisions."""
>
>     Great conversation, important issues. I'm going to repeat a slogan
>     that I've been saying for years, because recent progress at
>     Mozilla indicates that doing so has actually been effective ;-)
>
>     **Anytime we release code without doing so as a real randomly
>     controlled experiment, we're doing an uncontrolled experiment on
>     our entire user population**
>
>     It is not a mistake that Randomized Controlled Trials are the gold
>     standard for *nailing down* causal relationships in Science. Over
>     the past I dunno 18 months-ish, Mozilla has gotten way better at
>     using RCTs to gain real clarity about the impact of changes, but
>     we still don't do it for everything... the final frontier is doing
>     RCTs for entire builds/releases.
>
>     As you all are very well aware, when you are watching things like
>     crash numbers or start-up times bounce around from day to day and
>     version to version, it's super hard to get clarity about what is
>     causing the bounces-- there are a ton of confounding variables. At
>     a minimum:
>     - there is the effect of builds/releases, which is the thing we
>     can control, and hence the thing whose effect we really actually
>     care about
>     - there are seasonal and weekly cycles that are pretty well
>     understood, but which (as anyone who has looked at this stuff
>     knows...) can still be pretty messy, and which impact different
>     parts of our population differently (e.g. heavy vs light users)
>     - there are random shocks to the signal that are caused by
>     some change in the software ecosystem on the client (like the DLL
>     injection problem Harald mentioned) or by a change to some
>     prominent website
>     - there are random shocks to the system that happen because of
>     human behaviors-- news cycles etc changing browsing habits on
>     short time scales
>
>     And so I say this now as a statistician who believes in the value
>     of good analytical/statistical work: if you *really* want to get
>     to high confidence about the health of a release, the *only* way
>     to do so is to reframe releases as RCTs. There is simply too much
>     noise in our timeseries signals, and we don't have access to
>     enough of the exogenous explanatory variables to be able to
>     control for that noise. This is not an issue where you can just
>     pour more stats on it and it will go away; it's irreducible, and
>     it's why all of Science uses RCTs whenever possible (and also why
>     tons of software companies do...).
>
>     Luckily, we control our whole system, so for us it _is_ possible.
>     It would require some thought about the details of the
>     implementation, and it would surely take a ton of work from
>     release management, but if we rolled versions out as A/B tests with
>     randomly assigned update vs non-update, we could know *for sure*
>     what impact new code has on crashes, startup time, engagement,
>     retention, browser perf, etc etc. [Note: throttling releases is
>     good for smoke testing this stuff, but it's not the same as real
>     randomization]
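[A hypothetical sketch of the randomly-assigned update vs non-update idea; the rollout slug, hash choice, and arm names are assumptions, not an existing mechanism:]

```python
import hashlib

def update_arm(client_id, rollout="firefox-53-rc1", treatment_fraction=0.5):
    """Deterministic, uniform arm assignment: hash the rollout slug plus
    the client id, map to [0, 1), and compare against the treatment
    fraction. Unlike throttling, this gives a true randomized control."""
    digest = hashlib.sha256((rollout + client_id).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32  # uniform in [0, 1)
    return "update" if bucket < treatment_fraction else "hold-back"
```

[Because assignment depends only on (rollout, client_id), a client re-asking the update server always lands in the same arm, and each new rollout slug re-randomizes the population.]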
>
>     Anyway, I'm sure there are a million reasons why people won't want
>     to make a big investment in rethinking giant portions of how
>     release management works, so in the meantime I'll just leave this
>     suggestion here to percolate in your brains. Maybe in another two
>     years we'll be sick enough of all this uncertainty to make that
>     investment; when we get to that point, ping me, joy, ilana, etc
>     and we can work on the details ;-)
>
>     -bc
>
>
>     On Fri, Mar 3, 2017 at 1:52 PM, Benjamin Smedberg
>     <benjamin at smedbergs.us> wrote:
>
>         Yes, I'm aware of the fundamental difficulty and that's one of
>         the reasons we're prioritizing pingsender.
>
>         Can you describe what other work we need to get to a high
>         confidence? Especially if there is analysis/statistical help
>         you need, Saptarshi already has a lot of context. I want to
>         make sure that we in relatively short order *can* answer this
>         in a way that release-drivers can trust to make critical ship
>         or no-ship decisions.
>
>         -BDS
>
>
>         On Wed, Mar 1, 2017 at 3:18 PM, Chris Hutten-Czapski
>         <chutten at mozilla.com> wrote:
>
>             "How quickly can we get from [a release] to a reliable
>             crash rate?"
>
>             Well, if you believe my analysis[1] (which you may, if
>             you'd like) the answer is "at least a day out, but
>             probably best to wait at least a day longer than that"
>
>             But aside from shameless self-promotion, there's the real
>             concern that I don't actually have a proper model for when
>             we're allowed to trust crash rates. When is the calculated
>             crash rate actually indicative of a release's health? Open
>             question. And one whose answer changes over time, with
>             pingSender changing the speed at which we receive crucial
>             inputs.
>
>             :chutten
>
>             [1]:
>             https://chuttenblog.wordpress.com/2017/02/09/data-science-is-hard-client-delays-for-crash-pings/
>
>             On Wed, Mar 1, 2017 at 3:02 PM, Benjamin Smedberg
>             <benjamin at smedbergs.us> wrote:
>
>                 Think of the per-build data like this:
>
>                 * our crash rate for FF53 b2 is too goddamn high!
>                 * We pulled a topcrash list and found a regression bug
>                 * We fixed it and uplifted it to FF53b4
>                 * Release drivers want to make sure that FF53b4 has
>                 the crash rate reduction that we expected.
>
>                 In this case (which is very common on all the
>                 prerelease channels), showing by date and not by build
>                 smooths out the signal we care about most, which is
>                 the difference per-build. So what release drivers care
>                 about most is: how quickly can we get from releasing
>                 beta8 (or RC1) to a reliable crash rate that says
>                 we're clear to ship this to release?
>
>                 --BDS
>
>
>                 On Wed, Mar 1, 2017 at 2:45 PM, Chris Hutten-Czapski
>                 <chutten at mozilla.com> wrote:
>
>                     Thank you for correcting my mistake about
>                     build_id. Apparently when I was looking for
>                     readable versions I took the lack of a bX suffix
>                     on beta builds to mean there was no build data,
>                     but that was wrong then and wrong now.
>
>                     So yeah. No problemo, apparently. (Well, except my
>                     misapprehension, which should now be rectified.)
>
>                     Anyway, back to wishlisting... The request I can
>                     most easily understand is for the existing display
>                     to instead use the current and N previous
>                     _builds'_ crash counts and kuh to form a "channel
>                     health trend line". N may be some fixed, small
>                     number per channel (maybe 2, 6, 14, 21), or some
>                     function of usage (take all successive previous
>                     builds until they contain > Y% of that
>                     activity_date's kuh).
>
>                     This sounds fun and useful and is something I feel
>                     I understand well enough to file an Issue for and
>                     begin working on.
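[A minimal sketch of the "channel health trend line" described above, assuming a per-build table of daily crash counts and usage kilo-hours; the row schema and function name are made up for illustration:]

```python
from collections import defaultdict

def channel_trend(rows, n_builds=3):
    """rows: dicts with build_id, activity_date, crashes, usage_khours.
    Pool the N most recent builds' crashes and usage per activity date
    into a single crashes-per-1000-usage-hours trend line."""
    recent = sorted({r["build_id"] for r in rows})[-n_builds:]
    totals = defaultdict(lambda: [0.0, 0.0])
    for r in rows:
        if r["build_id"] in recent:
            totals[r["activity_date"]][0] += r["crashes"]
            totals[r["activity_date"]][1] += r["usage_khours"]
    return {day: crashes / khours if khours else 0.0
            for day, (crashes, khours) in sorted(totals.items())}
```

[The usage-share variant (take successive previous builds until they cover > Y% of a date's usage hours) would only change how `recent` is chosen.]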
>
>                     The per-build case on the other hand I feel I
>                     understand less well. Do the pseudo-timeseries
>                     plots as seen on sql.tmo work as well on channels
>                     with fewer builds? If instead we display the crash
>                     rate for one build as one number (per type of
>                     crash), does it need to be plotted at all? Is
>                     Harald's view of 52b5 actually a problem? The best
>                     display for understanding the health of a new
>                     release is...
>
>                     :chutten
>
>                     On Wed, Mar 1, 2017 at 2:08 PM, Benjamin Smedberg
>                     <benjamin at smedbergs.us> wrote:
>
>                         On Wed, Mar 1, 2017 at 1:40 PM, Chris
>                         Hutten-Czapski <chutten at mozilla.com> wrote:
>
>                             So, for each channel, having a line for
>                             each of the current and N previous
>                             versions' crash rates would be helpful?
>                             (where N is small... say 2)
>
>
>                         It's closer. If it's possible to have a single
>                         line that is the aggregate, that might smooth
>                         out adoption noise.
>
>
>                             crash_aggregates does its aggregation by
>                             version, not build, so if
>                             crash-rates-per-build is necessary this
>                             will require quite a rewrite. (I'll have
>                             to join main_summary and crash_summary on
>                             dates) If crash-rates-per-version is
>                             sufficient (or is worth the effort of
>                             exploring), then that can be adopted
>                             within the existing architecture.
>
>
>                         This is incorrect. build_id is one of the
>                         dimensions, specifically because we needed it
>                         for betas with e10s. Example from stmo:
>
>                         activity_date: 24/08/16
>                         dimensions: {"build_id":"20150804030204","os_name":"Windows_NT","os_version":"6.2","country":"NP","application":"Firefox","architecture":"x86-64","build_version":"42.0a1","channel":"nightly","e10s_enabled":"False"}
>                         stats: {"usage_hours_squared":0.01585779320987654,"main_crashes":0,"content_shutdown_crashes":0,"usage_hours":0.15416666666666667,"content_shutdown_crashes_squared":0,"gmplugin_crashes":0,"content_crashes":0,"content_crashes_squared":0,"plugin_crashes":0,"plugin_crashes_squared":0,"ping_count":3,"gmplugin_crashes_squared":0,"main_crashes_squared":0}
>                         submission_date: 25/08/16
>
>
>
>                             One thing to consider may be that
>                             crash-rates-per-build will look messy on
>                             anything pre-Beta.
>
>
>                         Could be, but here's an example way to present
>                         this for nightly:
>                         http://benjamin.smedbergs.us/blog/2013-04-22/graph-of-the-day-empty-minidump-crashes-per-user/
>
>                         Here's also a draft of something a while back
>                         in STMO:
>                         https://sql.telemetry.mozilla.org/queries/192/source#309
>
>                         --BDS
>
>
>
>
>
>
>         _______________________________________________
>         fhr-dev mailing list
>         fhr-dev at mozilla.org
>         https://mail.mozilla.org/listinfo/fhr-dev
>
>
