TMO Stability Dashboard is Changing
Marco Castelluccio
mcastelluccio at mozilla.com
Wed Mar 8 19:54:04 UTC 2017
This sounds interesting, but is it feasible? Can we hold users back and
prevent them from updating to a possibly more secure version?
How many days would it take to have results from such a test? We
currently roll out updates to 25% just for a few days.
I can see us doing this for Aurora/Beta, but I'm less sure about Release.
- Marco.
On 08/03/17 20:40, Saptarshi Guha wrote:
> I've been saying something similar for a long time now. I advocate
> for a UT-based rollout that can be controlled based on variables in UT,
> and for unifying the experiments and release rollout thought processes.
>
> From an email I sent elsewhere (appropriate here)
>
> I also advocate (strongly!) doing releases based entirely on UT data.
> A release (and even an experiment) can be released to profiles meeting
> some criteria e.g. Release, geo:US, locale: en-us memory: low-end
> *and* sampleid1 in 1, sampleid2 in 41.
> Here sampleid1 and sampleid2 are independent versions of sample_id
> (e.g. crc32(salt(client_id)) %% 100). That way we can release to as small
> as 60,000 profiles. Moreover we can keep a file with this release
> history (what filters, what sampleids chosen etc) and keep it in the
> UT json itself.
> The quality of the release can be viewed in two ways: effect on
> existing users (before vs after) and test vs control.
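The independent sample ids described above could be sketched like this (the salt strings and the specific bucket choices are invented for illustration, not the actual implementation):

```python
import zlib

def sample_id(client_id: str, salt: str) -> int:
    # One of several independent sample ids per profile, per the
    # crc32(salt(client_id)) %% 100 idea above; different salts give
    # effectively independent 0-99 bucketings of the same population.
    return zlib.crc32((salt + client_id).encode("utf-8")) % 100

def in_rollout(client_id: str) -> bool:
    # e.g. sampleid1 in {1} *and* sampleid2 in {41}: each filter keeps
    # ~1% of profiles, so together they keep roughly 1 in 10,000.
    return (sample_id(client_id, "salt-one") == 1
            and sample_id(client_id, "salt-two") == 41)
```

Intersecting filters on independent sample ids is what lets the target population shrink multiplicatively, down to the ~60,000-profile scale mentioned above.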
>
> Cheers
> Saptarshi
>
>
>
>
> On Mon, Mar 6, 2017 at 11:57 AM -0800, "Brendan Colloran"
> <bcolloran at mozilla.com> wrote:
>
> """I don't actually have a proper model for when we're allowed to
> trust crash rates. When is the calculated crash rate actually
> indicative of a release's health? Open question. And one whose
> answer changes over time, with pingSender changing the speed at
> which we receive crucial inputs."""
>
> """Can you describe what other work we need to get to a high
> confidence? Especially if there is analysis/statistical help you
> need, Saptarshi already has a lot of context. I want to make sure
> that we in relatively short order *can* answer this in a way that
> release-drivers can trust to make critical ship or no-ship
> decisions."""
>
> Great conversation, important issues. I'm going to repeat a slogan
> that I've been saying for years, because recent progress at
> Mozilla indicates that doing so has actually been effective ;-)
>
> **Anytime we release code without doing so as a real randomly
> controlled experiment, we're doing an uncontrolled experiment on
> our entire user population**
>
> It is not a mistake that Randomized Controlled Trials are the gold
> standard for *nailing down* causal relationships in Science. Over
> the past I dunno 18 months-ish, Mozilla has gotten way better at
> using RCTs to gain real clarity about the impact of changes, but
> we still don't do it for everything... the final frontier is doing
> RCTs for entire builds/releases.
>
> As you all are very well aware, when you are watching things like
> crash numbers or start-up times bounce around from day to day and
> version to version, it's super hard to get clarity about what is
> causing the bounces-- there are a ton of confounding variables. At
> a minimum:
> - there is the effect of builds/releases, which is the thing we
> can control, and hence the thing whose effect we really actually
> care about
> - there are seasonal and weekly cycles that are pretty well
> understood, but which (as anyone who has looked at this stuff
> knows...) can still be pretty messy, and which impact different
> parts of our population differently (e.g. heavy vs light users)
> - there are random shocks to the signal that are caused by
> some change in the software ecosystem on the client (like the DLL
> injection problem Harald mentioned) or by a change to some
> prominent website
> - there are random shocks to the system that happen because of
> human behaviors-- news cycles etc changing browsing habits on
> short time scales
>
> And so I say this now as a statistician who believes in the value
> of good analytical/statistical work: if you *really* want to get
> to high confidence about the health of a release, the *only* way
> to do so is to reframe releases as RCTs. There is simply too much
> noise in our timeseries signals, and we don't have access to
> enough of the exogenous explanatory variables to be able to
> control for that noise. This is not an issue where you can just
> pour more stats on it and it will go away; it's irreducible, and
> it's why all of Science uses RCTs whenever possible (and also why
> tons of software companies do...).
>
> Luckily, we control our whole system, so for us it _is_ possible.
> It would require some thought about the details of the
> implementation, and it would surely take a ton of work from
> release management, but if we rolled versions out as A/B tests with
> randomly assigned update vs non-update, we could know *for sure*
> what impact new code has on crashes, startup time, engagement,
> retention, browser perf, etc etc. [Note: throttling releases is
> good for smoke testing this stuff, but it's not the same as real
> randomization]
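A deterministic version of that random assignment might look like this (the hashing scheme and arm names here are just one reasonable option, not an actual Mozilla mechanism):

```python
import hashlib

def assign_arm(client_id: str, rollout_slug: str) -> str:
    # Deterministic, uniform coin flip per profile: hash the profile id
    # together with a per-rollout slug so each rollout re-randomizes
    # independently of earlier ones.
    digest = hashlib.sha256((rollout_slug + client_id).encode("utf-8")).hexdigest()
    # "update" is offered the new build; "hold" stays on the old one,
    # so the two arms differ only in the code change under test.
    return "update" if int(digest, 16) % 2 == 0 else "hold"
```

Comparing crashes, startup time, retention, etc. between the two arms then isolates the effect of the new build from the confounders listed above.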
>
> Anyway, I'm sure there are a million reasons why people won't want
> to make a big investment in rethinking giant portions of how
> release management works, so in the meantime I'll just leave this
> suggestion here to percolate in your brains. Maybe in another two
> years we'll be sick enough of all this uncertainty to make that
> investment; when we get to that point, ping me, joy, ilana, etc
> and we can work on the details ;-)
>
> -bc
>
>
> On Fri, Mar 3, 2017 at 1:52 PM, Benjamin Smedberg
> <benjamin at smedbergs.us> wrote:
>
> Yes, I'm aware of the fundamental difficulty and that's one of
> the reasons we're prioritizing pingsender.
>
> Can you describe what other work we need to get to a high
> confidence? Especially if there is analysis/statistical help
> you need, Saptarshi already has a lot of context. I want to
> make sure that we in relatively short order *can* answer this
> in a way that release-drivers can trust to make critical ship
> or no-ship decisions.
>
> -BDS
>
>
> On Wed, Mar 1, 2017 at 3:18 PM, Chris Hutten-Czapski
> <chutten at mozilla.com> wrote:
>
> "How quickly can we get from [a release] to a reliable
> crash rate?"
>
> Well, if you believe my analysis[1] (which you may, if
> you'd like) the answer is "at least a day out, but
> probably best to wait at least a day longer than that"
>
> But aside from shameless self-promotion, there's the real
> concern that I don't actually have a proper model for when
> we're allowed to trust crash rates. When is the calculated
> crash rate actually indicative of a release's health? Open
> question. And one whose answer changes over time, with
> pingSender changing the speed at which we receive crucial
> inputs.
>
> :chutten
>
> [1]:
> https://chuttenblog.wordpress.com/2017/02/09/data-science-is-hard-client-delays-for-crash-pings/
>
> On Wed, Mar 1, 2017 at 3:02 PM, Benjamin Smedberg
> <benjamin at smedbergs.us> wrote:
>
> Think of the per-build data like this:
>
> * our crash rate for FF53 b2 is too goddamn high!
> * We pulled a topcrash list and found a regression bug
> * We fixed it and uplifted it to FF53b4
> * Release drivers want to make sure that FF53b4 has
> the crash rate reduction that we expected.
>
> In this case (which is very common on all the
> prerelease channels), showing by date and not by build
> smooths out the signal we care about most, which is
> the difference per-build. So what release drivers care
> about most is: how quickly can we get from releasing
> beta8 (or RC1) to a reliable crash rate that says
> we're clear to ship this to release?
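The by-date vs by-build distinction can be made concrete with a small sketch (field names and the per-1,000-usage-hours rate are assumptions for illustration):

```python
from collections import defaultdict

def rate_by_build(pings):
    # pings: iterable of dicts with "build_id", "main_crashes",
    # "usage_hours". Grouping by build (not by submission date) keeps
    # the per-build difference -- e.g. FF53b2 vs FF53b4 -- visible
    # instead of smoothing it into a daily aggregate.
    totals = defaultdict(lambda: [0, 0.0])
    for p in pings:
        totals[p["build_id"]][0] += p["main_crashes"]
        totals[p["build_id"]][1] += p["usage_hours"]
    # crashes per 1,000 usage hours, per build
    return {b: 1000.0 * c / h for b, (c, h) in totals.items() if h}
```

A drop in the `53.0b4` entry relative to `53.0b2` is exactly the signal release drivers are asking about.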
>
> --BDS
>
>
> On Wed, Mar 1, 2017 at 2:45 PM, Chris Hutten-Czapski
> <chutten at mozilla.com> wrote:
>
> Thank you for correcting my mistake about
> build_id. Apparently when I was looking for
> readable versions I took the lack of a bX suffix
> on beta builds to mean there was no build data,
> but that was wrong then and wrong now.
>
> So yeah. No problemo, apparently. (Well, except my
> misapprehension, which should now be rectified.)
>
> Anyway, back to wishlisting... The request I can
> most easily understand is for the existing display
> to instead use the current and N previous
> _builds'_ crash counts and kuh to form a "channel
> health trend line". N may be some fixed, small
> number per channel (maybe 2, 6, 14, 21), or some
> function of usage (take all successive previous
> builds until they contain > Y% of that
> activity_date's kuh).
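The usage-based variant of N could be sketched like this (the function name and the cutoff rule are a guess at the intent, not an existing API):

```python
def builds_for_trend(builds, y_pct):
    # builds: list of (build_id, kuh) pairs for one activity_date,
    # newest build first. Take successive previous builds until the
    # ones kept cover more than y_pct of that date's total kuh.
    total = sum(kuh for _, kuh in builds)
    kept, covered = [], 0.0
    for build_id, kuh in builds:
        kept.append(build_id)
        covered += kuh
        if total and covered > (y_pct / 100.0) * total:
            break
    return kept
```

With a fast-adopting channel this naturally keeps few builds; with slow adoption it reaches further back, which is roughly what a "channel health trend line" wants.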
>
> This sounds fun and useful and is something I feel
> I understand well enough to file an Issue for and
> begin working on.
>
> The per-build case on the other hand I feel I
> understand less well. Do the pseudo-timeseries
> plots as seen on sql.tmo work as well on channels
> with fewer builds? If instead we display the crash
> rate for one build as one number (per type of
> crash), does it need to be plotted at all? Is
> Harald's view of 52b5 actually a problem? The best
> display for understanding the health of a new
> release is...
>
> :chutten
>
> On Wed, Mar 1, 2017 at 2:08 PM, Benjamin Smedberg
> <benjamin at smedbergs.us> wrote:
>
> On Wed, Mar 1, 2017 at 1:40 PM, Chris
> Hutten-Czapski <chutten at mozilla.com> wrote:
>
> So, for each channel, having a line for
> each of the current and N previous
> versions' crash rates would be helpful?
> (where N is small... say 2)
>
>
> It's closer. If it's possible to have a single
> line that is the aggregate, that might smooth
> out adoption noise.
>
>
> crash_aggregates does its aggregation by
> version, not build, so if
> crash-rates-per-build is necessary this
> will require quite a rewrite. (I'll have
> to join main_summary and crash_summary on
> dates) If crash-rates-per-version is
> sufficient (or is worth the effort of
> exploring), then that can be adopted
> within the existing architecture.
>
>
> This is incorrect. build_id is one of the
> dimensions, specifically because we needed it
> for betas with e10s. Example from stmo:
>
> activity_date: 24/08/16
> submission_date: 25/08/16
> dimensions: {"build_id": "20150804030204", "os_name": "Windows_NT",
> "os_version": "6.2", "country": "NP", "application": "Firefox",
> "architecture": "x86-64", "build_version": "42.0a1",
> "channel": "nightly", "e10s_enabled": "False"}
> stats: {"ping_count": 3, "usage_hours": 0.15416666666666667,
> "usage_hours_squared": 0.01585779320987654,
> "main_crashes": 0, "main_crashes_squared": 0,
> "content_crashes": 0, "content_crashes_squared": 0,
> "content_shutdown_crashes": 0,
> "content_shutdown_crashes_squared": 0,
> "plugin_crashes": 0, "plugin_crashes_squared": 0,
> "gmplugin_crashes": 0, "gmplugin_crashes_squared": 0}
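From a record like that, the headline number is straightforward to compute (a sketch that assumes the rate in question is main crashes per 1,000 usage hours, which may differ from the dashboard's exact definition):

```python
def crash_rate_per_khour(stats: dict) -> float:
    # Main crashes per 1,000 usage hours, from a crash_aggregates-style
    # "stats" blob like the example above. Assumed definition; the
    # dashboard may compute its rate slightly differently.
    hours = stats["usage_hours"]
    return 1000.0 * stats["main_crashes"] / hours if hours else 0.0
```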
>
>
>
> One thing to consider may be that
> crash-rates-per-build will look messy on
> anything pre-Beta.
>
>
> Could be, but here's an example way to present
> this for nightly:
> http://benjamin.smedbergs.us/blog/2013-04-22/graph-of-the-day-empty-minidump-crashes-per-user/
>
> Here's also a draft of something a while back
> in STMO:
> https://sql.telemetry.mozilla.org/queries/192/source#309
>
> --BDS
>
>
>
>
>
>
> _______________________________________________
> fhr-dev mailing list
> fhr-dev at mozilla.org
> https://mail.mozilla.org/listinfo/fhr-dev
>
>