TMO Stability Dashboard is Changing
Robert Strong
rstrong at mozilla.com
Wed Mar 8 21:11:23 UTC 2017
As an aside and for my own understanding, in the past geo was determined on
the server side yet you have the client providing geo. Is geo provided by
the client now?
Thanks,
Robert
On Wed, Mar 8, 2017 at 12:59 PM, Saptarshi Guha <sguha at mozilla.com> wrote:
> Good question. Say I am running Firefox build B. My browser pings the
> server for an update (including any identifier it may have, i.e. the 'id'
> field in the JSON below), and receives this JSON:
>
> {
>   id: bdd2e0eb47ff32267f0a928a83f17bbb,
>   criteria: {
>     channel: aurora,
>     locale: en-us,
>     geo: US,
>     arch: x86-64,
>     existingBuildRange: ['20170301', ''],
>     sampleID1: [45],    # random number between 0...99 based on clientID
>     sampleID2: [0, 99]
>   },
>   arm: 'test',
>   payloadPatch: ....,
>   possiblyexpiresAfter: 5 days
> }
>
> The browser checks whether it fulfills the criteria (a 1% sample of
> channel, locale, geo, arch and existing build range). Note that each
> profile is always in a predetermined bucket: we choose buckets, not
> profiles.
>
> If so, the browser is a candidate for an update. Currently we roll out to
> 25% for the first few days; this can be implemented using a match-all
> criterion with sampleID1 set to 25 different numbers.
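> The client-side check could be sketched roughly like this (a toy Python
> sketch, not the actual client code; the hashing scheme and field names
> are my assumptions):

```python
import hashlib

def sample_bucket(client_id: str, salt: str) -> int:
    """Deterministic bucket in 0..99 for a profile; different salts give
    independent bucketings (cf. sampleID1 vs sampleID2)."""
    digest = hashlib.sha256((salt + client_id).encode()).hexdigest()
    return int(digest, 16) % 100

def is_candidate(profile: dict, criteria: dict) -> bool:
    """Does this profile match the rollout criteria from the update JSON?"""
    for key in ("channel", "locale", "geo", "arch"):
        if profile[key] != criteria[key]:
            return False
    lo, hi = criteria["existingBuildRange"]
    if profile["build_id"] < lo or (hi and profile["build_id"] > hi):
        return False
    # We choose buckets, not profiles: a 25% rollout is just 25 bucket
    # numbers listed in sampleID1.
    return sample_bucket(profile["client_id"], "sampleID1") in criteria["sampleID1"]

profile = {"channel": "aurora", "locale": "en-us", "geo": "US",
           "arch": "x86-64", "build_id": "20170305",
           "client_id": "bdd2e0eb47ff32267f0a928a83f17bbb"}
criteria = {"channel": "aurora", "locale": "en-us", "geo": "US",
            "arch": "x86-64", "existingBuildRange": ["20170301", ""],
            "sampleID1": list(range(25))}  # 25 buckets -> ~25% rollout
```

> Because the bucket is a deterministic function of the client id, a
> profile always lands in the same bucket across pings.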
>
> If the browser is not a candidate, we proceed as usual. If it is a
> candidate, then:
>
> - update to the version we specify in 'arm' (either by installing
>   hotfixes/addons/a new binary), but persist the UT data so that the
>   updated browser knows why it updated, i.e. because it was selected to
>   update
> - arm can be of several types:
>   - one must be 'control': the browser doesn't really update but remains
>     on the same build, or it updates to the new build with all new
>     features turned off. Depending on how we implement a new build (is it
>     just a matter of addons, or very much a new binary) we might leave the
>     user on the same build B, or update to new build B' but with all new
>     features *off*
>   - another arm would be varieties of 'test':
>     - test1 could be one version of build B'
>     - test2 another version
>
> We now have 1% of the filtered browsers running B', be it
> control/test1/test2 etc. (as mentioned above, B' for control could well
> be B).
>
> They will run for 5 days (possiblyexpiresAfter) and then query for an
> update, or roll back to build B (e.g. on instability). Since the browser
> pings our server with the 'id' field, we can return special JSONs for
> these browsers (because the experiment/release might have been a terrible
> security/user/stability experience), e.g.
>
> - go back to build B
> - immediately go to build B''
>
> Analyses:
>
> Much like an experiment, we can track several measures of interest over
> the days these browsers run:
>
> - measure opt-out (if a person was forced into an update: "I hate this!")
> - measure performance/stability/user experience across different arms
> - stop early (use a small value for possiblyexpiresAfter, which doesn't
>   force an expiry but allows for one, or maybe continue further)
>
> We can compare:
>
> - for test: before update vs after update
> - for control: use as a baseline, also before and after
>
> Both of these provide 'interaction': we control for the longitudinal
> experience of users (user variation is huge) by looking at their before
> vs after experience, but also look at test vs control.
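> As a toy illustration of the before/after and test-vs-control comparison
> (a difference-in-differences sketch; the crash-rate numbers are made up):

```python
# Hypothetical per-arm mean crash rates (crashes per 1000 usage hours),
# before and after the update.
rates = {
    "test":    {"before": 1.8, "after": 1.5},
    "control": {"before": 1.7, "after": 1.7},
}

def did(rates: dict) -> float:
    """Difference-in-differences: the within-arm before/after change in
    test, minus the same change in control. This controls for the
    longitudinal drift that affects both arms equally."""
    test_change = rates["test"]["after"] - rates["test"]["before"]
    control_change = rates["control"]["after"] - rates["control"]["before"]
    return test_change - control_change
```

> A negative value would suggest the update reduced crashes relative to
> the control baseline.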
>
> Notice, I've framed the release as an experiment. The complicated bit
> (and I understate that bit ...) is engineering the release as an
> experiment ...
>
> Hope this clarifies things and encourages further discussion
>
> On Wed, Mar 8, 2017 at 11:54 AM, Marco Castelluccio <
> mcastelluccio at mozilla.com> wrote:
>
>> This sounds interesting, but is it feasible? Can we hold users back and
>> prevent them from updating to a possibly more secure version?
>>
>> How many days would it take to have results from such a test? We
>> currently roll out updates to 25% just for a few days.
>>
>> I can see us doing this for Aurora/Beta, but I'm less sure about Release.
>>
>> - Marco.
>>
>> Il 08/03/17 20:40, Saptarshi Guha ha scritto:
>>
>> I've been saying something similar for a long time now. I advocate for a
>> UT-based rollout that can be controlled by variables in UT, unifying the
>> experiments and release-rollout thought processes.
>>
>> From an email I sent elsewhere (appropriate here)
>>
>> I also advocate (strongly!) doing releases based entirely on UT data. A
>> release (and even an experiment) can be released to profiles meeting some
>> criteria, e.g. Release, geo: US, locale: en-us, memory: low-end *and*
>> sampleid1 in 1, sampleid2 in 41.
>> Here sampleid1 and sampleid2 are independent versions of sample_id (e.g.
>> crc32(salt(client_id)) %% 100). That way we can release to as few as
>> 60,000 profiles. Moreover, we can keep a file with this release history
>> (what filters, what sampleids were chosen, etc.) and keep it in the UT
>> JSON itself.
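>> A rough illustration of independent sample ids (a Python sketch of the
>> crc32 idea above; the salt values are made up and the real
>> implementation may differ):

```python
import zlib

def sample_id(client_id: str, salt: str) -> int:
    """An independent 0..99 sample id per salt, in the spirit of
    crc32(salt(client_id)) % 100."""
    return zlib.crc32((salt + client_id).encode()) % 100

def in_release(client_id: str) -> bool:
    """Requiring two independent ids to hit specific values selects
    roughly 1 in 10,000 profiles."""
    return (sample_id(client_id, "salt1") == 1 and
            sample_id(client_id, "salt2") == 41)
```

>> Different salts give (approximately) independent bucketings, so the
>> intersection of criteria shrinks the audience multiplicatively.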
>> The quality of the release can be viewed in two ways: effect on existing
>> users (before vs after) and test vs control.
>>
>> Cheers
>> Saptarshi
>>
>>
>>
>>
>> On Mon, Mar 6, 2017 at 11:57 AM -0800, "Brendan Colloran" <
>> bcolloran at mozilla.com> wrote:
>>
>> """I don't actually have a proper model for when we're allowed to trust
>>> crash rates. When is the calculated crash rate actually indicative of a
>>> release's health? Open question. And one whose answer changes over time,
>>> with pingSender changing the speed at which we receive crucial inputs."""
>>>
>>> """Can you describe what other work we need to get to a high confidence?
>>> Especially if there is analysis/statistical help you need, Saptarshi
>>> already has a lot of context. I want to make sure that we in relatively
>>> short order *can* answer this in a way that release-drivers can trust to
>>> make critical ship or no-ship decisions."""
>>>
>>> Great conversation, important issues. I'm going to repeat a slogan that
>>> I've been saying for years, because recent progress at Mozilla indicates
>>> that doing so has actually been effective ;-)
>>>
>>> **Anytime we release code without doing so as a real randomly controlled
>>> experiment, we're doing an uncontrolled experiment on our entire user
>>> population**
>>>
>>> It is not a mistake that Randomized Controlled Trials are the gold
>>> standard for *nailing down* causal relationships in Science. Over the past
>>> I dunno 18 months-ish, Mozilla has gotten way better at using RCTs to gain
>>> real clarity about the impact of changes, but we still don't do it for
>>> everything... the final frontier is doing RCTs for entire builds/releases.
>>>
>>> As you all are very well aware, when you are watching things like crash
>>> numbers or start-up times bounce around from day to day and version to
>>> version, it's super hard to get clarity about what is causing the bounces--
>>> there are a ton of confounding variables. At a minimum:
>>> - there is the effect of builds/releases, which is the thing we can
>>> control, and hence the thing whose effect we really actually care about
>>> - there are seasonal and weekly cycles that are pretty well understood,
>>> but which (as anyone who has looked at this stuff knows...) can still be
>>> pretty messy, and which impact different parts of our population
>>> differently (e.g. heavy vs light users)
>>> - there are random shocks to the signal that are caused by some
>>> change in the software ecosystem on the client (like the DLL injection
>>> problem Harald mentioned) or by a change to some prominent website
>>> - there are random shocks to the system that happen because of human
>>> behaviors-- news cycles etc. changing browsing habits on short time scales
>>>
>>> And so I say this now as a statistician who believes in the value of
>>> good analytical/statistical work: if you *really* want to get to high
>>> confidence about the health of a release, the *only* way to do so is to
>>> reframe releases as RCTs. There is simply too much noise in our timeseries
>>> signals, and we don't have access to enough of the exogenous explanatory
>>> variables to be able to control for that noise. This is not an issue where
>>> you can just pour more stats on it and it will go away; it's irreducible,
>>> and it's why all of Science uses RCTs whenever possible (and also why tons
>>> of software companies do...).
>>>
>>> Luckily, we control our whole system, so for us it _is_ possible. It
>>> would require some thought about the details of the implementation, and it
>>> would surely take a ton of work from release management, but if we rolled
>>> versions out as A/B tests with randomly assigned update vs non-update, we
>>> could know *for sure* what impact new code has on crashes, startup time,
>>> engagement, retention, browser perf, etc etc. [Note: throttling releases is
>>> good for smoke testing this stuff, but it's not the same as real
>>> randomization]
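>>> A minimal sketch of what hash-based random assignment could look like
>>> (hypothetical Python with a made-up per-experiment salt; not a proposal
>>> for the actual client code):

```python
import hashlib

def assign_arm(client_id: str, experiment_salt: str,
               update_fraction: float = 0.5) -> str:
    """Deterministically assign a profile to 'update' or 'hold-back'.
    Hashing the salted client id gives a stable pseudo-random draw, so a
    profile always lands in the same arm across pings, yet arms are
    randomized across the population."""
    digest = hashlib.sha256((experiment_salt + client_id).encode()).hexdigest()
    u = int(digest[:8], 16) / 0xFFFFFFFF  # approximately uniform in [0, 1]
    return "update" if u < update_fraction else "hold-back"
```

>>> With assignment randomized this way, differences in crashes, startup
>>> time, retention, etc. between the two arms can be attributed to the new
>>> code rather than to seasonal cycles or ecosystem shocks.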
>>>
>>> Anyway, I'm sure there are a million reasons why people won't want to
>>> make a big investment in rethinking giant portions of how release
>>> management works, so in the meantime I'll just leave this suggestion here
>>> to percolate in your brains. Maybe in another two years we'll be sick
>>> enough of all this uncertainty to make that investment; when we get to that
>>> point, ping me, joy, ilana, etc and we can work on the details ;-)
>>>
>>> -bc
>>>
>>>
>>> On Fri, Mar 3, 2017 at 1:52 PM, Benjamin Smedberg <benjamin at smedbergs.us
>>> > wrote:
>>>
>>>> Yes, I'm aware of the fundamental difficulty and that's one of the
>>>> reasons we're prioritizing pingsender.
>>>>
>>>> Can you describe what other work we need to get to a high confidence?
>>>> Especially if there is analysis/statistical help you need, Saptarshi
>>>> already has a lot of context. I want to make sure that we in relatively
>>>> short order *can* answer this in a way that release-drivers can trust to
>>>> make critical ship or no-ship decisions.
>>>>
>>>> -BDS
>>>>
>>>>
>>>> On Wed, Mar 1, 2017 at 3:18 PM, Chris Hutten-Czapski <
>>>> chutten at mozilla.com> wrote:
>>>>
>>>>> "How quickly can we get from [a release] to a reliable crash rate?"
>>>>>
>>>>> Well, if you believe my analysis[1] (which you may, if you'd like) the
>>>>> answer is "at least a day out, but probably best to wait at least a day
>>>>> longer than that"
>>>>>
>>>>> But aside from shameless self-promotion, there's the real concern that
>>>>> I don't actually have a proper model for when we're allowed to trust crash
>>>>> rates. When is the calculated crash rate actually indicative of a release's
>>>>> health? Open question. And one whose answer changes over time, with
>>>>> pingSender changing the speed at which we receive crucial inputs.
>>>>>
>>>>> :chutten
>>>>>
>>>>> [1]: https://chuttenblog.wordpress.com/2017/02/09/data-science-is-hard-client-delays-for-crash-pings/
>>>>>
>>>>> On Wed, Mar 1, 2017 at 3:02 PM, Benjamin Smedberg <
>>>>> benjamin at smedbergs.us> wrote:
>>>>>
>>>>>> Think of the per-build data like this:
>>>>>>
>>>>>> * our crash rate for FF53 b2 is too goddamn high!
>>>>>> * We pulled a topcrash list and found a regression bug
>>>>>> * We fixed it and uplifted it to FF53b4
>>>>>> * Release drivers want to make sure that FF53b4 has the crash rate
>>>>>> reduction that we expected.
>>>>>>
>>>>>> In this case (which is very common on all the prerelease channels),
>>>>>> showing by date and not by build smooths out the signal we care about most,
>>>>>> which is the difference per-build. So what release drivers care about most
>>>>>> is: how quickly can we get from releasing beta8 (or RC1) to a reliable
>>>>>> crash rate that says we're clear to ship this to release?
>>>>>>
>>>>>> --BDS
>>>>>>
>>>>>>
>>>>>> On Wed, Mar 1, 2017 at 2:45 PM, Chris Hutten-Czapski <
>>>>>> chutten at mozilla.com> wrote:
>>>>>>
>>>>>>> Thank you for correcting my mistake about build_id. Apparently when
>>>>>>> I was looking for readable versions I took the lack of a bX suffix on beta
>>>>>>> builds to mean there was no build data, but that was wrong then and wrong
>>>>>>> now.
>>>>>>>
>>>>>>> So yeah. No problemo, apparently. (Well, except my misapprehension,
>>>>>>> which should now be rectified.)
>>>>>>>
>>>>>>> Anyway, back to wishlisting... The request I can most easily
>>>>>>> understand is for the existing display to instead use the current and N
>>>>>>> previous _builds'_ crash counts and kuh to form a "channel health trend
>>>>>>> line". N may be some fixed, small number per channel (maybe 2, 6, 14, 21),
>>>>>>> or some function of usage (take all successive previous builds until they
>>>>>>> contain > Y% of that activity_date's kuh).
>>>>>>>
>>>>>>> This sounds fun and useful and is something I feel I understand well
>>>>>>> enough to file an Issue for and begin working on.
>>>>>>>
>>>>>>> The per-build case on the other hand I feel I understand less well.
>>>>>>> Do the pseudo-timeseries plots as seen on sql.tmo work as well on channels
>>>>>>> with fewer builds? If instead we display the crash rate for one build as
>>>>>>> one number (per type of crash), does it need to be plotted at all? Is
>>>>>>> Harald's view of 52b5 actually a problem? The best display for
>>>>>>> understanding the health of a new release is...
>>>>>>>
>>>>>>> :chutten
>>>>>>>
>>>>>>> On Wed, Mar 1, 2017 at 2:08 PM, Benjamin Smedberg <
>>>>>>> benjamin at smedbergs.us> wrote:
>>>>>>>
>>>>>>>> On Wed, Mar 1, 2017 at 1:40 PM, Chris Hutten-Czapski <
>>>>>>>> chutten at mozilla.com> wrote:
>>>>>>>>
>>>>>>>>> So, for each channel, having a line for each of the current and N
>>>>>>>>> previous versions' crash rates would be helpful? (where N is small... say 2)
>>>>>>>>>
>>>>>>>>
>>>>>>>> It's closer. If it's possible to have a single line that is the
>>>>>>>> aggregate, that might smooth out adoption noise.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> crash_aggregates does its aggregation by version, not build, so if
>>>>>>>>> crash-rates-per-build is necessary this will require quite a rewrite. (I'll
>>>>>>>>> have to join main_summary and crash_summary on dates) If
>>>>>>>>> crash-rates-per-version is sufficient (or is worth the effort of
>>>>>>>>> exploring), then that can be adopted within the existing architecture.
>>>>>>>>>
>>>>>>>>
>>>>>>>> This is incorrect. build_id is one of the dimensions, specifically
>>>>>>>> because we needed it for betas with e10s. Example from stmo:
>>>>>>>>
>>>>>>>> activity_date: 24/08/16
>>>>>>>> dimensions: {"build_id":"20150804030204","os_name":"Windows_NT",
>>>>>>>> "os_version":"6.2","country":"NP","application":"Firefox",
>>>>>>>> "architecture":"x86-64","build_version":"42.0a1","channel":"nightly",
>>>>>>>> "e10s_enabled":"False"}
>>>>>>>> stats: {"usage_hours_squared":0.01585779320987654,"main_crashes":0,
>>>>>>>> "content_shutdown_crashes":0,"usage_hours":0.15416666666666667,
>>>>>>>> "content_shutdown_crashes_squared":0,"gmplugin_crashes":0,
>>>>>>>> "content_crashes":0,"content_crashes_squared":0,"plugin_crashes":0,
>>>>>>>> "plugin_crashes_squared":0,"ping_count":3,
>>>>>>>> "gmplugin_crashes_squared":0,"main_crashes_squared":0}
>>>>>>>> submission_date: 25/08/16
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> One thing to consider may be that crash-rates-per-build will look
>>>>>>>>> messy on anything pre-Beta.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Could be, but here's an example way to present this for nightly:
>>>>>>>> http://benjamin.smedbergs.us/blog/2013-04-22/graph-of-the-day-empty-minidump-crashes-per-user/
>>>>>>>>
>>>>>>>> Here's also a draft of something a while back in STMO:
>>>>>>>> https://sql.telemetry.mozilla.org/queries/192/source#309
>>>>>>>>
>>>>>>>> --BDS
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>> _______________________________________________
>>>> fhr-dev mailing list
>>>> fhr-dev at mozilla.org
>>>> https://mail.mozilla.org/listinfo/fhr-dev
>>>>
>>>>
>>>
>>
>
>