<div dir="ltr">Security hotfixes are one thing, and in those cases we probably have slightly different priorities (we may be willing to rush a security fix even if it has a minor impact on perf, stability, etc) -- but, per the main thrust of this thread: in the general, non-emergency case, if we want to know with high confidence how a new build impacts metrics we care about (stability, perf, retention, etc etc), really the only way to get that confidence is to run a pretty rigorous RCT. And as we well know, often things appear on Release that we never caught on Beta or before, which, in the worst case, can lead to chemspills that affect the entire Release population. A throttled roll-out to Release is much better than nothing, but we also know that the population of fast updaters is different from the general population, so throttled roll-outs will not produce the same answers as a real RCT. Hence:<br><br>**Anytime we release code without doing so as a real randomly controlled experiment, we're doing an uncontrolled experiment
on our entire user population**<br><br>;-)<br></div><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Mar 8, 2017 at 11:54 AM, Marco Castelluccio <span dir="ltr"><<a href="mailto:mcastelluccio@mozilla.com" target="_blank">mcastelluccio@mozilla.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div text="#000000" bgcolor="#FFFFFF">
<p>This sounds interesting, but is it feasible? Can we hold users
back and prevent them from updating to a possibly more secure
version?</p>
<p>How many days would it take to have results from such a test? We
currently roll out updates to 25% just for a few days.<br>
</p>
<p>I can see us doing this for Aurora/Beta, but I'm less sure about
Release.</p>
<p>- Marco.<br>
</p>
<br>
<div class="m_270149927707753616moz-cite-prefix">Il 08/03/17 20:40, Saptarshi Guha ha
scritto:<br>
</div><div><div class="h5">
<blockquote type="cite">
<div id="m_270149927707753616compose-container">
<div>
<div>I've been saying something similar for a long time now.
I advocate for a UT-based rollout that can be controlled
by variables in UT, unifying the experiments and
release-rollout thought processes.</div>
<div><br>
From an email I sent elsewhere (appropriate here)</div>
<div><br>
</div>
<div>
<div class="gmail_default"><span style="background-color:rgba(255,255,255,0)">I also advocate (strongly!)
doing releases based entirely on UT data. A release (and
even an experiment) can be released to profiles meeting
some criteria, e.g. Release, <a class="m_270149927707753616moz-txt-link-freetext">geo:US</a>, locale: en-US,
memory: low-end, *and* sampleid1 in 1, sampleid2 in 41.</span></div>
<div class="gmail_default"><span style="background-color:rgba(255,255,255,0)">Here sampleid1 and sampleid2 are independent
versions of sample_id (e.g. crc32(salt(client_id)) %%
100). That way we can release to as few as 60,000
profiles. Moreover, we can keep a file with this release
history (what filters, what sampleids were chosen, etc.) and
keep it in the UT JSON itself.</span></div>
<div class="gmail_default"><span>The quality of
the release can be viewed in two ways: effect on
existing users (before vs. after) and test vs. control.</span></div>
<div class="gmail_default"><span><br>
</span></div>
<div class="gmail_default"><span>Cheers</span></div>
<div class="gmail_default"><span>Saptarshi</span></div>
<br>
</div>
</div>
</div>
<br>
<br>
<br>
<div class="gmail_quote">On Mon, Mar 6, 2017 at 11:57 AM -0800,
"Brendan Colloran" <span dir="ltr"><<a href="mailto:bcolloran@mozilla.com" target="_blank">bcolloran@mozilla.com</a>></span> wrote:<br>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div dir="ltr">
<div dir="ltr">
<div>
<div>
<div>
<div>"""I don't actually have a proper model for
when we're allowed to trust crash rates. When is
the calculated crash rate actually indicative of a
release's health? Open question. And one whose
answer changes over time, with pingSender changing
the speed at which we receive crucial inputs."""<br>
<br>
"""Can you describe what other work we need to get
to a high confidence? Especially if there is
analysis/statistical help you need, Saptarshi
already has a lot of context. I want to make sure
that we in relatively short order *can* answer
this in a way that release-drivers can trust to
make critical ship or no-ship decisions."""<br>
<br>
Great conversation, important issues. I'm going to
repeat a slogan that I've been saying for years,
because recent progress at Mozilla indicates that
doing so has actually been effective ;-)<br>
<br>
</div>
**Anytime we release code without doing so as a real
randomly controlled experiment, we're doing an
uncontrolled experiment on our entire user
population**<br>
<br>
It is not a mistake that Randomized Controlled
Trials are the gold standard for *nailing down*
causal relationships in Science. Over the past I
dunno 18 months-ish, Mozilla has gotten way better
at using RCTs to gain real clarity about the impact
of changes, but we still don't do it for
everything... the final frontier is doing RCTs for
entire builds/releases.<br>
<br>
</div>
<div>As you all are very well aware, when you are
watching things like crash numbers or start-up times
bounce around from day to day and version to
version, it's super hard to get clarity about what
is causing the bounces-- there are a ton of
confounding variables. At a minimum:<br>
</div>
</div>
<div>- there is the effect of builds/releases, which is
the thing we can control, and hence the thing whose
effect we really actually care about<br>
</div>
<div>- there are seasonal and weekly cycles that are
pretty well understood, but which (as anyone who has
looked at this stuff knows...) can still be pretty
messy, and which impact different parts of our
population differently (e.g. heavy vs light users)<br>
- there are random shocks to the signal that are
caused by some change in the software ecosystem on the
client (like the DLL injection problem Harald
mentioned) or by a change to some prominent website<br>
</div>
<div>- there are random shocks to the system that happen
because of human behaviors-- news cycles etc changing
browsing habits on short time scales<br>
</div>
<div><br>
</div>
<div>And so I say this now as a statistician who
believes in the value of good analytical/statistical
work: if you *really* want to get to high confidence
about the health of a release, the *only* way to do so
is to reframe releases as RCTs. There is simply too
much noise in our timeseries signals, and we don't
have access to enough of the exogenous explanatory
variables to be able to control for that noise. This
is not an issue where you can just pour more stats on
it and it will go away; it's irreducible, and it's why
all of Science uses RCTs whenever possible (and also
why tons of software companies do...).<br>
<br>
</div>
<div>Luckily, we control our whole system, so for us it
_is_ possible. It would require some thought about the
details of the implementation, and it would surely
take a ton of work from release management, but if we
rolled versions out as A/B tests with randomly assigned
update vs non-update, we could know *for sure* what
impact new code has on crashes, startup time,
engagement, retention, browser perf, etc etc. [Note:
throttling releases is good for smoke testing this
stuff, but it's not the same as real randomization]<br>
<br>
</div>
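A minimal sketch of what the readout of such an update-vs-holdback experiment could look like. The arm sizes, crash counts, and the choice of a two-proportion z-test are all illustrative assumptions, not a description of any existing pipeline:

```python
# Hypothetical readout for an update-vs-holdback RCT: compare crash
# incidence between the two randomly assigned arms with a
# two-proportion z-test (all numbers invented, not real release data).
import math


def two_proportion_z(crashes_a, n_a, crashes_b, n_b):
    """z-statistic for the difference in crash proportions."""
    p_a, p_b = crashes_a / n_a, crashes_b / n_b
    p_pool = (crashes_a + crashes_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se


# updated arm: 30,000 profiles, 450 crashed;
# holdback arm: 30,000 profiles, 390 crashed
z = two_proportion_z(450, 30_000, 390, 30_000)
```

Because assignment is random, a large |z| can be attributed to the new build itself rather than to adoption speed, seasonality, or the other confounders listed above.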
<div>Anyway, I'm sure there are a million reasons why
people won't want to make a big investment in
rethinking giant portions of how release management
works, so in the meantime I'll just leave this
suggestion here to percolate in your brains. Maybe in
another two years we'll be sick enough of all this
uncertainty to make that investment; when we get to
that point, ping me, joy, ilana, etc and we can work
on the details ;-)<br>
<br>
</div>
<div>-bc<br>
</div>
<br>
</div>
<div class="gmail_extra"><br>
<div class="gmail_quote">On Fri, Mar 3, 2017 at 1:52 PM,
Benjamin Smedberg <span dir="ltr"><<a href="mailto:benjamin@smedbergs.us" target="_blank">benjamin@smedbergs.us</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div dir="ltr">
<div>
<div>Yes, I'm aware of the fundamental
difficulty and that's one of the reasons we're
prioritizing pingsender.<br>
<br>
</div>
Can you describe what other work we need to get
to a high confidence? Especially if there is
analysis/statistical help you need, Saptarshi
already has a lot of context. I want to make
sure that we in relatively short order *can*
answer this in a way that release-drivers can
trust to make critical ship or no-ship
decisions.<br>
<br>
</div>
-BDS<br>
<br>
</div>
<div class="m_270149927707753616m_516997586648512450HOEnZb">
<div class="m_270149927707753616m_516997586648512450h5">
<div class="gmail_extra"><br>
<div class="gmail_quote">On Wed, Mar 1, 2017
at 3:18 PM, Chris Hutten-Czapski <span dir="ltr"><<a href="mailto:chutten@mozilla.com" target="_blank">chutten@mozilla.com</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div dir="ltr">
<div>
<div>"How quickly can we get from [a
release] to a reliable crash rate?"<br>
<br>
</div>
Well, if you believe my analysis[1]
(which you may, if you'd like) the
answer is "at least a day out, but
probably best to wait at least a day
longer than that"<br>
<br>
</div>
<div>But aside from shameless
self-promotion, there's the real
concern that I don't actually have a
proper model for when we're allowed to
trust crash rates. When is the
calculated crash rate actually
indicative of a release's health? Open
question. And one whose answer changes
over time, with pingSender changing
the speed at which we receive crucial
inputs.<br>
</div>
<div><br>
</div>
<div>:chutten<br>
</div>
<div><br>
[1]: <a href="https://chuttenblog.wordpress.com/2017/02/09/data-science-is-hard-client-delays-for-crash-pings/" target="_blank">https://chuttenblog.wordpress.com/2017/02/09/data-science-is-hard-client-delays-for-crash-pings/</a></div>
</div>
<div class="m_270149927707753616m_516997586648512450m_-9189569570906763126HOEnZb">
<div class="m_270149927707753616m_516997586648512450m_-9189569570906763126h5">
<div class="gmail_extra"><br>
<div class="gmail_quote">On Wed, Mar
1, 2017 at 3:02 PM, Benjamin
Smedberg <span dir="ltr"><<a href="mailto:benjamin@smedbergs.us" target="_blank">benjamin@smedbergs.us</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div dir="ltr">
<div>
<div>
<div>
<div>
<div>
<div>Think of the
per-build data
like this:<br>
<br>
</div>
* our crash rate for
FF53 b2 is too
goddamn high!<br>
</div>
* We pulled a topcrash
list and found a
regression bug<br>
</div>
* We fixed it and
uplifted it to FF53b4<br>
</div>
* Release drivers want to
make sure that FF53b4 has
the crash rate reduction
that we expected.<br>
<br>
</div>
In this case (which is very
common on all the prerelease
channels), showing by date
and not by build smooths out
the signal we care about
most, which is the
difference per-build. So
what release drivers care
about most is: how quickly
can we get from releasing
beta8 (or RC1) to a reliable
crash rate that says we're
clear to ship this to
release?<br>
<br>
</div>
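The smoothing problem described above can be seen in a toy example: the same crash data grouped by date versus by build (a sketch with invented numbers, not real FF53 data):

```python
# Grouping by date mixes builds together and smooths out the
# per-build difference; grouping by build_id isolates the signal
# release drivers care about. All numbers are made up.
from collections import defaultdict

pings = [
    {"date": "2017-03-01", "build": "53.0b2", "crashes": 9, "kuh": 1.0},
    {"date": "2017-03-01", "build": "53.0b4", "crashes": 1, "kuh": 1.0},
    {"date": "2017-03-02", "build": "53.0b2", "crashes": 4, "kuh": 0.5},
    {"date": "2017-03-02", "build": "53.0b4", "crashes": 3, "kuh": 2.5},
]


def crash_rate_by(key):
    """Crashes per thousand usage hours, grouped by the given key."""
    crashes, kuh = defaultdict(float), defaultdict(float)
    for p in pings:
        crashes[p[key]] += p["crashes"]
        kuh[p[key]] += p["kuh"]
    return {k: crashes[k] / kuh[k] for k in crashes}


by_date = crash_rate_by("date")    # blends b2 and b4 together
by_build = crash_rate_by("build")  # shows the b2 -> b4 improvement
```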
--BDS<br>
<div><br>
</div>
</div>
<div class="m_270149927707753616m_516997586648512450m_-9189569570906763126m_-7488825529247315431HOEnZb">
<div class="m_270149927707753616m_516997586648512450m_-9189569570906763126m_-7488825529247315431h5">
<div class="gmail_extra"><br>
<div class="gmail_quote">On
Wed, Mar 1, 2017 at 2:45
PM, Chris Hutten-Czapski
<span dir="ltr"><<a href="mailto:chutten@mozilla.com" target="_blank">chutten@mozilla.com</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div dir="ltr">
<div>
<div>
<div>
<div>
<div>Thank you
for correcting
my mistake
about
build_id.
Apparently
when I was
looking for
readable
versions I
took the lack
of a bX suffix
on beta builds
to mean there
was no build
data, but that
was wrong then
and wrong now.<br>
<br>
</div>
So yeah. No
problemo,
apparently.
(Well, except
my
misapprehension,
which should
now be
rectified.)<br>
<br>
</div>
<div>Anyway,
back to
wishlisting...
The request I
can most
easily
understand is
for the
existing
display to
instead use
the current
and N previous
_builds'_
crash counts
and kuh (thousands of usage hours) to
form a
"channel
health trend
line". N may
be some fixed,
small number
per channel
(maybe 2, 6,
14, 21), or
some function
of usage (take
all successive
previous
builds until
they contain
> Y% of
that
activity_date's
kuh).<br>
<br>
</div>
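The pooled "channel health trend line" described above can be sketched as follows; the build list, the counts, and the choice of N are made up for illustration:

```python
# For each activity date, pool the current and N previous builds'
# crash counts and usage (kuh, thousands of usage hours) into a
# single crash rate, forming one point on the trend line.
def trend_point(builds, n):
    """builds: newest-first list of (crashes, kuh); pool newest n+1."""
    window = builds[: n + 1]
    total_crashes = sum(c for c, _ in window)
    total_kuh = sum(k for _, k in window)
    return total_crashes / total_kuh


builds = [(12, 3.0), (20, 5.0), (40, 4.0), (8, 1.0)]  # newest first
rate = trend_point(builds, n=2)  # pool current + 2 previous builds
```

The usage-based variant would instead grow the window until it covers more than Y% of that date's kuh rather than using a fixed N.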
<div>This
sounds fun and
useful and is
something I
feel I
understand
well enough to
file an Issue
for and begin
working on.<br>
<br>
</div>
<div>The
per-build case
on the other
hand I feel I
understand
less well. Do
the
pseudo-timeseries
plots as seen
on sql.tmo
work as well
on channels
with fewer
builds? If
instead we
display the
crash rate for
one build as
one number
(per type of
crash), does
it need to be
plotted at
all? Is
Harald's view
of 52b5
actually a
problem? The
best display
for
understanding
the health of
a new release
is... <br>
</div>
<div><br>
</div>
<div>:chutten<br>
</div>
</div>
</div>
</div>
</div>
<div class="m_270149927707753616m_516997586648512450m_-9189569570906763126m_-7488825529247315431m_8443718441706384702HOEnZb">
<div class="m_270149927707753616m_516997586648512450m_-9189569570906763126m_-7488825529247315431m_8443718441706384702h5">
<div class="gmail_extra"><br>
<div class="gmail_quote">On
Wed, Mar 1,
2017 at 2:08
PM, Benjamin
Smedberg <span dir="ltr"><<a href="mailto:benjamin@smedbergs.us" target="_blank">benjamin@smedbergs.us</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote"><span>On
Wed, Mar 1,
2017 at 1:40
PM, Chris
Hutten-Czapski
<span dir="ltr"><<a href="mailto:chutten@mozilla.com" target="_blank">chutten@mozilla.com</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div>
<div>So, for
each channel,
having a line
for each of
the current
and N previous
versions'
crash rates
would be
helpful?
(where N is
small... say
2)<br>
</div>
</div>
</div>
</blockquote>
</span>
<div><br>
It's closer.
If it's
possible to
have a single
line that is
the aggregate,
that might
smooth out
adoption
noise.<br>
</div>
<span>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div><br>
<div>crash_aggregates
does its
aggregation by
version, not
build, so if
crash-rates-per-build
is necessary
this will
require quite
a rewrite.
(I'll have to
join
main_summary
and
crash_summary
on dates) If
crash-rates-per-version
is sufficient
(or is worth
the effort of
exploring),
then that can
be adopted
within the
existing
architecture.<br>
</div>
</div>
</div>
</blockquote>
<div><br>
</div>
</span>
<div>This is
incorrect.
build_id is
one of the
dimensions,
specifically
because we
needed it for
betas with
e10s. Example
from stmo:<br>
<br>
activity_date: 24/08/16<br>
dimensions: {"build_id":"20150804030204","os_name":"Windows_NT","os_version":"6.2","country":"NP","application":"Firefox","architecture":"x86-64","build_version":"42.0a1","channel":"nightly","e10s_enabled":"False"}<br>
stats: {"usage_hours_squared":0.01585779320987654,"main_crashes":0,"content_shutdown_crashes":0,"usage_hours":0.15416666666666667,"content_shutdown_crashes_squared":0,"gmplugin_crashes":0,"content_crashes":0,"content_crashes_squared":0,"plugin_crashes":0,"plugin_crashes_squared":0,"ping_count":3,"gmplugin_crashes_squared":0,"main_crashes_squared":0}<br>
submission_date: 25/08/16<br>
<br>
<br>
</div>
<span>
<div> </div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div>
<div><br>
</div>
<div>One thing
to consider
may be that
crash-rates-per-build
will look
messy on
anything
pre-Beta. <br>
</div>
</div>
</div>
</blockquote>
<div><br>
</div>
</span>
<div>Could be,
but here's an
example way to
present this
for nightly: <a href="http://benjamin.smedbergs.us/blog/2013-04-22/graph-of-the-day-empty-minidump-crashes-per-user/" target="_blank">http://benjamin.smedbergs.us/blog/2013-04-22/graph-of-the-day-empty-minidump-crashes-per-user/</a><br>
<br>
</div>
<div>Here's
also a draft
of something a
while back in
STMO: <a href="https://sql.telemetry.mozilla.org/queries/192/source#309" target="_blank">https://sql.telemetry.mozilla.org/queries/192/source#309</a><br>
<br>
</div>
<div>--BDS</div>
</div>
<br>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</div>
</div>
<br>
_______________________________________________<br>
fhr-dev mailing list<br>
<a href="mailto:fhr-dev@mozilla.org" target="_blank">fhr-dev@mozilla.org</a><br>
<a href="https://mail.mozilla.org/listinfo/fhr-dev" rel="noreferrer" target="_blank">https://mail.mozilla.org/listinfo/fhr-dev</a><br>
<br>
</blockquote>
</div>
<br>
</div>
</div>
</div>
</blockquote>
</div>
</blockquote>
<br>
</div></div></div>
</blockquote></div><br></div>