Suggestions for the new unified FHR/Telemetry/Experiment ping
Vladan Djeric
vdjeric at mozilla.com
Tue Jan 27 01:48:40 PST 2015
This message is going to cover a lot of stuff, so you might want to grab a
cup of coffee now :)
So first, even though it may seem like I'm strongly arguing against the
Telemetry/FHR v4 approach, I'm really trying to figure out how all the
different Telemetry use-cases are going to be affected by the new ping
format. And I'm also trying to figure out a reasonable plan of attack.
I think the biggest "unknown" in the current discussions is the design of
the new Telemetry backend and whether it will be able to stitch together
user sessions and "user-days" efficiently.
I think the backend will require significant changes. The pings will have
to be grouped and stored differently. We'll have to write a LOT of new
backend code for "stitching" pings, feeding the dashboards, supporting
custom analyses, etc. We'll need to create a prototype of the new
Telemetry/FHR backend and test it against realistic data volumes before
landing major client-side changes. Migrating existing Telemetry (and FHR?)
pings to the new backend will be impossible, so we'll have to run two
systems for a while. We'll also need to continue supporting the old
Telemetry (and FHR?) formats for Fennec & B2G.
The timing for this change is bad because the Perf team will need Telemetry in
Q1 to evaluate E10S & Flash performance, not to mention the other teams relying
on Telemetry and FHR data for their projects. So I think it's a mistake to
risk significant FHR & Telemetry downtime while we rewrite the entire
backend.
I think we should try to land the client-side changes gradually.
A rough sketch:
1. Modify the server backend to accept Telemetry pings in the new JSON
format (clientID, sessionID, sysinfo, env, etc). Bump version number.
- Also update the backend to parse E10S data from the Telemetry
payload
- Update dashboard code
2. Modify client to send Telemetry pings in the new JSON format. Rip
out the obsolete "idle-daily" Telemetry ping.
- Continue submitting Telemetry saved-session ping
- Continue to upload FHR data separately
3. Modify the Telemetry backend code to ignore all mid-session Telemetry
pings.
4. Modify Telemetry client-side to create new pings on 24-hour
boundaries. Do not reset any measurements in the middle of a session.
Indicate the last ping of a session explicitly. Bump version number.
- This is essentially my earlier proposal, but as a (hopefully)
temporary measure.
- E10S child Telemetry code will have to be modified to create
subsessions at the same time as parent
5. Adapt Telemetry Experiments backend as needed
6. Modify Telemetry code to create new pings on environment change.
Update Experiments code as needed
7. Modify backend to parse and store the new unified ping format. Stress
test etc.
- All FHR backend functionality should be complete before the next
step
- The FHR user-day stitching code should be finished and tested with
realistic loads
8. Switch FHR to use the Telemetry upload mechanism, and switch over FHR
& Telemetry to the new unified ping format. Reset FHR measurements on new
pings. Bump version number
- At this point, the people working on the FHR backend can stabilize
and improve the backend
- After we have experience with handling the new ping semantics with
FHR, we can move on to converting Telemetry in the next step
9. Assuming success of step 7, write and test session stitching,
extend FHR user-day stitching to Telemetry data, integrate new Telemetry
subsession format with map-reduce and Spark analysis, dashboards,
regression detector, etc.
- Lots of custom stitching rules, e.g. Background-Hang-Reporter data
10. Change Telemetry to reset-on-new-ping semantics. Adapt existing
probes. Bump version number.
- about:telemetry will be a pain to convert. It will have to do
client-side stitching using saved local pings, or display the subsession
separately. Both are bad
- fx-team's UITelemetry reporting will require some special
attention. It's very much session-based and there are no clear rules for
stitching it together from subsessions
- Identify and fix one-per-session histograms
- Document and publicize the new ping model for Telemetry probe
authors
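To make steps 3 and 4 concrete, here is a minimal sketch in Python (not the
actual client or backend code). It assumes a ping is a plain dict; only
"sessionID" comes from the proposed format above, while "pingNumber",
"isFinal", the Session class and final_pings() are hypothetical names made up
for illustration. Measurements accumulate for the whole session and are never
reset, a ping becomes due each time a 24-hour boundary is crossed, and the
backend keeps only the explicitly marked last ping of each session.

```python
from dataclasses import dataclass, field

DAY_SECONDS = 24 * 60 * 60


@dataclass
class Session:
    """Hypothetical client-side session state (illustration only)."""
    session_id: str
    started_at: float                           # session start, in seconds
    counts: dict = field(default_factory=dict)  # cumulative, never reset mid-session
    pings_sent: int = 0

    def record(self, probe, amount=1):
        """Accumulate a measurement for the whole session."""
        self.counts[probe] = self.counts.get(probe, 0) + amount

    def due_pings(self, now):
        """24-hour boundaries crossed so far that have not produced a ping yet."""
        return int((now - self.started_at) // DAY_SECONDS) - self.pings_sent

    def make_ping(self, final=False):
        """Snapshot the cumulative measurements without resetting them."""
        self.pings_sent += 1
        return {
            "sessionID": self.session_id,
            "pingNumber": self.pings_sent,  # hypothetical field
            "isFinal": final,               # hypothetical "last ping of session" marker
            "counts": dict(self.counts),
        }


def final_pings(pings):
    """Backend side of step 3: ignore all mid-session pings."""
    return [p for p in pings if p["isFinal"]]
```

Because nothing is reset, the ping marked as final carries the same data as
today's saved-session ping, which is why the existing dashboards could keep
working during this interim stage.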
I think this gradual approach would allow us to focus on converting one
backend at a time and to use the experience gained with FHR to convert
Telemetry.
What do you think?
Vladan
On Sun, Jan 25, 2015 at 11:58 AM, Benjamin Smedberg <benjamin at smedbergs.us>
wrote:
>
>
> Georg wrote:
>
> My assumption was that we will not reset.
>
>
> The proposal as written is that we will reset all the histograms for each
> subsession. Otherwise, realtime dashboards which process incoming pings will
> multiple-count various metrics, and we definitely want to avoid this.
>
> On 1/24/2015 1:35 AM, Vladan Djeric wrote:
>
>
>
> - It will be hard to do per-session analyses
>
>
> I have several responses here:
>
> 1) It will be a bit harder than currently, but I don't think that it will
> be extremely hard. There will be an efficient API to fetch all the pings
> associated with a user, which should make it relatively straightforward to
> stitch together an entire session from its pieces. This is a functional
> requirement for the more qualitative analyses, which will have to stitch
> together an entire user history and not just individual sessions. Doing an
> individual session should be fairly easy.
>
> 2) I treat the session orientation of telemetry as an unfortunate
> limitation, not a desirable property, for almost all of the use cases that
> I've seen. I'd like us to try and move away from reporting metrics based on
> sessions. Can you describe in more detail the use cases where analyzing
> data by session is preferable to analyzing by some constant denominator? We
> should be willing to use both clock time and activeTicks as denominators,
> and these denominators can both be calculated looking at individual
> subsession pings.
>
> 3) For the case of the current telemetry dashboard, I'd like to understand
> why simply replacing the current whole-session analysis with the new
> subsessions would produce statistically worse results than the current
> session-based analysis.
>
>
> - Many of the 1000+ Telemetry measurements are inherently
> "per-session" and can't meaningfully be split into session fragments:
> - Flag histograms
> <https://developer.mozilla.org/en-US/docs/Mozilla/Performance/Adding_a_new_Telemetry_probe#Choosing_a_Histogram_Type> track
> feature usage per-session.
> - They are automatically initialized to a value of "false" at the
> beginning of a session, and can only be set to "true" once.
> - If we reset Telemetry measurements every time we create a new
> ping, we'll be reporting nonsense: pings from the same session will
> contradict each other on whether a feature was ever used during the session
> - This would feed bad data to both the dashboards and any
> custom analyses
>
> *If* you really care about this per-session, why can't you just take
> "true" from any of the subsessions as an indication that it's true for the
> entire session?
>
> And if we just report by subsession, how is this much different from the
> skew that we already have between users who have lots of short sessions and
> users that keep their browser open for days or weeks?
>
> Maybe this just indicates that we're mis-using histograms for
> non-aggregate measurements, and we should just have a separate list of flag
> metrics which are treated differently.
>
>
> - Count histograms
> <https://developer.mozilla.org/en-US/docs/Mozilla/Performance/Adding_a_new_Telemetry_probe#Choosing_a_Histogram_Type> are
> also per-session measurements. You can't aggregate a count-histogram value
> from the middle of a session together with final values from other sessions
>
>
> Won't summing across the subsessions get you the total count for the
> session?
>
>
> - For custom analyses, we sometimes want to correlate measurements
> from the beginning of a session with measurements from the end of a session
> (which could have lasted several days), e.g. histograms related to startup
> performance vs later performance
> - We would need that messy server-side session reconstruction
> process to get at per-session data.
> - More generally, a ping generated as a result of local time &
> environment changes is not inherently meaningful to us, unlike a full user
> session
>
>
> I don't understand this case. Assuming session stitching works, which is a
> general requirement for all sorts of analyses, this should work no worse
> than currently, and you potentially have finer-grain data on the subsequent
> days if that's useful.
>
>
> - Resetting Telemetry and FHR data when a TelemetryExperiment begins
> removes valuable context from the experiment ping. It's possible to
> reconstruct it, but that's yet another server-side job to run
>
> I don't understand this. Is this also assuming that stitching is
> expensive?
>
>
> - There's overhead from sending a new ping for each mid-session
> environment change
> - There's also a small privacy issue with creating ordered,
> fine-grained reports of user actions, e.g. when a user goes through their
> add-ons list and disables 5 addons, we report each user action
> - Either coalesce successive environment-change pings, or carefully
> vet which mid-session environment changes generate a new ping
>
>
> I think it's worth considering whether there's a window of time where
> multiple changes get coalesced. But I'm not particularly worried about the
> privacy problem, since we do in fact want to record when users disable
> addons.
>
> I'd like to propose that we implement the following modifications to the
> FHR/Telemetry v4 document:
>
> 1. Do not reset *Telemetry* measurements when a session crosses the
> 24-hour boundary
> - Continue to "reset" Telemetry measurements when we start a new
> session
> - There's no need to reset Telemetry on most environment changes
> (e.g. amount of memory installed) since those can't happen without a
> Firefox restart anyway.
> 2. Record mid-session environment changes (add-ons and
> TelemetryExperiments) in a special section in the ping.
> - For each such environment change, document the change in the
> section and also attach a snapshot of the Telemetry & FHR data at the time
> of the change
> - After the snapshot is saved, reset Telemetry and FHR measurements
> for the current session. In other words, snapshot & then build up a diff
> - For each additional environment change during the same session,
> just repeat and append to the new section
> - Telemetry backend scripts (dashboard, regression detector etc)
> can just ignore experiment/add-on change pings
>
> This model has some nice properties:
>
> - The *final ping* of a session is equivalent to a Telemetry
> saved-session ping
> - Per-session analyses are as easy to do as before
> - No need to run any session reconstruction jobs!
> - Every main ping submitted is meaningful without needing any
> reconstruction steps. All pings will contain the current FHR state + all
> the Telemetry measurements from the current session
> - Most pings will only have one environment change, so the relevant
> measurements that happened after the change are all going to be in the
> regular Telemetry/FHR section
> - However, when deeper analysis is required, Experiment pings will
> also have information about what was happening BEFORE the experiment began
> - Analyzing pings with multiple environment changes won't be much
> harder
>
>
> I feel like this proposal is optimizing for the wrong things.
>
> You are making a distinction between "Telemetry" measurements and other
> measurements in a way which I am specifically trying to avoid. The goal is
> to use the common histogram system for everything. At least some of those
> measurements must be distinguished by subsession. I explicitly want to get
> rid of the current situation where "telemetry metrics" are treated one way,
> and "FHR metrics" are treated in some entirely separate manner. We want to
> be able to use the standard histograms/keyed histograms for almost
> everything.
>
> For the simple things like the telemetry dashboard, I believe that doing
> all analysis by subsession is good enough (no worse than the current
> situation). For more complex queries, both stitching together an entire
> session and stitching together the history per-user will not only be
> possible but should be fairly efficient.
>
> --BDS
>
>
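The per-session stitching rules debated in the quoted thread above (take
"true" from any subsession for a flag histogram, sum the subsessions for a
count histogram) can be sketched roughly as follows. This is only an
illustration under reset-on-new-ping semantics, not backend code: the ping
layout with "flags" and "counts" sections and the stitch_sessions() helper are
assumptions made up for the example, and only "sessionID" comes from the
proposed format.

```python
from collections import defaultdict


def stitch_sessions(pings):
    """Merge subsession pings into one record per session.

    Hypothetical ping layout: {"sessionID": str, "flags": {...}, "counts": {...}}.
    """
    sessions = defaultdict(lambda: {"flags": {}, "counts": {}})
    for ping in pings:
        merged = sessions[ping["sessionID"]]
        # Flag histograms: true for the session if true in any subsession.
        for name, value in ping.get("flags", {}).items():
            merged["flags"][name] = merged["flags"].get(name, False) or value
        # Count histograms: the session total is the sum over its subsessions.
        for name, value in ping.get("counts", {}).items():
            merged["counts"][name] = merged["counts"].get(name, 0) + value
    return dict(sessions)
```

These two rules are the easy cases. Measurements such as
Background-Hang-Reporter data or UITelemetry would need their own custom
merge rules, which this sketch does not attempt.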