Suggestions for the new unified FHR/Telemetry/Experiment ping

Vladan Djeric vdjeric at mozilla.com
Fri Jan 23 22:35:21 PST 2015


Hi all,

I've been thinking about Telemetry and FHR unification, and I think we have
to be careful around the new ping semantics. Client-side changes require
backend architecture changes, and both ultimately determine which analyses
are convenient or even feasible.

To recap, we are merging 3 measurement systems with different semantics:

   - Telemetry's measurements are implicitly "*per session*". Telemetry
   creates a new ping at the beginning of every Firefox session and records a
   "saved-session" ping at the end of the session.
      - *There is also an "idle-daily" ping sent during the session at most
      every 24 hours, but there are backend problems so idle-dailies
      are currently not being used for anything*
   - FHR's reporting of user activity & browser state is mostly with
   respect to *calendar days*
   - TelemetryExperiments focus on differences between the test group and
   the control group

We wish to unify the FHR & Telemetry pings into a single ping and make data
collected during A/B TelemetryExperiments more precise.

*So first off, let me describe my interpretation of the current unification
proposal
<https://docs.google.com/document/d/1IGpzsYGi_sq3YFQDAPyKOkU_BKvXAC95fZYA2i4ceVs/edit#>:*

   - Whenever a new ping is started, all the FHR & Telemetry measurements
   for the current session will be reset
   - In the new system, Firefox starts collecting a new ping whenever:
      1. a new Firefox session is started
      2. a new day has begun (not sure if it's every 24 hours of uptime, or
      if it's based on local time e.g. midnight local time)
      3. whenever the Firefox "environment" changes
         - Examples: a user enables or disables an addon, the graphics
         driver is updated, an A/B experiment begins or ends (this happens
         in the middle of a session), Firefox HW acceleration is disabled,
         etc.
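
To make the three triggers concrete, here is a minimal sketch of the split
rule as I read it. The `ClientState` type, its field names, and the reason
strings are hypothetical and do not come from the proposal document. The
sketch also assumes the "new day" rule means local midnight, which the text
above leaves open.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ClientState:
    """Hypothetical view of the client at the moment it checks for a split."""
    session_id: str
    local_day: date      # assumption: "new day" means local midnight
    environment: dict    # e.g. {"addons": ("x",), "experiment": None}


def new_ping_reason(prev: Optional[ClientState], cur: ClientState) -> Optional[str]:
    """Return why a new ping should start, or None to keep filling the current one."""
    if prev is None or prev.session_id != cur.session_id:
        return "new-session"
    if prev.local_day != cur.local_day:
        return "new-day"
    if prev.environment != cur.environment:
        return "environment-change"
    return None
```

Under the proposal, any non-None result also resets every FHR and Telemetry
measurement, which is the behavior the rest of this message argues about.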

This proposal has some nice properties unique to it:

   - Relatively straightforward to implement on the client
   - No duplicated measurements (the existing Telemetry "saved-session"
   pings duplicate the measurements in the "idle-daily" pings)
   - Data is sent to Mozilla servers quickly (no ping covers more than a
   24-hour period, so no waiting for a session to end)

However, I think these semantics would create a lot of problems for
Telemetry analysis:

   - It will be hard to do per-session analyses
      - It's going to be hard to re-construct sessions from session
      fragments on the Telemetry server
         - A 2-week long session will have at least 14 pings scattered
         across 14 daily archives
      - The reconstruction process could get messy and fragile
         - Code for merging different histogram types, dealing with missing
         session fragments, storing and updating the merged pings, backfills,
         correcting errors, etc.
         - For FHR's needs, I think this is unavoidable (and easier!),
         unless we go back to submitting 6-month user histories in every
         ping as FHR does now
         - It's better not to do these kinds of reconstruction jobs unless
         we absolutely have to
   - Many of the 1000+ Telemetry measurements are inherently "per-session"
   and can't meaningfully be split into session fragments:
      - Flag histograms
      <https://developer.mozilla.org/en-US/docs/Mozilla/Performance/Adding_a_new_Telemetry_probe#Choosing_a_Histogram_Type>
      track feature usage per-session.
         - They are automatically initialized to a value of "false" at the
         beginning of a session, and can only be set to "true" once.
         - If we reset Telemetry measurements every time we create a new
         ping, we'll be reporting nonsense: pings from the same session will
         contradict each other on whether a feature was ever used during
         the session
            - This would feed bad data to both the dashboards and any
            custom analyses
      - Count histograms
      <https://developer.mozilla.org/en-US/docs/Mozilla/Performance/Adding_a_new_Telemetry_probe#Choosing_a_Histogram_Type>
      are also per-session measurements. You can't aggregate a
      count-histogram value from the middle of a session together with
      final values from other sessions
      - You might be saying: "So only report those histogram types in the
      final ping of the session!"
         - We don't know all the histograms that need this treatment. Other
         histogram types are being used to represent per-session
         measurements such as configuration settings or feature usage, e.g.
         the "CANVAS_2D_USED" boolean histogram
         - Some keyed-histograms have per-session semantics, some don't
         - Some Telemetry users want measurements expressed in "per
         session" terms and those measurements aren't necessarily in count
         & flag histograms
         - See next point about custom analyses
   - For custom analyses, we sometimes want to correlate measurements
   from the beginning of a session with measurements from the end of a session
   (which could have lasted several days), e.g. histograms related to startup
   performance vs later performance
      - We would need that messy server-side session reconstruction process
      to get at per-session data.
      - More generally, a ping generated as a result of local time &
      environment changes is not inherently meaningful to us, unlike a
      full user session
   - Resetting Telemetry and FHR data when a TelemetryExperiment begins
   removes valuable context from the experiment ping. It's possible to
   reconstruct it, but that's yet another server-side job to run
   - There's overhead from sending a new ping for each mid-session
   environment change
      - There's also a small privacy issue with creating ordered,
      fine-grained reports of user actions, e.g. when a user goes through their
      add-ons list and disables 5 addons, we report each user action
      - We should either coalesce successive environment-change pings, or
      carefully vet which mid-session environment changes generate a new ping
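
The flag-histogram problem can be shown with a toy model. This is only a
sketch: `FlagHistogram` and `run_session` are hypothetical stand-ins, not
Firefox code, and the event names are made up for the example. It compares
what each ping of one session reports when measurements are reset at every
new ping versus when they are kept for the whole session.

```python
class FlagHistogram:
    """Toy flag histogram: starts as False and can only be flipped to True."""

    def __init__(self):
        self.value = False

    def set(self):
        self.value = True


def run_session(events, reset_on_new_ping):
    """Replay one session and return the flag value reported by each ping.

    `events` is an ordered list of "use-feature" and "new-ping" markers.
    A final ping is always reported when the session ends.
    """
    flag = FlagHistogram()
    reported = []
    for event in events:
        if event == "use-feature":
            flag.set()
        elif event == "new-ping":
            reported.append(flag.value)
            if reset_on_new_ping:
                flag = FlagHistogram()  # measurements start over
    reported.append(flag.value)  # ping sent at the end of the session
    return reported


# The feature is used once, early in a session that spans three pings.
events = ["use-feature", "new-ping", "new-ping"]
fragments = run_session(events, reset_on_new_ping=True)
whole = run_session(events, reset_on_new_ping=False)
print(fragments)  # [True, False, False]
print(whole)      # [True, True, True]
```

With resets, two of the three pings claim the feature was never used, so the
server has to OR the fragments back together, and a single missing fragment
can silently flip the answer.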

I'd like to propose that we implement the following modifications to the
FHR/Telemetry v4 document:

   1. Do not reset *Telemetry* measurements when a session crosses the
   24-hour boundary
      - Continue to "reset" Telemetry measurements when we start a new
      session
      - There's no need to reset Telemetry on most environment changes
      (e.g. amount of memory installed) since those can't happen without a
      Firefox restart anyway.
   2. Record mid-session environment changes (add-ons and
   TelemetryExperiments) in a special section in the ping.
      - For each such environment change, document the change in the
      section and also attach a snapshot of the Telemetry & FHR data at the
      time of the change
      - After the snapshot is saved, reset Telemetry and FHR measurements
      for the current session. In other words, snapshot & then build up a diff
      - For each additional environment change during the same session,
      just repeat and append to the new section
      - Telemetry backend scripts (dashboard, regression detector etc) can
      just ignore experiment/add-on change pings
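
A minimal sketch of modification 2, under stated assumptions: the
`SessionPing` class, the payload field names (`environmentChanges`,
`snapshot`, and so on) and the `TAB_OPEN_COUNT` measurement are hypothetical,
and measurements are modeled as plain counters rather than real histograms.

```python
import copy


class SessionPing:
    """Hypothetical in-memory model of one session's ping."""

    def __init__(self, environment):
        self.environment = dict(environment)
        self.measurements = {}          # data since the last environment change
        self.environment_changes = []   # the "special section" of the ping

    def accumulate(self, name, amount=1):
        self.measurements[name] = self.measurements.get(name, 0) + amount

    def on_environment_change(self, key, new_value):
        # Document the change and attach a snapshot of the data so far...
        self.environment_changes.append({
            "changed": key,
            "old": self.environment.get(key),
            "new": new_value,
            "snapshot": copy.deepcopy(self.measurements),
        })
        self.environment[key] = new_value
        # ...then reset, so later measurements build up as a diff.
        self.measurements = {}

    def to_payload(self):
        return {
            "environment": dict(self.environment),
            "environmentChanges": list(self.environment_changes),
            "measurements": dict(self.measurements),
        }


ping = SessionPing({"experiment": None})
ping.accumulate("TAB_OPEN_COUNT", 3)
ping.on_environment_change("experiment", "hypothetical-experiment-1")
ping.accumulate("TAB_OPEN_COUNT", 2)
payload = ping.to_payload()
```

Here `payload["measurements"]` holds only what happened after the experiment
started, while the snapshot in `payload["environmentChanges"][0]` keeps what
happened before it, all within a single ping.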

This model has some nice properties:

   - The *final ping* of a session is equivalent to a Telemetry
   saved-session ping
      - Per-session analyses are as easy to do as before
   - No need to run any session reconstruction jobs!
   - Every main ping submitted is meaningful without needing any
   reconstruction steps. All pings will contain the current FHR state + all
   the Telemetry measurements from the current session
   - Most pings will only have one environment change, so the relevant
   measurements that happened after the change are all going to be in the
   regular Telemetry/FHR section
   - However, when deeper analysis is required, Experiment pings will also
   have information about what was happening BEFORE the experiment began
   - Analyzing pings with multiple environment changes won't be much harder

Admittedly, there is a trade-off to not resetting Telemetry after the
24-hour period.

   - Since each main ping submitted will contain Telemetry data from the
   start of the session, getting Telemetry data collected over a single day
   will be hard. I think this is an acceptable tradeoff.
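
To illustrate what that cost looks like, here is one possible way to recover
a single day's data: subtract the previous cumulative ping of the same
session from the current one. This is a sketch, not a proposed backend job.
The `daily_delta` helper and the bucket labels are hypothetical, it only
makes sense for histograms that are plain bucket counts, and it assumes no
environment-change reset happened between the two pings.

```python
def daily_delta(previous, current):
    """Bucket-wise difference between two cumulative histograms.

    Both arguments map a bucket label to the count accumulated since the
    start of the same session.
    """
    delta = {}
    for bucket, count in current.items():
        diff = count - previous.get(bucket, 0)
        if diff < 0:
            # Counts never go down within one cumulative series.
            raise ValueError("pings are not from the same cumulative series")
        delta[bucket] = diff
    return delta


end_of_day_1 = {"0-10ms": 40, "10-100ms": 5}
end_of_day_2 = {"0-10ms": 70, "10-100ms": 9, "100ms+": 1}
day_2_only = daily_delta(end_of_day_1, end_of_day_2)
print(day_2_only)  # {'0-10ms': 30, '10-100ms': 4, '100ms+': 1}
```

The subtraction needs both pings to have arrived and does not work for flag
histograms, which is part of why per-day data becomes harder to get.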

I want to mention a few other solutions, but these are not as appealing:

   - Collect and submit both per-session and per-ping Telemetry data...
   This doubles Telemetry run-time memory use
   - Reconstruct sessions by merging saved pings on the client-side... I
   think this would be a mess
   - Have TelemetryExperiments take effect on restart instead of
   mid-session... This biases the experiment data against longer-running
   sessions, and addons would still be an issue


Let me know what you think.

Thank you,

Vladan