Unified telemetry data outage 2015-06-01 14:22-18:33 UTC

Wesley Dawson whd at mozilla.com
Tue Jun 2 01:53:01 UTC 2015

Earlier today we had an outage with the new data pipeline. This is the
first (and hopefully last) full data outage we've had.

The root cause was a combination of a traffic spike after unified
telemetry (UT) was recently enabled in beta and a server-side
configuration error. The spike caused the edge nodes to run out of memory,
and the configuration error caused the ELB to replace them with
misconfigured nodes. The server configuration has been corrected and now
launches instances with more RAM to mitigate the issue.

Depending on how much of the traffic increase persists and how spiky the
traffic is, there may be other server-side issues to sort out, but they
should not result in a full data outage unless we receive an unexpectedly
large amount of traffic.

Because of this increase in traffic, the demo instance at
https://pipeline-prototype-cep.dev.mozaws.net/ is temporarily unavailable,
but it should be back online shortly.

