Firefox Accounts DevOps next steps for November
Benson Wong
bwong at mozilla.com
Tue Nov 5 16:46:26 PST 2013
I don't mean to beat a dead horse, but this HA talk spurred me to research some of the AWS outages of the past few years. I know I joke about us-east-1 going down when it snows, but reading through the AWS event postmortems is enlightening.
Here's my research:
https://www.evernote.com/shard/s3/sh/40c1bbca-6143-44df-9c74-de4dabbd8109/7f487e75c84b9aa7bfc3ae7ecc27ad12
Some notables:
- us-east-1 seems to get most of the issues.
- a lot of the issues are caused by software bugs or human errors.
- EBS seems to be at the root of a lot of issues, and the cause is usually a bug in the re-mirroring logic
- when EBS fails, RDS is always affected. Multi-AZ RDS failover seems to work, except when there are bugs
- fail over mostly works as designed, except for when there are bugs
The trend seems to be that when a region goes down, a lot of the time the cause is EBS.
Even then, it is usually a black-swan-like event.
Gene: you failed Persona over between regions a few times. Any insight into what the triggers were for that?
Ben.
----- Original Message -----
From: "Chris Karlof" <ckarlof at mozilla.com>
To: "Ryan Kelly" <rfkelly at mozilla.com>
Cc: "Lloyd Hilaiel" <lhilaiel at mozilla.com>, "Benson Wong" <bwong at mozilla.com>, "Mozilla Services Operations" <services-ops at mozilla.com>, dev-fxacct at mozilla.org, "Gene Wood" <gene at mozilla.com>
Sent: Tuesday, November 5, 2013 3:26:45 PM
Subject: Re: Firefox Accounts DevOps next steps for November
On Nov 5, 2013, at 1:40 PM, Ryan Kelly <rfkelly at mozilla.com> wrote:
> On 5/11/2013 10:58 PM, Lloyd Hilaiel wrote:
>> Not going multi-region from day one makes me nervous. Technology
>> selections which make it harder make me even more nervous. Can we hit
>> HA requirements without it?
>
> Core outstanding question: what are our concrete HA requirements here?
> The clearest operational requirement we currently have is "get this
> thing stood up fast". Hence MySQL, hence RDS.
>
It would be nice to have HA requirements written down. Mayo used to say that 15-20 min failover during a region failure would be fine, but that was before FxA was intended to support the world.
I encourage others to weigh in here. Left to my own devices, I'd lean toward taking on more HA risk in the beginning in favor of keeping things simple and nimble. I want to aggressively avoid Rube Goldberg complexity foot guns before we even have any users.
-chris
> That said, we haven't taken a Cassandra solution off the table
> completely. We're still working and coding with it in mind for the
> future, but there was enough uncertainty around it to justify removal
> from the critical path to shipping.
>
>
> Ryan