Firefox Accounts DevOps next steps for November

Benson Wong bwong at mozilla.com
Tue Nov 5 16:46:26 PST 2013


I don't mean to beat a dead horse, but this HA talk spurred some research into some of the AWS outages of the past few years. I know I joke about us-east-1 going down when it snows, but reading through the AWS event postmortems is enlightening.

Here's my research: 

https://www.evernote.com/shard/s3/sh/40c1bbca-6143-44df-9c74-de4dabbd8109/7f487e75c84b9aa7bfc3ae7ecc27ad12

Some notables: 

- us-east-1 seems to get most of the issues. 
- a lot of the issues are caused by software bugs or human errors. 
- EBS seems to be at the root of a lot of issues, usually because of a bug in the re-mirroring logic
- when EBS fails, RDS is always affected. Multi-AZ RDS failover seems to work, except when there are bugs
- failover mostly works as designed, except when there are bugs

The trend seems to be that when a region goes down, a lot of the time it is EBS.
Even then, it is usually a black-swan-like event.

Gene: you've failed Persona over between regions a few times; any insight on what the triggers were for that?

Ben.


----- Original Message -----
From: "Chris Karlof" <ckarlof at mozilla.com>
To: "Ryan Kelly" <rfkelly at mozilla.com>
Cc: "Lloyd Hilaiel" <lhilaiel at mozilla.com>, "Benson Wong" <bwong at mozilla.com>, "Mozilla Services Operations" <services-ops at mozilla.com>, dev-fxacct at mozilla.org, "Gene Wood" <gene at mozilla.com>
Sent: Tuesday, November 5, 2013 3:26:45 PM
Subject: Re: Firefox Accounts DevOps next steps for November


On Nov 5, 2013, at 1:40 PM, Ryan Kelly <rfkelly at mozilla.com> wrote:

> On 5/11/2013 10:58 PM, Lloyd Hilaiel wrote:
>> Not going multi-region from day one makes me nervous.  Technology
>> selections which make it harder make me even more nervous.  Can we hit
>> HA requirements without it?  
> 
> Core outstanding question: what are our concrete HA requirements here?
> The clearest operational requirement we currently have is "get this
> thing stood up fast".  Hence MySQL, hence RDS.
> 

It would be nice to have HA requirements written down. Mayo used to say that 15-20 min failover during a region failure would be fine, but that was before FxA was intended to support the world.

I encourage others to weigh in here. Left to my own devices, I'd lean toward taking on more HA risk in the beginning in favor of keeping things simple and nimble. I want to aggressively avoid Rube Goldberg complexity foot guns before we even have any users.

-chris


> That said, we haven't taken a Cassandra solution off the table
> completely.  We're still working and coding with it in mind for the
> future, but there was enough uncertainty around it to justify removal
> from the critical path to shipping.
> 
> 
>  Ryan



