Firefox Accounts DevOps next steps for November

Benson Wong bwong at
Tue Nov 5 16:46:26 PST 2013

I don't mean to beat a dead horse, but this HA talk spurred some research into AWS outages over the past few years. I know I joke about us-east-1 going down when it snows, but reading through the AWS event postmortems is enlightening.

Here's my research:

Some notables: 

- us-east-1 seems to get most of the issues.
- A lot of the issues are caused by software bugs or human error.
- EBS seems to be at the root of a lot of issues, usually because of a bug in the re-mirroring logic.
- When EBS fails, RDS is always affected. Multi-AZ RDS failover seems to work, except when there are bugs.
- Failover mostly works as designed, except when there are bugs.

The trend seems to be that when a region goes down, a lot of the time it is EBS.
Even then, it is usually a black-swan-like event.

Gene: you failed Persona over to another region a few times. Any insight into what the triggers were?


----- Original Message -----
From: "Chris Karlof" <ckarlof at>
To: "Ryan Kelly" <rfkelly at>
Cc: "Lloyd Hilaiel" <lhilaiel at>, "Benson Wong" <bwong at>, "Mozilla Services Operations" <services-ops at>, dev-fxacct at, "Gene Wood" <gene at>
Sent: Tuesday, November 5, 2013 3:26:45 PM
Subject: Re: Firefox Accounts DevOps next steps for November

On Nov 5, 2013, at 1:40 PM, Ryan Kelly <rfkelly at> wrote:

> On 5/11/2013 10:58 PM, Lloyd Hilaiel wrote:
>> Not going multi-region from day one makes me nervous.  Technology
>> selections which make it harder make me even more nervous.  Can we hit
>> HA requirements without it?  
> Core outstanding question: what are our concrete HA requirements here?
> The clearest operational requirement we currently have is "get this
> thing stood up fast".  Hence MySQL, hence RDS.

It would be nice to have HA requirements written down. Mayo used to say that 15-20 min failover during a region failure would be fine, but that was before FxA was intended to support the world.

I encourage others to weigh in here. Left to my own devices, I'd lean toward taking on more HA risk in the beginning in favor of keeping things simple and nimble. I want to aggressively avoid Rube Goldberg complexity foot guns before we even have any users.


> That said, we haven't taken a Cassandra solution off the table
> completely.  We're still working and coding with it in mind for the
> future, but there was enough uncertainty around it to justify removal
> from the critical path to shipping.
>  Ryan
