Wednesday, March 29, 2017

Introducing Miranda: 9s of Reliability

In preparation for a talk I'm giving at DOSUG, I'm going to post my thoughts as they develop.

What do the various "9s" of reliability mean? Here is a table with the amount of yearly downtime that each level allows:

# of Nines   Percentage     Yearly Downtime
1            90%            36.5 days
2            99%            3.65 days
3            99.9%          8.8 hours
4            99.99%         52.6 minutes
5            99.999%        5.3 minutes
6            99.9999%       32 seconds
7            99.99999%      3.2 seconds
8            99.999999%     320 milliseconds
9            99.9999999%    32 milliseconds
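
The numbers in the table follow from multiplying the length of a year by the fraction of time the system is allowed to be down. Here is a minimal sketch of that arithmetic, assuming a 365-day year; the function name is mine, purely for illustration:

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000; assumes a 365-day year


def yearly_downtime_seconds(nines: int) -> float:
    """Return the downtime per year, in seconds, allowed at the given number of 9s."""
    unavailability = 10.0 ** -nines  # e.g. 3 nines -> 0.001 of the year
    return SECONDS_PER_YEAR * unavailability


# Print the allowance for each level in the table.
for n in range(1, 10):
    print(f"{n} nines: {yearly_downtime_seconds(n):,.4f} seconds per year")
```

Dividing the result by 60, 3,600, or 86,400 gives the minutes, hours, or days shown above.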

Any team with a working system can manage 1 9. If they even have a pager, the person carrying it does not pay a lot of attention to it.

If the company is serious about it, 2 9s is no longer a joke. They definitely have a pager that people trade off on a weekly basis. When the pager goes off, the person tries to respond in 20 minutes or less. The person on call has the phone numbers of the rest of the team in case they need something.

At 3 9s, there is definitely someone with a pager and there have been very serious conversations about getting a control center, a la NASA, for the system. The system may or may not be distributed.

At 4 9s, there is a control room manned by very humor-limited folks who have the numbers of each of the team members, and when something goes wrong, they call them. The person on call switches off weekly, reminiscent of wearing a pager, and if there isn't a hot standby, then there have been very serious conversations as to why not.

At 5 9s, there is a control room, manned by people who were too serious for the previous level. The control person's duty is to switch the system over to a hot standby and to call the on-call person when a problem develops. Being on-call is no joke, and when the pager goes off, the on-call person must respond within 10 minutes.

At 6 9s, things are insane.  The people who were too serious for 5 9s are manning the control room. There is a person whose sole duty is to decide whether to switch to the hot standby.  There are two levels of people on call, and each of them must respond within 5 minutes.

Levels 7, 8, and 9 require varying levels of hardware to support them, and there is still a control room and lists of on-call people. But the question becomes, if you are serious, what is this for? At 7 or 8 9s it becomes hard to tell if the system has been down. After all, people's internet connections are not 7 or 8 9s reliable.
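
To put a rough number on that last point: if a user reaches the system through a connection that fails independently of it, the availability the user actually sees is roughly the product of the two. The sketch below assumes independent failures and uses a made-up 3 9s figure for the connection; neither the figure nor the function name comes from any measurement.

```python
def end_to_end_availability(*availabilities: float) -> float:
    """Availability of components in series, assuming their failures are independent."""
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


service = 0.99999999  # a system with 8 nines
connection = 0.999    # hypothetical: a user's connection with 3 nines

seen_by_user = end_to_end_availability(service, connection)
print(f"Availability seen by the user: {seen_by_user:.10f}")
```

Under these assumptions the user sees slightly less than the connection's own 3 9s, so the extra 9s in the system are lost in the noise.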

Miranda Takes Systems with 1 or 2 9s and Makes Them Appear to Have Between 5 and 6 9s

Miranda is a distributed, fault-tolerant system designed to run behind a load balancer that accepts messages on behalf of the underlying system and delivers them when the system is up.

The underlying system can be down for hours or days while Miranda accepts messages for it. Therefore, the system appears to have between 5 and 6 9s of reliability, even though it actually has fewer than that.
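
The general idea of accepting messages now and delivering them later can be sketched in a few lines. This is only an illustration of store-and-forward, not Miranda's code: it is a single in-memory process with no distribution or persistence, and the names `StoreAndForward` and `FakeBackend` are hypothetical.

```python
from collections import deque
from typing import Callable, Deque, List


class FakeBackend:
    """Stand-in for the underlying system; it can be switched up or down."""

    def __init__(self) -> None:
        self.up = False
        self.received: List[str] = []

    def deliver(self, message: str) -> bool:
        """Return True if the message was taken, False if the system is down."""
        if not self.up:
            return False
        self.received.append(message)
        return True


class StoreAndForward:
    """Accept messages unconditionally and deliver them when the backend is up."""

    def __init__(self, deliver: Callable[[str], bool]) -> None:
        self._deliver = deliver
        self._pending: Deque[str] = deque()

    def accept(self, message: str) -> None:
        """Queue a message; this succeeds whether or not the backend is up."""
        self._pending.append(message)

    def flush(self) -> int:
        """Deliver queued messages in order, stopping at the first failure."""
        delivered = 0
        while self._pending:
            if not self._deliver(self._pending[0]):
                break  # backend is down; keep the message for the next attempt
            self._pending.popleft()
            delivered += 1
        return delivered

    def pending(self) -> int:
        """Number of messages still waiting for delivery."""
        return len(self._pending)


backend = FakeBackend()
buffer = StoreAndForward(backend.deliver)
buffer.accept("order 1")
buffer.accept("order 2")
print(buffer.flush(), buffer.pending())  # backend is down, nothing is delivered
backend.up = True
print(buffer.flush(), buffer.pending())  # backend is up, the queue drains
```

From the sender's side every `accept` call succeeds, which is why the combined system looks more reliable than the backend behind it.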
