How many Nines?

November 10, 2010

A cost/benefit analysis of reliability and engineering tradeoffs.

Availability refers to the ability of the user community to access the system to do work. If a user cannot access the system, it is said to be unavailable.

A typical SLA for many services is expressed in nines (9s) of availability. For example, here is the AWS S3 SLA. Before we proceed further, I’d like to note precisely how much downtime each “9” corresponds to. The table below is from Wikipedia.

Availability %           | Downtime per year | Downtime per month* | Downtime per week
90% (“one nine”)         | 36.5 days         | 72 hours            | 16.8 hours
95%                      | 18.25 days        | 36 hours            | 8.4 hours
98%                      | 7.30 days         | 14.4 hours          | 3.36 hours
99% (“two nines”)        | 3.65 days         | 7.20 hours          | 1.68 hours
99.5%                    | 1.83 days         | 3.60 hours          | 50.4 minutes
99.8%                    | 17.52 hours       | 86.23 minutes       | 20.16 minutes
99.9% (“three nines”)    | 8.76 hours        | 43.2 minutes        | 10.1 minutes
99.95%                   | 4.38 hours        | 21.56 minutes       | 5.04 minutes
99.99% (“four nines”)    | 52.56 minutes     | 4.32 minutes        | 1.01 minutes
99.999% (“five nines”)   | 5.26 minutes      | 25.9 seconds        | 6.05 seconds
99.9999% (“six nines”)   | 31.5 seconds      | 2.59 seconds        | 0.605 seconds

* For monthly calculations, a 30-day month is used.
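To make the arithmetic behind the table explicit, here is a small Python sketch (assuming a 365-day year and the 30-day month from the footnote) that reproduces a few of the rows:

```python
# Derive the downtime figures in the table from an availability percentage.
# Assumes a 365-day year and a 30-day month, as in the table above.

def downtime_seconds(availability_pct):
    """Return (per_year, per_month, per_week) downtime in seconds."""
    unavailable = 1.0 - availability_pct / 100.0
    year, month, week = 365 * 86400, 30 * 86400, 7 * 86400
    return unavailable * year, unavailable * month, unavailable * week

for pct in (99.9, 99.99, 99.999):
    per_year, per_month, per_week = downtime_seconds(pct)
    print(f"{pct}%: {per_year / 60:.2f} min/year, "
          f"{per_month / 60:.2f} min/month, {per_week:.2f} s/week")
```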

Let’s say your system is responsible for gross revenue of 12 billion USD a year, and that revenue processing is evenly distributed over time (a safer assumption for a global company; if it is not, change the model to account for peak-hour usage). If the system uptime is five nines (99.999%), should you spend the money to go to six nines (99.9999%) availability? The answer is no.

Revenue/Year is 12B: 12,000,000,000.00 USD.

Revenue/Second is 380.27 USD.

Lost revenue at five-nines availability is 120,000 USD per year; lost revenue at six nines is 12,000 USD. The delta is 108,000 USD, which is less than the fully loaded cost of one FTE. So there is no reason to go to six nines at that revenue number if it requires more than one FTE worth of work. The full model is here.
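The arithmetic behind those numbers, as a minimal Python sketch (the per-second figure assumes a 365.25-day year; the lost-revenue figures depend only on the annual total):

```python
# Minimal sketch of the lost-revenue comparison above.
ANNUAL_REVENUE = 12_000_000_000  # USD, assumed spread evenly over the year

revenue_per_second = ANNUAL_REVENUE / (365.25 * 24 * 3600)
print(f"Revenue/second: {revenue_per_second:.2f}")        # ~380.27

lost_five_nines = ANNUAL_REVENUE * (1 - 0.99999)          # 120,000
lost_six_nines = ANNUAL_REVENUE * (1 - 0.999999)          #  12,000
delta = lost_five_nines - lost_six_nines                  # 108,000
print(f"Annual lost revenue at five nines: {lost_five_nines:,.0f}")
print(f"Annual lost revenue at six nines:  {lost_six_nines:,.0f}")
print(f"Delta: {delta:,.0f}")
```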

Sean M. Doran (smd) had some interesting insights as well. I’ve added them here with a bit of editing.

It’s an actuarial issue, and you can further trade off an SLA arrangement with business interruption insurance from a third party (which might be cheaper!).

We (FSVO “we”) were exploring an SLA that really was structured as an insurance policy, with different charges for different payouts if the SLA wasn’t met: cheap if you just wanted “free” days added to the duration of the contract, less cheap if you wanted money knocked off the next invoice, and insurance-market rates for cash payouts in the event of interruption beyond what the SLA allowed, right up to payouts for maintenance work not scheduled in advance of agreement on the policy. Unfortunately, as has happened from time to time, there was a change of direction elsewhere in the organization, so this never advanced beyond a couple of experimental deals.

A couple of “be evil” observations:

Large buyers are often not really good at doing exactly the sort of analysis you wrote above. You can charge them more as a result, as you hint.

Actuaries on the supplier/insurer side now know the cost of any given outage and can feed that into decisions on how to allocate capital vs. operational investment, how free you are to make changes that risk service interruptions, and so forth.

Plus it gives you another pricing knob you can turn fairly dynamically. Unbundling and dynamic pricing are often good for both parties, although many large buyers still prefer costs that are fixed in advance, even if that turns out to be more expensive. Great. Charge them more.

Two “be stupid evil” observations:

— Telcos rarely get their invoices right in the first place; adding knobs meets with resistance as a result

— Telcos far, far prefer to make customers file claims, and they throw up barriers to filing one successfully. So do many insurers.

Finally: statistical service offerings are great, if you have sales channels that can cope with them. I wanted it out of their hands, with some wording in the contract that allowed for a device like a customer web page with sliders, letting them dynamically adjust bandwidth caps, statistical delay, and statistical drop parameters, with a couple of presets along the lines of “[ ] UUNET quality [ ] AUCS quality”. The idea was to offer just slightly better than the competition’s SLAs and measured/reported performance at a similar price, but to deliver much better performance at our almost always higher prices.


Discover and Recover

September 24, 2010

Softening the strict requirement of optimality can make problems tractable. Put another way, it is more important to quickly narrow the search for an optimal solution to a “good enough” subset than to calculate the “perfect” solution. Ordinal (which is better) before Cardinal (the value of the optimum).
Compare the two scenarios presented below:

  1. Getting the best decision for certain – Cost = $1m
  2. Getting a decision within the top 5% with probability 0.99* – Cost = $1m/x

In real life, we often settle for such a tradeoff with x = 100 to 10,000.

For systems that are not life-threatening, the focus should be on fast fault detection and mitigation (discover and recover) instead of exhaustively trying out every possible scenario which will make the system perfect but at such a cost that forward progress turns glacially slow. At which point your quicker, nimbler opponents will run over you with their faster product cycles.

People constantly deceive themselves into thinking that by paying much more they are getting a 100% solution, but in reality you never get a true 100% solution, so it’s really a choice between different levels of less than perfect.

*Under independent sampling, the standard error decreases as 1/sqrt(n), so each order of magnitude increase in certainty requires two orders of magnitude increase in sampling cost. To go from p = 0.99 to near-certainty (p = 0.99999) implies a 1,000,000-fold increase in sampling cost.
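A back-of-the-envelope Python sketch of that scaling, treating “certainty” as one minus the residual error probability:

```python
# With independent sampling the standard error shrinks like 1/sqrt(n), so
# cutting the residual error by a factor of k costs roughly k**2 more samples.

p_now, p_target = 0.99, 0.99999
k = (1 - p_now) / (1 - p_target)   # error shrinks 0.01 -> 0.00001, i.e. 1,000x
cost_multiplier = k ** 2           # ~1,000,000x more samples
print(f"error reduction: {k:,.0f}x -> sampling cost: ~{cost_multiplier:,.0f}x")
```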

References:

Cardinal vs Ordinal work by Dr. Yu-Chi Ho

Satisficing (wikipedia)


Core Business

September 16, 2010

We see continuous growth in managed services and we are confident that we can help Vodafone free up resources to focus even more on their core business and innovation

Can someone explain what Vodafone’s core business is?


cloud economics

August 9, 2010

Based on a discussion with some friends, I decided to build a very simple model pitting Amazon Web Services (AWS) against colocation in commercial space with owned gear. The model makes a few simplifying assumptions, including that managing AWS takes roughly the same order of magnitude of effort as managing your own gear. As someone put it:

You’d be surprised how much time and effort I’ve seen expended project-managing one’s cloud/hosting provider – it is not that different from the effort required for cooking up in-house automation and deployment. It’s not like people are physically installing the OS and app stack off CD-ROM anymore; I’d imagine that whether you’re automating AMIs/VMDKs or PXE, it’s a similar effort.

The results were not surprising to anyone familiar with the term ‘duty cycle.’ Think of it as taking a taxi vs. buying a car to make a trip between San Francisco and Palo Alto. If you only make the trip once a quarter, it is cheaper to take a taxi. If you make the trip every day, then you are better off buying a car. The difference is the duty cycle. If you are running infrastructure with a duty cycle of 100%, it may make sense to run in-house. The model that I used for the evaluation is here.

Note that the pricing is skewed to the very high end for colocation, so the assumptions there are conservative. Levers are in yellow. Comments are welcome.

I’d like to thank Adam, Dave and Randy for helping me make the model better.

Edit: Some folks are asking for graphs. I thought about adding sensitivity analysis to the model, but that would be missing the point. This model presents an analytical framework which you are free to copy and then sharpen up with your own business model and cost structure. Running sensitivity analysis on that will be much more interesting. Added an NPV calculation for some people who asked for it.
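As a toy illustration of the duty-cycle point (not the linked model; the hourly rate, fixed monthly cost, and break-even below are made-up placeholders), a quick Python sketch:

```python
# Toy illustration of the duty-cycle break-even point: rent by the hour (cloud)
# vs. a fixed monthly cost (colo + amortized hardware). All prices are
# placeholders, not the numbers from the linked model.

HOURLY_RENTAL = 0.50     # USD/hour for an on-demand instance (assumed)
FIXED_MONTHLY = 250.00   # USD/month for colo, power, amortized gear (assumed)
HOURS_PER_MONTH = 30 * 24

def monthly_cost(duty_cycle):
    """Cost of each option at a given duty cycle (fraction of hours in use)."""
    rental = HOURLY_RENTAL * HOURS_PER_MONTH * duty_cycle
    return rental, FIXED_MONTHLY

for duty_cycle in (0.05, 0.25, 0.50, 1.00):
    rental, owned = monthly_cost(duty_cycle)
    cheaper = "rent" if rental < owned else "own"
    print(f"duty cycle {duty_cycle:4.0%}: rent ${rental:7.2f} "
          f"vs own ${owned:7.2f} -> {cheaper}")

break_even = FIXED_MONTHLY / (HOURLY_RENTAL * HOURS_PER_MONTH)
print(f"break-even duty cycle: {break_even:.0%}")
```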

Settlement Free Interconnect and Ratios

May 17, 2010

People seem to get particularly upset about ratios in SFI requirements. The argument almost always degenerates into bit/miles and “we’ll meet you at your specified points and we’ll cold-potato.” All of these miss the salient point: the SF part of SFI is not based on bit/miles and meeting points. People always argue based on cost, when it is really based on value. Ratios, in the end, are just one data point in the equation.


Why WordPress and not Blogger part 2

April 21, 2010

WordPress has an Android client and it is excellent. Mobile is the new laptop.


Management Books

April 11, 2010

I came across “The 12 Simple Secrets of Microsoft Management: How to Think and Act Like a Microsoft Manager and Take Your Company to the Top.” Reading it now in 2010, I can’t help but chuckle at the wide-eyed fanboy writing. Then I saw “The Google Way: How One Company Is Revolutionizing Management as We Know It,” and it cemented my opinion: whenever a book endorses a particular “way” of management with the benefit of hindsight, and claims that all it would take for your company to be similarly successful is to follow the bromides in the book, it is a clear sign that the author has no clue what they are going on about.

This is what people think matters:

Smarts and skill

This is actually what matters:

Luck and skill