A cost/benefit analysis of reliability and engineering tradeoffs.
Availability refers to the ability of the user community to access the system to do work. If a user cannot access the system, it is said to be unavailable.
A typical SLA for many services is expressed in nines (9s) of availability. For example, here is the AWS S3 SLA. Before we proceed further, I’d like to note the precise amount of downtime allowed per “9.” The source of the table below is Wikipedia.
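| Availability % | Downtime per year | Downtime per month* | Downtime per week |
|---|---|---|---|
| 90% (“one nine”) | 36.5 days | 72 hours | 16.8 hours |
| 99% (“two nines”) | 3.65 days | 7.20 hours | 1.68 hours |
| 99.9% (“three nines”) | 8.76 hours | 43.2 minutes | 10.1 minutes |
| 99.99% (“four nines”) | 52.56 minutes | 4.32 minutes | 1.01 minutes |
| 99.999% (“five nines”) | 5.26 minutes | 25.9 seconds | 6.05 seconds |
| 99.9999% (“six nines”) | 31.5 seconds | 2.59 seconds | 0.605 seconds |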
* For monthly calculations, a 30-day month is used.
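The downtime budget for each level of nines is plain arithmetic, so it can be recomputed rather than looked up. Here is a minimal Python sketch, assuming a 365-day year, a 30-day month, and a 7-day week; the function name `downtime_seconds` and the `PERIOD_DAYS` mapping are illustrative names of mine, not part of any library.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Assumed period lengths: 365-day year, 30-day month, 7-day week.
PERIOD_DAYS = {"year": 365, "month": 30, "week": 7}


def downtime_seconds(availability_pct, period):
    """Return the downtime (in seconds) allowed per period at a given availability."""
    unavailable_fraction = 1 - availability_pct / 100
    return unavailable_fraction * PERIOD_DAYS[period] * SECONDS_PER_DAY


if __name__ == "__main__":
    for pct in (90, 99, 99.9, 99.99, 99.999, 99.9999):
        budget = {p: downtime_seconds(pct, p) for p in PERIOD_DAYS}
        print(
            f"{pct}%: "
            f"{budget['year']:.1f} s/year, "
            f"{budget['month']:.2f} s/month, "
            f"{budget['week']:.3f} s/week"
        )
```

Working in seconds keeps the comparison between levels uniform; convert to minutes, hours, or days only for display.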
Let’s say your system is responsible for a gross revenue of 12 billion USD a year, and the revenue processing is evenly distributed over time (a reasonably safe assumption for a global company; if it is not, change the model to account for peak-hours usage). If the system uptime is five 9s (99.999%), should you spend the money to go to six 9s (99.9999%) availability? The answer is no.
Revenue/Year is 12B USD: 12,000,000,000.00.
Revenue/Second is roughly 380 USD (the exact figure depends on the year length used: 380.52 for a 365-day year, 380.26 for 365.25 days).
Lost revenue per year at five 9s availability is 120,000 USD (0.001% of 12B), and at six 9s it is 12,000 USD (0.0001% of 12B). The delta is 108,000 USD, which is less than the loaded cost of one FTE. So at that revenue number there is no reason to go to six 9s if it requires more than one FTE’s worth of work. Full model is here.
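The same comparison can be written as a small Python sketch, under the same assumption that revenue is spread evenly over the year. The function names and the `FTE_LOADED_COST` figure are hypothetical placeholders of mine (the post does not state a loaded FTE cost); substitute your own numbers.

```python
ANNUAL_REVENUE = 12_000_000_000  # USD per year, from the example above

# Hypothetical loaded cost of one FTE per year; an assumption, not a figure from the post.
FTE_LOADED_COST = 200_000


def lost_revenue(annual_revenue, availability_pct):
    """Expected annual revenue lost to downtime, assuming evenly distributed revenue."""
    return annual_revenue * (1 - availability_pct / 100)


def worth_upgrading(annual_revenue, current_pct, target_pct, annual_cost):
    """True if the revenue recovered by the upgrade exceeds its annual cost."""
    recovered = lost_revenue(annual_revenue, current_pct) - lost_revenue(
        annual_revenue, target_pct
    )
    return recovered > annual_cost


loss_five_nines = lost_revenue(ANNUAL_REVENUE, 99.999)
loss_six_nines = lost_revenue(ANNUAL_REVENUE, 99.9999)
delta = loss_five_nines - loss_six_nines

if __name__ == "__main__":
    print(f"Lost at five 9s: {loss_five_nines:,.0f} USD")
    print(f"Lost at six 9s:  {loss_six_nines:,.0f} USD")
    print(f"Delta:           {delta:,.0f} USD")
    print(
        "Worth one FTE?  ",
        worth_upgrading(ANNUAL_REVENUE, 99.999, 99.9999, FTE_LOADED_COST),
    )
```

If revenue is concentrated in peak hours, replace the flat multiplication in `lost_revenue` with a weighting by time of day; the break-even comparison stays the same.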
Sean M. Doran (smd) had some interesting insights as well. I’ve added them here with a bit of editing.
It’s an actuarial issue, and you can further trade off an SLA arrangement with business interruption insurance from a third party (which might be cheaper!).
We (FSVO “we”) were exploring an SLA that really was structured as an insurance policy with different charges for different payouts if the SLA weren’t met — cheap if you just want to add “free” days to the duration of the contract, less cheap if you wanted to have money knocked off the next invoice, and insurance market rates for cash payouts in the event of interruption beyond that in the SLA, right up to payouts for maintenance work not scheduled in advance of agreement on the policy. Unfortunately, as has happened from time to time, there was a change of direction elsewhere in the organization, so this never advanced beyond a couple of experimental deals.
A couple of “be evil” observations:
Large buyers are often not really good at doing exactly the sort of analysis you wrote above. You can charge them more as a result, as you hint.
Actuaries on the supplier/insurer side now know the cost of any given outage and can feed that into decisions on how to allocate capital vs operational investment, how free you are to make changes that risk service interruptions, and so forth.
Plus it gives you another pricing knob you can turn fairly dynamically. Unbundling and dynamic pricing are often good for both parties, although many large buyers still prefer costs that are fixed in advance, even if that turns out to be more expensive. Great. Charge them more.
Two “be stupid evil” observations:
— Telcos rarely get their invoices right in the first place; adding knobs meets with resistance as a result
— Telcos far far far prefer people to make claims and throw up barriers to making one successfully. So do many insurers.
Finally: statistical service offerings are great, if you have sales channels that can cope with them. I wanted it out of their hands, with some wording in the contract that allowed for a device like a customer web page with sliders that allowed them to dynamically adjust bandwidth caps, statistical delay and statistical drop parameters with a couple of presets along the lines of “[ ] UUNET quality [ ] AUCS quality”. The idea was to offer just slightly better than the competition’s SLAs and measured/reported performance at a similar price, but our much better performance for our almost always higher prices.