We shape our tools and then our tools shape us

December 26, 2017

The title words are by Father John Culkin, SJ. Re-reading them years later made me reflect on my journey here.

Memories of events are anisotropic: Memory changes the circumstances, reworks the lessons and retells the stories, so the UUNET I remember isn’t the UUNET that was, but it’ll do. Google deservedly gets the credit for SRE, but it was at UUNET that I first remember learning from Bill Barns, Mike O’Dell, and Louis Mamakos et al. that to scale in distributed systems, you had to build machines that built machines.


How many Nines?

November 10, 2010

A cost/benefit analysis of reliability and engineering tradeoffs.

Availability refers to the ability of the user community to access the system to do work. If a user cannot access the system, it is said to be unavailable.

A typical SLA for many services is expressed in nines (9s) of availability. For example, here is the AWS S3 SLA. Before we proceed further, I’d like to lay out the precise downtime allowed per “9.” The source of the table below is Wikipedia.

Availability % | Downtime per year | Downtime per month* | Downtime per week
90% (“one nine”) | 36.5 days | 72 hours | 16.8 hours
95% | 18.25 days | 36 hours | 8.4 hours
98% | 7.30 days | 14.4 hours | 3.36 hours
99% (“two nines”) | 3.65 days | 7.20 hours | 1.68 hours
99.5% | 1.83 days | 3.60 hours | 50.4 minutes
99.8% | 17.52 hours | 86.4 minutes | 20.16 minutes
99.9% (“three nines”) | 8.76 hours | 43.2 minutes | 10.1 minutes
99.95% | 4.38 hours | 21.6 minutes | 5.04 minutes
99.99% (“four nines”) | 52.56 minutes | 4.32 minutes | 1.01 minutes
99.999% (“five nines”) | 5.26 minutes | 25.9 seconds | 6.05 seconds
99.9999% (“six nines”) | 31.5 seconds | 2.59 seconds | 0.605 seconds

* For monthly calculations, a 30-day month is used.
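The table rows all follow from one formula: allowed downtime is (1 - availability) times the length of the period. Here is a minimal Python sketch of that arithmetic. The function name and the period lengths (a 365-day year, a 30-day month, a 7-day week) are my assumptions for illustration, chosen to match the footnote above.

```python
# Sketch: allowed downtime for a given availability percentage.
# Period lengths are assumptions: 365-day year, 30-day month, 7-day week.
SECONDS_PER_DAY = 24 * 60 * 60
PERIOD_DAYS = {"year": 365, "month": 30, "week": 7}


def downtime_seconds(availability_pct: float, period: str) -> float:
    """Return the seconds of downtime permitted in one period."""
    unavailable_fraction = 1.0 - availability_pct / 100.0
    return unavailable_fraction * PERIOD_DAYS[period] * SECONDS_PER_DAY


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.999, 99.9999):
        per_year_min = downtime_seconds(pct, "year") / 60
        per_month_min = downtime_seconds(pct, "month") / 60
        print(f"{pct}%: {per_year_min:.2f} min/year, {per_month_min:.2f} min/month")
```

Changing `PERIOD_DAYS` (for example to an average month of about 30.4 days) shifts the monthly figures slightly, which is worth keeping in mind when comparing against a provider's own SLA definition.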

Let’s say your system is responsible for a gross revenue of 12 billion USD a year and the revenue processing is equally distributed over time (a safer assumption for a global company; if it is not, change the model to account for peak-hour usage). If the system uptime is five 9s (99.999%), should you spend the money to go to six 9s (99.9999%) availability? The answer is no.

Revenue/Year is 12B: 12,000,000,000.00.

Revenue/Second is 380.27

Lost revenue at five 9s availability is 120,000 USD a year, and lost revenue at six 9s is 12,000 USD a year. The delta is 108,000 USD, which is less than the loaded cost of one FTE. So there is no reason to go to six 9s at that revenue number if it requires more than one FTE worth of work. Full model is here.
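The same comparison as a small Python sketch. This is not the linked model, just the arithmetic from the paragraph above. The function names are mine, and the 150,000 USD loaded FTE cost is a hypothetical placeholder; plug in your own number.

```python
# Sketch: is the next nine worth paying for?
# Assumes revenue is spread evenly over time, as stated in the post.

def lost_revenue(annual_revenue: float, availability_pct: float) -> float:
    """Revenue lost per year to downtime at a given availability."""
    return annual_revenue * (1.0 - availability_pct / 100.0)


def worth_upgrading(annual_revenue: float, current_pct: float,
                    target_pct: float, annual_upgrade_cost: float) -> bool:
    """True if the revenue recovered exceeds the yearly cost of the upgrade."""
    recovered = (lost_revenue(annual_revenue, current_pct)
                 - lost_revenue(annual_revenue, target_pct))
    return recovered > annual_upgrade_cost


if __name__ == "__main__":
    revenue = 12_000_000_000.0      # 12B USD a year, from the post
    loaded_fte_cost = 150_000.0     # hypothetical placeholder, not from the post
    delta = lost_revenue(revenue, 99.999) - lost_revenue(revenue, 99.9999)
    print(f"Revenue recovered by the sixth nine: {delta:,.0f} USD")
    print("Worth it:", worth_upgrading(revenue, 99.999, 99.9999, loaded_fte_cost))
```

If revenue is concentrated in peak hours, weight the downtime by the revenue rate in those hours instead of using a flat fraction.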

Sean M. Doran (smd) had some interesting insights as well. I’ve added them here with a bit of editing.

It’s an actuarial issue, and you can further trade off an SLA arrangement with business interruption insurance from a third party (which might be cheaper!).

We (FSVO “we”) were exploring an SLA that really was structured as an insurance policy with different charges for different payouts if the SLA weren’t met — cheap if you just want to add “free” days to the duration of the contract, less cheap if you wanted to have money knocked off the next invoice, and insurance market rates for cash payouts in the event of interruption beyond that in the SLA, right up to payouts for maintenance work not scheduled in advance of agreement on the policy. Unfortunately, as has happened from time to time, there was a change of direction elsewhere in the organization, so this never advanced beyond a couple of experimental deals.

A couple of “be evil” observations:

Large buyers are often not really good at doing exactly the sort of analysis you wrote above. You can charge them more as a result, as you hint.

Actuarials on the supplier/insurer side now know the cost of any given outage and can feed that into decisions on how to allocate capital vs operational investment, how free you are to make changes that risk service interruptions, and so forth.

Plus it gives you another pricing knob you can turn fairly dynamically. Unbundling and pricing dynamics is often good for both parties, although many large buyers still prefer costs that are fixed in advance, even if that turns out to be more expensive. Great. Charge them more.

Two “be stupid evil” observations:

— Telcos rarely get their invoices right in the first place; adding knobs meets with resistance as a result

— Telcos far far far prefer people to make claims and throw up barriers to making one successfully. So do many insurers.

Finally: statistical service offerings are great, if you have sales channels that can cope with them. I wanted it out of their hands, with some wording in the contract that allowed for a device like a customer web page with sliders that allowed them to dynamically adjust bandwidth caps, statistical delay and statistical drop parameters with a couple of presets along the lines of “[ ] UUNET quality [ ] AUCS quality”. The idea was to offer just slightly better than the competition’s SLAs and measured/reported performance at a similar price, but our much better performance for our almost always higher prices.

Core Business

September 16, 2010

We see continuous growth in managed services and we are confident that we can help Vodafone free up resources to focus even more on their core business and innovation

Can someone explain what Vodafone’s core business is?

cloud economics

August 9, 2010

Based on a discussion with some friends, I decided to do a very simple model pitting Amazon Web Services (AWS) against colocation in commercial space with owned gear. This model makes a few simplifying assumptions, including that managing AWS is on the same order of magnitude of effort as managing your own gear. As someone put it:

You’d be surprised how much time and effort I’ve seen expended project-managing one’s cloud/hosting provider – it is not that different from the effort required for cooking in-house automation and deployment. It’s not like people are physically installing the OS and app stack off CD-ROM anymore, I’d imagine whether you’re automating AMIs/VMDKs or PXE it’s a similar effort.

The results were not surprising to anyone familiar with the term ‘duty cycle.’ Think of it as taking a taxi vs. buying a car to make a trip between San Francisco and Palo Alto. If you only make the trip once a quarter, it is cheaper to take a taxi. If you make the trip every day, then you are better off buying a car. The difference is the duty cycle. If you are running infrastructure with a duty cycle of 100%, it may make sense to run in-house. The model that I used for the evaluation is here.

Note that the pricing is skewed to the very high end for colocation, so the assumptions there are conservative. Levers are in yellow. Comments are welcomed.

I’d like to thank Adam, Dave and Randy for helping me make the model better.

Edit: Some folks are asking for graphs. I thought about adding sensitivity analysis to the model but that would be missing the point. This model presents an analytical framework which you are free to copy and then sharpen up with your own business model and cost structure. Running sensitivity analysis on that will be much more interesting. I also added an NPV calculation for some people who asked for it.
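The duty-cycle argument can be sketched in a few lines of Python. This is not the spreadsheet model linked above. Every price here is a made-up placeholder, the function names are mine, and operations effort is left out because the post assumes it is about the same on both sides.

```python
# Sketch: pay-per-hour cloud vs. owned gear in colocation.
# All prices are hypothetical placeholders; replace them with your own.
HOURS_PER_MONTH = 730  # assumed average month


def cloud_monthly_cost(hourly_rate: float, duty_cycle: float) -> float:
    """Cost of one rented server that runs only duty_cycle (0..1) of the time."""
    return hourly_rate * HOURS_PER_MONTH * duty_cycle


def owned_monthly_cost(capex: float, amortization_months: int,
                       colo_per_month: float) -> float:
    """Cost of one owned server; it is paid for whether it is busy or idle."""
    return capex / amortization_months + colo_per_month


def break_even_duty_cycle(hourly_rate: float, capex: float,
                          amortization_months: int, colo_per_month: float) -> float:
    """Duty cycle above which owning is cheaper than renting."""
    owned = owned_monthly_cost(capex, amortization_months, colo_per_month)
    return owned / (hourly_rate * HOURS_PER_MONTH)


if __name__ == "__main__":
    # Hypothetical: 0.50/hour rented; 3,600 capex over 36 months plus 119/month colo.
    threshold = break_even_duty_cycle(0.50, 3600.0, 36, 119.0)
    print(f"Owning wins above a duty cycle of {threshold:.0%}")
```

A break-even above 100% means the taxi always wins at those prices; one near zero means you should have bought the car long ago.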

Settlement Free Interconnect and Ratios

May 17, 2010

People seem to get particularly upset about ratios in SFI requirements. The argument almost always degenerates into bit/miles and “we’ll meet you at your specified points and we’ll cold-potato.” All of these miss the salient point: the SF part of SFI is not based on bit/miles and meeting points. People always argue based on cost, when it is really based on value. In the end, ratios are just one data point in the equation.

Risk Assessment

December 19, 2009

A provider had an outage today. Nothing new. Outages happen.  What is surprising is the blamestorming from the companies that depend on the provider. Folks, if your business is down and you can’t survive because your infrastructure provider has had a problem, this is your fault.  There is a cost to redundancy, and if the cost of redundancy is greater than the expected impact of the outages for any period of time, then you don’t make your system redundant. You also lose the right to complain about the impact on your business. To make it clear, because people seem to have a hard time understanding this, I’ve built a simple model that can be used to evaluate the various scenarios.

I hope this helps people make an informed risk/benefit trade-off. The reduced noise on Twitter will be a useful side benefit.
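For readers who want the decision rule in code, here is a minimal Python sketch. It is not the model referred to above, only the comparison it describes: the cost of redundancy against the expected impact of outages over the same period. All names and figures are hypothetical.

```python
# Sketch: does redundancy pay for itself?
# All inputs are hypothetical; estimate them from your own outage history.

def expected_outage_loss(outages_per_year: float, hours_per_outage: float,
                         loss_per_hour: float) -> float:
    """Expected yearly business impact of provider outages."""
    return outages_per_year * hours_per_outage * loss_per_hour


def redundancy_pays_off(redundancy_cost_per_year: float,
                        expected_loss_per_year: float,
                        residual_fraction: float = 0.0) -> bool:
    """True if the loss avoided exceeds what the redundancy costs.

    residual_fraction is the share of the loss that redundancy does not
    remove (0.0 assumes it removes all of it).
    """
    avoided = expected_loss_per_year * (1.0 - residual_fraction)
    return avoided > redundancy_cost_per_year


if __name__ == "__main__":
    # Hypothetical: two 4-hour outages a year, each hour costing 10,000.
    loss = expected_outage_loss(2, 4, 10_000.0)
    print(f"Expected yearly loss: {loss:,.0f}")
    print("Build redundancy at 200,000/year:", redundancy_pays_off(200_000.0, loss))
    print("Build redundancy at 50,000/year:", redundancy_pays_off(50_000.0, loss))
```

If the answer is False and you skip the redundancy, you have accepted the outage as a cost of doing business.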

For folks who really want to dig into these arguments, please see Ben Black on SLAs here and here.

The Rise of Now

November 21, 2009

The ATLAS Internet Observatory report has some excellent observations, some of which are quoted below:

  • As category, CDNs represent close to 10% of Internet traffic
  • Web (and video over HTTP) largest and faster growing
  • Followed by P2P (which is also fastest shrinking)
  • P2P increasingly eclipsed by streaming

The salient points are that streaming is the fastest growing traffic by volume, CDNs now represent about 10% of all Internet traffic, and that P2P is declining. Some of the decline of P2P can be attributed to the fact that P2P clients are now better at masquerading, encryption and localization, but the overall trend as measured by DPI techniques is still downwards. P2P has been declining while CDN and streaming media traffic, most of which is micropayment driven – either directly with money like Apple’s iTunes Store, or with attention like Pandora and YouTube – has been rising. I think it makes it clear that “now” is more important than “free.”

Trying to set up P2P is a non-trivial exercise involving firewall and home gateway modifications for port forwarding, tracker location, content hunting and so on. Centralized media delivery, on the other hand, is merely a click away – and if I have to pay 99 cents for a song, that doesn’t seem all that expensive.

In his thought-provoking presentation at Infocom 2009, Dr. Andrew Odlyzko makes a key point: The function of data networks is to satisfy human impatience, and human impatience is infinite. Andrew also points out some classic telecom dogmas (slide 6):

  • Carriers can develop innovative new services
  • Content is king
  • Voice is passe
  • Streaming real-time multimedia traffic will dominate
  • There is an urgent need for new “killer apps”

I’ve been beating that drum for a while, especially the fact that carriers can’t develop innovative new services – they don’t have the DNA to do so (see my posts here and here) – and that content is not king – connectivity is the real value. The great success story of the past few years in the telecom space has been wireless and, again, Andrew points out that the telecoms learned the wrong lessons. Paraphrasing Andrew: “The prevailing industry view is that profits resulted from tight control of wireless, while losses resulted from the wild and uncontrolled Internet; in reality, success came from providing mobility for voice and simple text messaging.” I believe success was a direct outgrowth of convenience and “now.” I can get much cheaper rates on a per-minute basis using a land line, but having the ability to connect to people where and when I want is the true value of mobile. I am, once again, trading money for convenience. Micropayments for convenient minutes, to put it another way.

Ideally, things would be available now and for free, but given the trade-off space, I am willing to trade some money for immediacy. Therein lies a lesson.