Cloud Services and outcome-blind decision making

June 28, 2014

Whenever a cloud outage occurs [1], my social media stream is filled with people questioning the concept of cloud services in general, the competency of a company to run and operate the service, and anecdotes about the on-premise setups that haven’t taken a hit in years. This perfectly illustrates how bad humans are, in aggregate, at assessing risk. This post doesn’t cover cloud failure modes exhaustively; rather, it is targeted at risk, and at why directly comparing a single IT shop to global service providers is not a statistically valid comparison.

Any framework for assessing cloud risk has to account for the following factors [2]:

  • People exaggerate spectacular but rare risks and downplay common risks
  • People underestimate risks they willingly take and overestimate risks in situations they can’t control
  • People overestimate risks that are being talked about and remain an object of public scrutiny

A cloud service fits all of the above categories.

Now, let’s play a game:

I roll two (2) fair, six-sided dice and sum the numbers on the top faces; the result will range from a minimum of 2 to a maximum of 12. The game continues for 100 rolls. Before each roll, you pick a number. If that number comes up, I pay you 1 dollar. If that number doesn’t come up, you get nothing. What strategy should you follow to maximize the amount of money you win?

Try to think about this for some time. If you have some dice, try a couple of rolls before proceeding further.

The number you want to pick is exactly the same for all 100 rolls: Seven.

For each round, you cannot predict with certainty what any particular roll of the dice will produce. However, regardless of what number comes up on any particular roll, you should bet on 7. This is an outcome-blind decision, because statistically, with two fair dice, the highest-probability sum of the top two faces is seven [3]. A different number coming up on a given roll doesn’t invalidate the decision to bet on 7. In other words, separate decisions from outcomes. All you need to know is that over 100 rolls, statistically, 7 will show up most often, and therefore, to maximize earnings, bet on 7.
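The claim is easy to check numerically. A minimal sketch (function and variable names are mine, not from the post): enumerate the 36 equally likely outcomes to confirm that 7 is the most probable sum, then simulate the game to see the payoff it implies.

```python
import random
from collections import Counter

def simulate(rolls=100_000, bet=7, seed=1):
    """Roll two fair dice `rolls` times; win $1 whenever the sum equals `bet`."""
    rng = random.Random(seed)
    winnings = 0
    for _ in range(rolls):
        total = rng.randint(1, 6) + rng.randint(1, 6)
        if total == bet:
            winnings += 1
    return winnings

# Exact distribution: 6 of the 36 equally likely outcomes sum to 7.
counts = Counter(a + b for a in range(1, 7) for b in range(1, 7))
best = max(counts, key=counts.get)
print(best, counts[best] / 36)  # 7 is the most likely sum, p = 6/36
```

Betting 7 every round pays roughly 1/6 of the rolls; any other number pays strictly less in expectation.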

Cloud outages dominate the news cycle, but to conduct a fair experiment, the cloud services must be compared to on-premise installs per account/minutes of availability in aggregate. Given the amount of talent and engineering effort required to run any cloud service at scale, the probability is high that for aggregate account/minutes of availability, cloud services are significantly more available than aggregate on-premise installations. If you are making outcome-blind decisions, they should favor cloud.

Whenever people mention a particular outage and compare it to some in-house implementation that hasn’t had an outage in years, point them to a good book on Poker and send them here.

Edit: people have pointed out that there is a lack of good aggregate data for on-prem. What data there are, are self-reported and noisy. A good proxy is the amount of data loss reported by the big storage systems in the cloud – of which there hasn’t been any so far from the big providers. Taking KiB lost per month as a durability metric, data loss at smaller providers is a proxy for general system hygiene and competence [4].

[1] Google, Microsoft, Salesforce, Amazon

[2] Bruce Schneier

[3] Two dice distribution

[4] Data loss report


Cloud Computing and Shorting

February 27, 2011

Reed Hastings, the CEO of Netflix, is one of the smartest folks around, in my book. His article on why Tilson should cover his Netflix short position strongly reinforces that belief. The entire article is a great lesson in how to think clearly about business, but here I want to focus on the excerpt relevant to cloud computing, quoted below:

We will be working to improve the FCF conversion trend in 2011. On a long term basis, FCF should track net income reasonably closely, as it has in the past, with stock options as an offset against small buildups in PPE and prepaid content. Nearly all of our computing is through Amazon (AMZN) Web Services and CDNs, which are pure opex. [emphasis mine]

The key part is bolded above. Nearly all of Netflix computing is on-demand based, which is pure opex. Is it more expensive than building it in-house, per unit of compute? Almost certainly. However, as Reed mentions in the paragraph above, he is pushing to improve control over Free Cash Flow (FCF) and bring it in on a quarter-by-quarter basis. Not having large capital costs is key to that. He specifically calls out that “Management at Netflix largely controls margins, but not growth.”

With minimal capital costs acting as drag, and Netflix computing almost entirely opex based, managing FCF on a quarter-by-quarter basis is a lot more feasible, with the attendant ability to fine-tune margins.
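To see why opex-based compute makes quarter-by-quarter FCF management easier, here is a toy cash-flow sketch with entirely hypothetical numbers (the post quotes no compute-cost figures): the same three years of compute, bought up front as capex versus rented as opex.

```python
# Hypothetical: $480k/year of compute, bought as capex (3-year
# straight-line depreciation) vs. rented as pure opex.
ANNUAL_COMPUTE_COST = 480_000   # dollars/year, illustrative only
DEPRECIATION_YEARS = 3

def quarterly_fcf_capex():
    """Buy 3 years of capacity up front; all the cash leaves in quarter 0."""
    capex = ANNUAL_COMPUTE_COST * DEPRECIATION_YEARS
    quarters = DEPRECIATION_YEARS * 4
    return [-capex if q == 0 else 0 for q in range(quarters)]

def quarterly_fcf_opex():
    """Rent the same capacity; cash leaves evenly, every quarter."""
    quarters = DEPRECIATION_YEARS * 4
    return [-ANNUAL_COMPUTE_COST / 4] * quarters

# Total cash out is identical; only the timing differs. Opex spend is
# smooth and adjustable quarter by quarter, which is what makes
# near-term FCF controllable.
print(sum(quarterly_fcf_capex()), sum(quarterly_fcf_opex()))
```

Under accrual accounting, net income is the same in both cases (depreciation spreads the capex); it is the cash-flow line that diverges, which is exactly the FCF-vs-net-income gap Hastings is addressing.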

Cloud computing is already here – it’s just unevenly distributed. Reed Hastings is ahead of most.

cloud economics

August 9, 2010

Based on a discussion with some friends, I decided to do a very simple model pitting Amazon Web Services (AWS) against colocation in commercial space with owned gear. This model makes a few simplifying assumptions, including that managing AWS takes the same order of magnitude of effort as managing your own gear. As someone put it:

You’d be surprised how much time and effort I’ve seen expended project-managing one’s cloud/hosting provider – it is not that different from the effort required for cooking up in-house automation and deployment. It’s not like people are physically installing the OS and app stack off CD-ROM anymore, I’d imagine whether you’re automating AMIs/VMDKs or PXE it’s a similar effort.
The results were not surprising to anyone familiar with the term ‘duty cycle.’ Think of it as taking a taxi vs. buying a car to make a trip between San Francisco and Palo Alto. If you only make the trip once a quarter, it is cheaper to take a taxi. If you make the trip every day, you are better off buying a car. The difference is the duty cycle. If you are running infrastructure with a duty cycle of 100%, it may make sense to run in-house. The model that I used for the evaluation is here.

Note that the pricing is skewed to the very high end for colocation, so the assumptions there are conservative. Levers are in yellow. Comments are welcome.

I’d like to thank Adam, Dave and Randy for helping me make the model better.

Edit: Some folks are asking for graphs. I thought about adding sensitivity analysis to the model, but that would be missing the point. This model presents an analytical framework which you are free to copy and then sharpen up with your own business model and cost structure. Running sensitivity analysis on that will be much more interesting. I’ve added an NPV calculation for those who asked for it.
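The taxi-vs-car analogy reduces to a one-line break-even calculation. A sketch with illustrative prices (these are not the numbers from the linked model):

```python
def breakeven_duty_cycle(on_demand_per_hour, owned_monthly_cost):
    """Duty cycle above which owning beats renting.

    Renting costs money only for hours actually used; owned gear (colo
    space, power, amortized hardware) costs the same every month
    regardless of utilization.
    """
    hours_per_month = 730  # average hours in a month
    return owned_monthly_cost / (on_demand_per_hour * hours_per_month)

# Illustrative numbers only: $0.50/hr on demand vs. $250/month all-in owned.
threshold = breakeven_duty_cycle(0.50, 250)
print(f"own the gear above {threshold:.0%} duty cycle")  # ~68%
```

Below the threshold, pay by the hour; above it, the fixed cost of ownership amortizes out cheaper – the same logic as the taxi vs. the car.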

Risk Assessment

December 19, 2009

A provider had an outage today. Nothing new. Outages happen.  What is surprising is the blamestorming from the companies that depend on the provider. Folks, if your business is down and you can’t survive because your infrastructure provider has had a problem, this is your fault.  There is a cost to redundancy, and if the cost of redundancy is greater than the expected impact of the outages for any period of time, then you don’t make your system redundant. You also lose the right to complain about the impact on your business. To make it clear, because people seem to have a hard time understanding this, I’ve built a simple model that can be used to evaluate the various scenarios.
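The trade-off described above is a plain expected-value comparison. A minimal sketch with hypothetical numbers (this is a stand-in for, not a copy of, the linked model):

```python
def favor_redundancy(annual_redundancy_cost,
                     expected_outages_per_year,
                     avg_outage_hours,
                     revenue_loss_per_hour):
    """Crude expected-value test: pay for redundancy only when its cost
    is below the expected annual outage impact."""
    expected_loss = (expected_outages_per_year
                     * avg_outage_hours
                     * revenue_loss_per_hour)
    return annual_redundancy_cost < expected_loss

# Hypothetical shop: redundancy costs $50k/yr; two 4-hour outages a year
# at $2k/hr of lost revenue means an expected loss of $16k/yr. Skip the
# redundancy -- and skip the complaining when the outage hits.
print(favor_redundancy(50_000, 2, 4, 2_000))  # False
```

A real model would add risk aversion, reputational damage, and tail scenarios, but the decision structure is the same: compare the cost of redundancy to the expected impact, then live with the outcome of that decision.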

I hope this helps people make an informed risk/benefit trade-off. The reduced noise on Twitter will be a useful side benefit.

For folks who really want to dig into these arguments, please see Ben Black on SLAs here and here.

A quick observation on cloud economics

November 21, 2009

I just finished reading about a panel on cloud economics and the enterprise. One quote in particular stood out:

“I’m not sure there are any unit-cost advantages that are sustainable among large enterprises”

A few years ago some friends of mine had a startup publishing medical journals online. They started off by getting two fractional DS3 lines from MCI and Sprint to their office building. In the basement were a few racks of servers and storage arrays, and it was off to the races. Today, if someone came up with a plan of that nature, people would look at them funny and say “get a few racks from a colo provider.” In another few years, I think the phrase is going to change to “get the compute and storage in the cloud.” The cost argument assumes today’s practice on tomorrow’s infrastructure. Next-generation business logic jobsets are going to be written for cloud frameworks, services and primitives, which should be more aligned with cost structures that make cloud computing more efficient per unit cycle of compute or unit bit of storage.


Cloud Part 2

August 25, 2009

Joe Weinman wrote an article on cloud computing titled 10 Reasons Why Telcos Will Dominate Enterprise Cloud Computing. My response to that article is here. Today, I was tracerouting to The results were surprising, so I did some more digging around.

vgill$ dig

; <<>> DiG 9.4.3-P1 <<>>
;; global options:  printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 4018
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 0

;                   IN      A

;; ANSWER SECTION:
                254     IN      CNAME
                19856   IN      CNAME
                11      IN      A

;; Query time: 18 msec
;; WHEN: Mon Aug 24 22:36:30 2009
;; MSG SIZE  rcvd: 115

The CNAME chain points to a CDN service provided by Akamai!

Infrastructure is software

July 22, 2009

In an earlier post I mentioned that “cloud is software.” Thinking about it some more, I believe the statement can be generalized to “Infrastructure is software.” This is a bit different from how people have traditionally viewed it – Internet infrastructure is seen as pipes, disks, CPUs, data centers: the collection of items that form the physical units providing pipe, storage, and compute, and the buildings that house them. My thesis is that those are necessary but not sufficient to be considered infrastructure. Those elements, in and of themselves, are just so much sunk capital. To make efficient use of them you need the correct provisioning APIs, monitoring, billing, and software primitives that abstract away the underlying systems, allowing a decoupling between the various technological and business imperatives so that each layer can evolve independently based on its own technological scaling domain (within reason – if you are writing ultra-high-performance code, you will know the difference if you get instantiated on an Opteron vs. a Nehalem cluster).
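As a sketch of the thesis, here is what a minimal provisioning primitive might look like. The interface and all names are hypothetical – this is not any real provider’s API – but it shows the kind of software surface that turns raw CPUs, disks, and pipes into infrastructure.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class Instance:
    instance_id: str
    cpu_cores: int
    memory_gb: int

class ComputeProvider(ABC):
    """Hypothetical provisioning surface: the software layer that decouples
    consumers of compute from the physical plant underneath."""

    @abstractmethod
    def provision(self, cpu_cores: int, memory_gb: int) -> Instance: ...

    @abstractmethod
    def deprovision(self, instance_id: str) -> None: ...

class InMemoryProvider(ComputeProvider):
    """Toy implementation so the interface is concrete; a real backend would
    drive hypervisors, IP allocation, monitoring, and billing."""

    def __init__(self) -> None:
        self._instances: dict[str, Instance] = {}
        self._counter = 0

    def provision(self, cpu_cores: int, memory_gb: int) -> Instance:
        self._counter += 1
        inst = Instance(f"i-{self._counter}", cpu_cores, memory_gb)
        self._instances[inst.instance_id] = inst
        return inst

    def deprovision(self, instance_id: str) -> None:
        del self._instances[instance_id]

provider = InMemoryProvider()
vm = provider.provision(cpu_cores=4, memory_gb=16)
print(vm.instance_id)  # i-1
```

The point of the abstraction is that the caller never learns whether the backend is an Opteron rack or a Nehalem cluster – each layer evolves on its own schedule.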

Let’s make this concrete and think about how the above can inform the building and operation of a global service provider that has a large network, with datacenters used for a cloud computing business – a large telecommunications company, for example, that wants to provide enterprise cloud computing among a suite of services.

Basic Axioms

All things come down to the fundamental problem of mapping demand onto a set of lower level constraints. For a telecom company, constraints at the lowest level consist of:

  1. Fiber topology (or path/Right of Ways)
  2. Forwarding capacity
  3. Power & Space
  4. Follow The Money (FTM)

Everything else is an abstraction of the above constraints. That is the good news. The bad news: everyone has the same constraints. There are no special routers available to you and not to others, and the speed of light is constant (modulo the fiber refractive index in your physical plant). So how do you differentiate yourself? Fortunately, the answers are also simple:

  • Latency
  • Cost (note I did not use price for a reason)
  • Open Networks
  • Rich connectivity


Latency has been well documented. Some excerpts from Velocity 2009:

Eric Schurman (Bing) and Jake Brutlag (Google Search) co-presented results from latency experiments conducted independently on each site. Bing found that a 2 second slowdown changed queries/user by -1.8% and revenue/user by -4.3%. Google Search found that a 400 millisecond delay resulted in a -0.59% change in searches/user. What’s more, even after the delay was removed, these users still had -0.21% fewer searches, indicating that a slower user experience affects long term behavior. (video, slides)

Phil Dixon, from Shopzilla, had the most takeaway statistics about the impact of performance on the bottom line. A year-long performance redesign resulted in a 5 second speed up (from ~7 seconds to ~2 seconds). This resulted in a 25% increase in page views, a 7-12% increase in revenue, and a 50% reduction in hardware. This last point shows the win-win of performance improvements, increasing revenue while driving down operating costs. (video, slides)

If you want to get into the cloud computing business, you will have to build your network and interconnection strategy to minimize latency. Your customers’ bottom line is at stake here, and by extension, so is your datacenter division’s P&L.


Sean Doran wrote “People that survive will be able to build a network at the lowest cost commensurate with their SLA.” He forgot to add: in a competitive market. Assuming you are going up against competition, this should be fairly self-evident: efficiency and razor-thin margins. The killer app is bandwidth, and this means people need to emulate Walmart™ – learn to survive on 10% or lower margins. At those margins, your OSS/NMS are competitive advantages. Every manual touch point in the business – every support call for a delayed order, every failure in provisioning, every salesperson who sells a service that can’t be provisioned properly – nibbles at the margin. Software that can provision the network and enable fast turn-up, proper accounting, and auditing is the key.

And we react with great caution to suggestions that our poor businesses can be restored to satisfactory profitability by major capital expenditures.  (The projections will be dazzling – the advocates will be sincere – but, in the end, major additional investment in a terrible industry usually is about as rewarding as struggling in quicksand.)
-Warren Buffett

Efficiency also means fewer operational issues. Couple an ever-increasing number of elements with an ever-growing mass of policy and you start to lose any semblance of troubleshooting and operational simplicity. Does the network pass the 3 AM on-call test? More policy means more forwarding complexity, and that means more cost that hits your bottom line. A more insidious effect of intelligent, complex networks is that they inhibit experimentation. The theory of Real Options points out that experimentation is valuable when market uncertainty is high. Therefore, designing an architecture that fosters experimentation at the edge creates potential for greater value than centralized administration, because distributed structures promote innovation and enable experimentation at low cost. This means that putting the intelligence in the applications, rather than the network, is a better use of capital – because otherwise, applications that don’t need that robustness will end up paying for it, and that will make experimentation expensive.

Open Networks

Open networks strike fear into the hearts of service providers everywhere. If you are in a commodity business, how do you differentiate yourself? How about providing service that works well, cheaply. But wait a minute! Whatever happened to “climb up the value chain”? The answer: nothing. You have to decide what business you are in. Moving up the value chain and providing ever higher-touch services is in direct conflict with providing low-cost bulk bandwidth. Pick businesses that require either massive horizontal scaling or deep vertical scaling. Picking both leaves you vulnerable to more narrowly focused competitors in each segment. If horizontal scaling is central to one business, trying to fit an orthogonal model in as a second core business will end up annoying everyone and serving no one well. However, if the software interface to the horizontal business is exposed to the vertical, high-touch side of the business, both can be decoupled and allowed to scale independently. This means things like provisioning, SLA reporting, billing, and usage reporting all exposed via software mechanisms.

Rich Connectivity

Let me start off by saying content is not king.

Gaming companies are making the same mistakes as the
content guys. They always over-estimate the importance of
the content and vastly underestimate the desire of users/people
to communicate with each other and share…
-Joi Ito

The Internet is a network of networks. The real value of a network is realized when it connects to other networks; more detail can be found in Metcalfe’s Law and Reed’s Law. Making interconnections with other networks harder than necessary will eventually result in isolation and a drive to irrelevance (in an open market). If people who are transiting your network to get to another network find that the interconnection between your network and their destination network is chronically congested or adds significant latency, the incentive to directly interconnect with the destination network, or to find another upstream, becomes stronger.
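The two laws can be stated as simple formulas: Metcalfe values a network by its number of possible pairwise connections, Reed by its number of possible sub-groups. A quick sketch:

```python
def metcalfe_value(n):
    """Metcalfe's Law: value grows with pairwise connections, n*(n-1)/2."""
    return n * (n - 1) // 2

def reed_value(n):
    """Reed's Law: value grows with possible sub-groups, 2**n - n - 1."""
    return 2**n - n - 1

# Interconnection compounds quickly: a network that isolates itself
# forgoes superlinear value under either law.
for n in (2, 10, 30):
    print(n, metcalfe_value(n), reed_value(n))
```

At 10 participants the gap is already large (45 pairwise links vs. 1,013 possible groups), which is the quantitative version of “rich connectivity wins.”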

It ain’t the metal, it ain’t the glass; it’s the wetware.
-Tony Li


Make the network database authoritative. This will allow for faster provisioning, consistency, and auditing. You can tell authoritatively whether two buildings across the country or the world are on-net and, more importantly, whether they can be connected together and in what timeframe. This is especially true if you have a few acquisitions with a mixture of assets. Just mashing together the list of buildings that are now on-net with the merged entity doesn’t actually tell you whether they can be connected together easily, or only through several different fiber runs, patch panels, and networks. If the provisioning systems were correct, the sales folks could tell prospective customers when services could be delivered, because they’d know whether connecting two buildings involved ordering cross-connects or doing a fiber build. We provision thousands of machines automatically; why treat thousands of routers differently? The systems that automatically provision and scale your network are hard to implement, but they can be built. It only requires the force of will to make it happen.
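With a database-authoritative view of the plant, “can these two buildings be connected, and over how many spans?” becomes a graph query instead of a spreadsheet hunt. A toy sketch, with invented building names and fiber segments:

```python
from collections import deque

# Hypothetical authoritative inventory: fiber segments between buildings
# and patch locations, recorded as graph edges.
FIBER_SEGMENTS = {
    ("SEA-1", "PDX-2"), ("PDX-2", "SFO-3"), ("SFO-3", "LAX-1"),
    ("NYC-1", "BOS-4"),  # an acquired network, not yet stitched in
}

def adjacency(segments):
    adj = {}
    for a, b in segments:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def span_count(adj, src, dst):
    """BFS over the plant: returns the number of fiber spans between src
    and dst, or None if they are not connectable (a fiber build is needed)."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, hops = queue.popleft()
        if node == dst:
            return hops
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return None

adj = adjacency(FIBER_SEGMENTS)
print(span_count(adj, "SEA-1", "LAX-1"))  # 3 spans: cross-connects suffice
print(span_count(adj, "SEA-1", "NYC-1"))  # None: fiber build required
```

This is exactly the answer a salesperson needs at quote time: cross-connects (fast) or a build (slow), straight from the authoritative database.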

All these things give a better quality of service to the end user and are a competitive advantage in reducing OPEX and SLA payouts due to configuration errors. You can further extend your systems to do things like automatic rollbacks if you make a change and something goes wrong.
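The automatic-rollback idea can be sketched in a few lines; the device interface here is hypothetical, standing in for whatever configuration API your gear exposes.

```python
def apply_with_rollback(device, new_config, health_check):
    """Push a config; if the post-change health check fails, restore the
    previous config automatically."""
    previous = device.get_config()
    device.set_config(new_config)
    if not health_check(device):
        device.set_config(previous)  # automatic rollback
        return False
    return True

class FakeRouter:
    """Stand-in for a real device with get_config/set_config."""
    def __init__(self, config):
        self._config = config
    def get_config(self):
        return self._config
    def set_config(self, config):
        self._config = config

router = FakeRouter("v1")
ok = apply_with_rollback(router, "v2-bad",
                         lambda d: d.get_config() != "v2-bad")
print(ok, router.get_config())  # False v1 -- the bad change was rolled back
```

A production version would snapshot state atomically and run real reachability and telemetry checks, but the shape is the same: every change carries its own undo.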

Software is the key, no matter what your business is, if it deals with the Internet – and this will be increasingly true going forward.