Cloud Services and outcome-blind decision making

June 28, 2014

Whenever a cloud outage occurs [1], my social media stream fills with people questioning the concept of cloud services in general and the competency of the company running the service, along with anecdotes about on-premise setups that haven’t taken a hit in years. This perfectly illustrates why humans in aggregate are bad at math and at assessing risk. This post doesn’t cover cloud failure modes exhaustively; rather, it focuses on risk and on why directly comparing a single IT shop to global service providers is not statistically valid.

Any framework for assessing cloud risk has to account for the following biases [2]:

  • People exaggerate spectacular but rare risks and downplay common risks
  • People underestimate risks they willingly take and overestimate risks in situations they can’t control
  • People overestimate risks that are being talked about and remain an object of public scrutiny

A cloud service fits into all of the above categories.

Now, let’s play a game:

I roll two (2) fair, six-sided dice and sum the numbers on the top two faces; the result will range from a minimum of 2 to a maximum of 12. The game continues for 100 rolls. Before each roll, you pick a number. If that number comes up, I pay you 1 dollar. If that number doesn’t come up, you get nothing. What strategy should you follow to maximize the amount of money won?

Try to think about this for some time. If you have some dice, try a couple of rolls before proceeding further.

The number you want to pick is exactly the same for all 100 rolls: Seven.

For each round, you cannot predict with certainty what any particular roll of the dice will produce. However, regardless of what number comes up on any particular roll, you should bet on 7. This is an outcome-blind decision: with two fair dice, the most probable sum of the top two faces is seven [3], since 6 of the 36 equally likely combinations add up to it (a probability of 1/6). A different number coming up on a given roll doesn’t invalidate the decision to bet on 7. In other words, separate decisions from outcomes. All you need to know is that over 100 rolls, statistically, 7 will show up more often than any other number, and therefore to maximize earnings, bet on 7.
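The claim about 7 can be checked by enumerating all 36 combinations. The following is a minimal Python sketch, not part of the original post; the function names `two_dice_distribution` and `expected_winnings` are made up for illustration, and the payout of 1 dollar per hit over 100 rolls comes from the game described above.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def two_dice_distribution(sides: int = 6) -> dict[int, Fraction]:
    """Exact probability of each sum when rolling two fair dice."""
    # Count how many of the sides*sides equally likely pairs give each sum.
    counts = Counter(a + b for a, b in product(range(1, sides + 1), repeat=2))
    total = sides * sides
    return {s: Fraction(c, total) for s, c in sorted(counts.items())}


def expected_winnings(bet: int, rolls: int = 100, payout: int = 1) -> Fraction:
    """Expected payout when betting on the same sum for every roll."""
    return rolls * payout * two_dice_distribution().get(bet, Fraction(0))


dist = two_dice_distribution()
best_bet = max(dist, key=dist.get)  # the sum with the highest probability

for total, prob in dist.items():
    print(f"sum {total:2d}: p = {prob}, expected over 100 rolls = "
          f"{float(expected_winnings(total)):.2f}")
print("best bet:", best_bet)
```

Betting on 7 wins one roll in six on average, so the expected take over 100 rolls is 100/6, or about 16.67 dollars; every other number does worse in expectation, whatever happens on any single roll.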

Cloud outages dominate the news cycle, but to conduct a fair experiment, the cloud services must be compared to on-premise installs per account/minutes of availability in aggregate. Given the amount of talent and engineering effort required to run any cloud service at scale, the probability is high that for aggregate account/minutes of availability, cloud services are significantly more available than aggregate on-premise installations. If you are making outcome-blind decisions, they should favor cloud.
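To make "account/minutes of availability in aggregate" concrete, here is a rough Python sketch of how such a metric could be computed. The `Deployment` class, the function name, and every number in it are invented purely for illustration; they are not measurements of any real provider or IT shop and prove nothing by themselves.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    accounts: int            # accounts served by this installation
    downtime_minutes: float  # total downtime in the observation period


def aggregate_availability(deployments, period_minutes):
    """Fraction of account-minutes during which service was available."""
    total = sum(d.accounts for d in deployments) * period_minutes
    lost = sum(d.accounts * d.downtime_minutes for d in deployments)
    return 1 - lost / total


YEAR = 365 * 24 * 60  # minutes in a (non-leap) year

# Hypothetical: one provider, one highly visible 4-hour outage for everyone.
cloud = [Deployment(accounts=1_000_000, downtime_minutes=240)]

# Hypothetical: 1,000 shops of 1,000 accounts each; 900 had no downtime at
# all, while 100 each lost 3 days that nobody outside the company heard about.
on_prem = ([Deployment(1_000, 0)] * 900) + ([Deployment(1_000, 3 * 24 * 60)] * 100)

cloud_availability = aggregate_availability(cloud, YEAR)
on_prem_availability = aggregate_availability(on_prem, YEAR)
print(f"cloud:   {cloud_availability:.5%}")
print(f"on-prem: {on_prem_availability:.5%}")
```

With these made-up inputs, 90% of the on-premise shops can truthfully say they had no outage all year, yet the aggregate figure is lower than for the provider whose single outage made the news. The point of the sketch is the unit of comparison, not the specific values.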

Whenever people mention a particular outage and compare it to some in-house implementation that hasn’t had an outage in years, point them to a good book on Poker and send them here.

Edit: people have pointed out that there is a lack of good aggregate data for on-prem. What data there are, are self-reported and noisy. A good proxy is the amount of data loss reported by the big storage systems in cloud, of which there hasn’t been any so far by the big providers. Taking KiB/month as a durability metric vs. data loss by smaller providers is a proxy for general system hygiene and competence [4].

[1] Google, Microsoft, Salesforce, Amazon

[2] Bruce Schneier

[3] Two dice distribution

[4] Data loss report

Startups are hard in big companies and Survivorship Bias

June 21, 2014

Hunter Walk and Steven Sinofsky were articulating reasons why startups are hard inside established companies. They are by and large correct, but I think they need to take survivorship bias into account.

What people remember are the ones that made it big, not the ones that died silently. For every WhatsApp, there are many BelugaPods. On the outside, you have 3-4 startups doing similar things, from which you can select the winner. Internally, there is often only one effort. We still can’t articulate with any degree of accuracy which factors make success happen, so if you are taking 4-5x more shots on goal on the outside, you’ll end up with more successful outcomes, simply because you are taking more shots. Buying a successful “disruptive” startup is discovery.
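The shots-on-goal argument is simple arithmetic. Here is a small Python sketch, assuming (as a simplification not stated in the post) that every attempt succeeds independently with the same probability; the 10% success rate is a made-up illustrative value, not an estimate.

```python
def p_at_least_one_success(p: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent tries succeeds."""
    return 1 - (1 - p) ** attempts


def expected_successes(p: float, attempts: int) -> float:
    """Average number of successes across `attempts` independent tries."""
    return p * attempts


P_SUCCESS = 0.10  # hypothetical per-attempt success rate

for attempts in (1, 4, 5):
    print(f"{attempts} attempt(s): "
          f"P(at least one win) = {p_at_least_one_success(P_SUCCESS, attempts):.4f}, "
          f"expected wins = {expected_successes(P_SUCCESS, attempts):.2f}")
```

Under these assumptions, a single internal effort has a 10% chance of producing a winner, while four outside attempts together have about a 34% chance, with no difference in skill between any of the teams.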

Conversation for more details is linked below.


April 8, 2014

Any effective organization consists of organizational design (how the organization is shaped), the organizational talent, and the operating culture. All three combine to deliver organizational capability (the value the organization delivers to the company). In other words, the culture is the operating system of the organization that allows the people to deliver value. Culture sets the expectations, behaviors, operating norms and contracts that allow people to act with efficacy and role clarity, and to refer to a common set of principles that allow for effective delegated decision making and accountability. Setting context and then allowing freedom to execute has proven effective in my experience, given the right organizational design and talent embedded in appropriate cultural norms.


Winner take all

October 14, 2012

“That’s how winner-take-all works. You don’t get 100x better results because you were 100x better. You get them because you were 1% better and there was no prize for second place.” -CJ V.


June 3, 2011

Once again, I feel compelled to respond to an article in El Reg (and one here).

Leaving aside the inaccuracies in the pieces, I feel that such articles are toxic. The simplistic approach sensationalizes a few personalities, and while it may make for good copy and get pageviews, the actual folks who did the work get no credit, and that is wrong. With such a talented pool of folks at Google, enabled by a culture of excellence that starts from the top, anyone could have done a great job, but no one mentions that. The work is about the team, the culture of data-driven decisions, fair and firm debate, and a refusal to compromise. Credit should go to those who deserve it: the incredible teams at the company. My own part, if any, was managing to get out of the way of the folks doing the work.

Cloud Computing and Shorting

February 27, 2011

Reed Hastings, the CEO of Netflix, is one of the smartest folks around in my book. His article on why Tilson should cover his Netflix short position strongly reinforces that belief. The entire article is a great lesson on how to think clearly about business, but here I want to focus on the excerpt relevant to cloud computing, quoted below:

We will be working to improve the FCF conversion trend in 2011. On a long term basis, FCF should track net income reasonably closely, as it has in the past, with stock options as an offset against small buildups in PPE and prepaid content. Nearly all of our computing is through Amazon (AMZN) Web Services and CDNs, which are pure opex. [emphasis mine]

The key part is the final sentence of the excerpt. Nearly all of Netflix’s computing is on-demand, which is pure opex. Is it more expensive than building it in-house per unit of compute? Almost certainly. However, as Reed mentions in the paragraph above, he is pushing to improve control over Free Cash Flow (FCF) and bring it in on a quarter-by-quarter basis. Not having large capital costs is key to that. He specifically calls out that “Management at Netflix largely controls margins, but not growth.”

With minimal capital costs acting as a drag and Netflix’s computing almost entirely opex-based, moving FCF management into the quarter-by-quarter range is a lot more feasible, with the attendant ability to fine-tune margins.

Cloud computing is already here – it’s just unevenly distributed. Reed Hastings is ahead of most.

Some papers on datacenter computing

November 19, 2010

Some papers on the growing trend of warehouse-scale computing, the Internet transformation driven by datacenter applications, and the opportunities and challenges for fiber optic communication technologies to support their growth in the next three to four years.

