Cloud Services and outcome-blind decision making

Whenever a cloud outage occurs [1], my social media stream fills with people questioning the concept of cloud services in general, doubting the competency of the company that runs the service, and sharing anecdotes about on-premises setups that haven’t taken a hit in years. This perfectly illustrates why humans, in aggregate, are bad at assessing risk and at math. This post doesn’t cover cloud failure modes exhaustively; rather, it is about risk, and about why directly comparing a single IT shop to global service providers is not a statistically valid comparison.

Any thought process for assessing cloud risk has to account for the following biases [2]:

  • People exaggerate spectacular but rare risks and downplay common risks
  • People underestimate risks they willingly take and overestimate risks in situations they can’t control
  • People overestimate risks that are being talked about and remain an object of public scrutiny

A cloud outage fits all three categories: it is spectacular but rare, it is outside the customer’s control, and it is heavily publicized.

Now, let’s play a game:

I roll two (2) fair, six-sided dice and sum the numbers on the top two faces; the result ranges from a minimum of 2 to a maximum of 12. The game continues for 100 rolls. Before each roll, you pick a number. If that number comes up, I pay you 1 dollar. If that number doesn’t come up, you get nothing. What strategy should you follow to maximize the amount of money won?

Try to think about this for some time. If you have some dice, try a couple of rolls before proceeding further.

The number you want to pick is exactly the same for all 100 rolls: Seven.

For each round, you cannot predict with certainty what any particular roll of the dice will produce. However, regardless of what number comes up on any particular roll, you should bet on 7. This is an outcome-blind decision: with two fair dice, the most probable sum of the top two faces is seven (6 of the 36 equally likely outcomes, or 1 in 6) [3]. A different number coming up on a given roll doesn’t invalidate the decision to bet on 7. In other words, separate decisions from outcomes. All you need to know is that over 100 rolls, 7 is expected to show up more often than any other number, and therefore to maximize earnings, bet on 7.
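If you don’t have dice handy, the argument can be checked with a few lines of Python. This is a minimal sketch, not part of the original game: the function names (two_dice_distribution, play) are my own, and the simulation uses Python’s pseudo-random generator as a stand-in for fair dice.

```python
from collections import Counter
from fractions import Fraction
from itertools import product
import random


def two_dice_distribution():
    """Exact probability of each sum (2..12) for two fair six-sided dice."""
    # Enumerate all 36 equally likely (die1, die2) outcomes and count the sums.
    counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
    return {total: Fraction(n, 36) for total, n in sorted(counts.items())}


def play(pick, rolls=100, seed=None):
    """Simulate the game: win 1 dollar each time the sum equals `pick`."""
    rng = random.Random(seed)
    return sum(
        1 for _ in range(rolls)
        if rng.randint(1, 6) + rng.randint(1, 6) == pick
    )


dist = two_dice_distribution()
best_pick = max(dist, key=dist.get)  # the sum with the highest probability

# Expected winnings in dollars over 100 rolls if you always bet on one number.
expected_winnings = {total: float(p * 100) for total, p in dist.items()}

if __name__ == "__main__":
    for total, p in dist.items():
        print(f"{total:2d}: {p} -> expected ${expected_winnings[total]:.2f} per 100 rolls")
    print("best pick:", best_pick)
    print("one simulated game betting on 7:", play(7))
```

A single simulated game can easily pay out less for 7 than for, say, 6 or 8. That is the point: the outcome of one run doesn’t change which bet was the right one.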

Cloud outages dominate the news cycle, but to make a fair comparison, cloud services must be compared to on-premises installs in aggregate, measured in account-minutes of availability. Given the amount of talent and engineering effort required to run any cloud service at scale, the probability is high that, measured in aggregate account-minutes of availability, cloud services are significantly more available than on-premises installations. If you are making outcome-blind decisions, they should favor cloud.
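To make “aggregate account-minutes” concrete, here is a rough sketch of how the metric could be computed. Everything in it is an assumption for illustration: the Deployment type, the aggregate_availability function, and above all the numbers, which are made up and prove nothing about real providers or real IT shops.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    accounts: int            # number of accounts served by this install
    downtime_minutes: float  # downtime each account saw during the period


def aggregate_availability(deployments, period_minutes):
    """Fraction of all account-minutes in the period that were available."""
    total = sum(d.accounts * period_minutes for d in deployments)
    down = sum(d.accounts * d.downtime_minutes for d in deployments)
    return 1 - down / total


PERIOD = 30 * 24 * 60  # a 30-day month, in minutes

# HYPOTHETICAL numbers, chosen only to show the arithmetic.
# One cloud service: a single headline-making 60-minute outage hits everyone.
cloud = [Deployment(accounts=1_000_000, downtime_minutes=60)]

# 1,000 on-prem shops with 1,000 accounts each: 900 had no downtime at all,
# 100 each lost a full day (1,440 minutes).
on_prem = (
    [Deployment(accounts=1_000, downtime_minutes=0) for _ in range(900)]
    + [Deployment(accounts=1_000, downtime_minutes=1_440) for _ in range(100)]
)

cloud_availability = aggregate_availability(cloud, PERIOD)
on_prem_availability = aggregate_availability(on_prem, PERIOD)

if __name__ == "__main__":
    print(f"cloud:   {cloud_availability:.5f}")
    print(f"on-prem: {on_prem_availability:.5f}")
```

With these invented inputs, 90% of the on-prem shops can truthfully say they had zero downtime, yet the on-prem aggregate comes out lower than the cloud service that made the news. Real numbers may look different; the sketch only shows why an anecdote about one shop says little about the aggregate.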

Whenever people mention a particular outage and compare it to some in-house implementation that hasn’t had an outage in years, point them to a good book on Poker and send them here.

Edit: people have pointed out that there is a lack of good aggregate data for on-prem. What data there is, is self-reported and noisy. A good proxy is the amount of data loss reported by the big storage systems in cloud, of which there hasn’t been any so far by the big providers. Comparing data lost in KiB/month, as a durability metric, against the data loss reported by smaller providers gives a proxy for general system hygiene and competence [4].

[1] Google, Microsoft, Salesforce, Amazon

[2] Bruce Schneier

[3] Two dice distribution

[4] Data loss report

2 Responses to Cloud Services and outcome-blind decision making

  1. thegameiam says:

    One of the common components in cloud environments I’ve seen is that it can become easy to add multiple layers of indirection over the same underlying service, introducing novel failure modes that don’t exist at all in the standalone self-run environment. (Cf Meyer, NANOG 61)

    To use your die roll analogy: if you always bet on 7 when rolling two dice, you have a 6/36 = 1/6 probability of winning, and thus a 5/6 probability of losing over time. It isn’t unreasonable for a designer to attempt to control for 35/36 (everything but snake-eyes), now is it? That would still be a less-than-two-nines network. If you’re in a position where your option is to only bet on one number, the right answer is *not to bet at all*. (Poker is a game of skill which includes cards, and a lot of the skill is knowing when not to bet).

    My retort is thus: those “spectacular but rare” events are turning out to be less rare than initially thought- perhaps due to the nature of the services, or perhaps due to something else entirely. You mention the aggregate talent and capacity, which is certainly the case: there is tremendous engineering talent in the cloud space.

    However. The offerings and designs are necessarily complex, often to a mindbogglingly-high degree, and it is a completely reasonable question to ask whether the simpler in-house network would actually provide higher availability than the more complex outsourced cloud one. I’d say “it depends.”

  2. dancres says:

    I think the questioners are typically guilty of some other sins as well:

    (1) Comparison of apples and oranges – most on premise setups have neither the load nor the scale of cloud setups. Comparing the performance of one against the other under such circumstances is pointless.

    (2) Absence of information – most cloud providers are silent on just exactly what their infrastructure looks like. Comparing what one does know about (possibly) versus that which one has no information about is foolish at best.

    (3) Absence of historical data/rose-tinted specs – Few organisations keep accurate records of their downtime and/or other past failures. In various cases, that which is deemed downtime is deliberately limited to obscure/rare cases as the result of organisational social pressure. Further, most of us do not have perfect recall and often forget relevant events.
