“That’s how winner-take-all works. You don’t get 100x better results because you were 100x better. You get them because you were 1% better and there was no prize for second place.” -CJ V.
Leaving aside the inaccuracies in the pieces, I feel that such articles are toxic. The simplistic approach sensationalizes a few personalities, and while it may make for good copy and get pageviews, the actual folks who did the work get no credit, and that is wrong. With such a talented pool of folks at Google, enabled by a culture of excellence that starts from the top, anyone could have done a great job, but no one mentions that. The work is about the team, the culture of data-driven decisions, fair and firm debate, and a refusal to compromise. Credit should go unto those who deserve it: the incredible teams at the company. My own part, if any, was managing to get out of the way of the folks doing the work.
Reed Hastings, the CEO of Netflix, is one of the smartest folks around in my book. His article on why Tilson should cover his Netflix short position strongly reinforces that belief. The entire article is a great lesson on how to think clearly about business, but here I want to focus on the excerpt relevant to cloud computing, quoted below:
We will be working to improve the FCF conversion trend in 2011. On a long term basis, FCF should track net income reasonably closely, as it has in the past, with stock options as an offset against small buildups in PPE and prepaid content. Nearly all of our computing is through Amazon (AMZN) Web Services and CDNs, which are pure opex. [emphasis mine]
The key part is bolded above. Nearly all of Netflix's computing is on-demand, which is pure opex. Is it more expensive than building it in-house per unit of compute? Almost certainly. However, as Reed mentions in the paragraph above, he is pushing to improve control over Free Cash Flow (FCF) and bring it in on a quarter-by-quarter basis. Not having large capital costs is key to that. He specifically calls out that “Management at Netflix largely controls margins, but not growth.”
With minimal capital costs acting as drag and Netflix computing almost entirely opex based, moving FCF management into the quarter by quarter range is a lot more feasible, with the attendant ability to fine-tune his margins.
Cloud computing is already here – it’s just unevenly distributed. Reed Hastings is ahead of most.
Some papers on the growing trend of warehouse-scale computing, the Internet transformation driven by datacenter applications, and the opportunities and challenges for fiber optic communication technologies to support their growth in the next three to four years.
A cost/benefit analysis of reliability and engineering tradeoffs.
Availability refers to the ability of the user community to access the system to do work. If a user cannot access the system, it is said to be unavailable.
A typical SLA for many services is expressed in nines (9s) of availability. For example, here is the AWS S3 SLA. Before we proceed further, I’d like to lay out the precise downtime allowed per “9.” The table below is from Wikipedia.
| Availability % | Downtime per year | Downtime per month* | Downtime per week |
|---|---|---|---|
| 90% (“one nine”) | 36.5 days | 72 hours | 16.8 hours |
| 95% | 18.25 days | 36 hours | 8.4 hours |
| 98% | 7.30 days | 14.4 hours | 3.36 hours |
| 99% (“two nines”) | 3.65 days | 7.20 hours | 1.68 hours |
| 99.5% | 1.83 days | 3.60 hours | 50.4 minutes |
| 99.8% | 17.52 hours | 86.23 minutes | 20.16 minutes |
| 99.9% (“three nines”) | 8.76 hours | 43.2 minutes | 10.1 minutes |
| 99.95% | 4.38 hours | 21.56 minutes | 5.04 minutes |
| 99.99% (“four nines”) | 52.56 minutes | 4.32 minutes | 1.01 minutes |
| 99.999% (“five nines”) | 5.26 minutes | 25.9 seconds | 6.05 seconds |
| 99.9999% (“six nines”) | 31.5 seconds | 2.59 seconds | 0.605 seconds |
* For monthly calculations, a 30-day month is used.
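The rows above follow from one multiplication: the unavailable fraction times the length of the period. A minimal Python sketch, assuming a 365-day year and the 30-day month noted above (the function name `downtime_seconds` is mine, not from any library):

```python
SECONDS_PER_DAY = 24 * 60 * 60


def downtime_seconds(availability_pct: float, period_days: float) -> float:
    """Downtime budget in seconds for a period at a given availability percentage."""
    unavailable_fraction = 1 - availability_pct / 100
    return unavailable_fraction * period_days * SECONDS_PER_DAY


# Print the yearly, monthly and weekly budget for a few of the "nines".
for pct in (99.0, 99.9, 99.99, 99.999, 99.9999):
    per_year = downtime_seconds(pct, 365)
    per_month = downtime_seconds(pct, 30)  # 30-day month, as in the table footnote
    per_week = downtime_seconds(pct, 7)
    print(f"{pct}%: {per_year:.1f} s/year, {per_month:.2f} s/month, {per_week:.3f} s/week")
```

Dividing the results by 60 or 3600 gives the minutes and hours shown in the table.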
Let's say your system is responsible for a gross revenue of 12 billion USD a year and the revenue processing is equally distributed over time (a safer assumption for a global company; if it is not, change the model to account for peak-hour usage). If the system uptime is five 9s (99.999%), should you spend the money to go to six 9s (99.9999%) availability? The answer is no.
Revenue/Year is 12B: 12,000,000,000.00.
Revenue/Second is 380.27
Lost revenue at 5-9s availability is 120,000 and lost revenue at 6-9s is 12,000. The delta is 108,000, which is less than the loaded cost of 1 FTE. So there is no reason to go to six nines at that revenue number if it requires more than 1 FTE worth of work. Full model is here.
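The arithmetic can be reproduced in a few lines. This is a rough sketch of the same back-of-the-envelope model, not the full model linked above; it assumes revenue is spread evenly over time, and the FTE figure is a hypothetical placeholder of mine, since the post gives no number for loaded cost.

```python
ANNUAL_REVENUE_USD = 12_000_000_000

# Hypothetical loaded cost of one FTE, for illustration only (not from the post).
HYPOTHETICAL_FTE_LOADED_COST_USD = 200_000


def lost_revenue(annual_revenue: float, availability_pct: float) -> float:
    """Revenue lost per year to downtime, assuming evenly distributed revenue."""
    return annual_revenue * (1 - availability_pct / 100)


def worth_upgrading(revenue_saved: float, engineering_cost: float) -> bool:
    """The extra nine pays off only if the revenue it saves exceeds its cost."""
    return revenue_saved > engineering_cost


five_nines_loss = lost_revenue(ANNUAL_REVENUE_USD, 99.999)
six_nines_loss = lost_revenue(ANNUAL_REVENUE_USD, 99.9999)
delta = five_nines_loss - six_nines_loss

print(f"Loss at five nines: {five_nines_loss:,.0f}")
print(f"Loss at six nines:  {six_nines_loss:,.0f}")
print(f"Delta:              {delta:,.0f}")
print("Worth one FTE?", worth_upgrading(delta, HYPOTHETICAL_FTE_LOADED_COST_USD))
```

Swap in your own revenue and staffing cost to see where the break-even sits.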
Sean M. Doran (smd) had some interesting insights as well. I’ve added them here with a bit of editing.
It’s an actuarial issue, and you can further trade off an SLA arrangement with business interruption insurance from a third party (which might be cheaper!).
We (FSVO “we”) were exploring an SLA that really was structured as an insurance policy with different charges for different payouts if the SLA weren’t met — cheap if you just want to add “free” days to the duration of the contract, less cheap if you wanted to have money knocked off the next invoice, and insurance market rates for cash payouts in the event of interruption beyond that in the SLA, right up to payouts for maintenance work not scheduled in advance of agreement on the policy. Unfortunately, as has happened from time to time, there was a change of direction elsewhere in the organization, so this never advanced beyond a couple of experimental deals.
A couple of “be evil” observations:
Large buyers are often not really good at doing exactly the sort of analysis you wrote above. You can charge them more as a result, as you hint.
Actuarials on the supplier/insurer side now know the cost of any given outage and can feed that into decisions on how to allocate capital vs operational investment, how free you are to make changes that risk service interruptions, and so forth.
Plus it gives you another pricing knob you can turn fairly dynamically. Unbundling and pricing dynamics is often good for both parties, although many large buyers still prefer costs that are fixed in advance, even if that turns out to be more expensive. Great. Charge them more.
Two “be stupid evil” observations:
— Telcos rarely get their invoices right in the first place; adding knobs meets with resistance as a result
— Telcos far far far prefer people to make claims and throw up barriers to making one successfully. So do many insurers.
Finally: statistical service offerings are great, if you have sales channels that can cope with them. I wanted it out of their hands, with some wording in the contract that allowed for a device like a customer web page with sliders that allowed them to dynamically adjust bandwidth caps, statistical delay and statistical drop parameters with a couple of presets along the lines of “[ ] UUNET quality [ ] AUCS quality”. The idea was to offer just slightly better than the competition’s SLAs and measured/reported performance at a similar price, but our much better performance for our almost always higher prices.
Softening the strict requirement of optimality can make problems tractable. Put another way, it is more important to quickly narrow the search for an optimal solution to a “good enough” subset than to calculate the “perfect solution.” Ordinal (which is better) before Cardinal (value of optimum).
Compare the two scenarios presented below:
- Getting the best decision for certain – Cost = $1m
- Getting a decision within the top 5% with probability = 0.99* – Cost = $1m/x
In real life, we often settle for such a tradeoff with x=100 to 10,000.
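One way to see why the second option is so cheap is a simple blind-pick calculation. This is my own illustration, not necessarily the method in the work cited below: it assumes candidates are drawn uniformly and independently from the design space, and it only counts whether at least one draw lands in the top 5%. Picking out the best draw then needs only a rough ranking of the draws, which is the "ordinal before cardinal" point.

```python
import math


def prob_hit_top_fraction(n_samples: int, top_fraction: float = 0.05) -> float:
    """Chance that at least one of n independent uniform draws is in the top fraction."""
    return 1 - (1 - top_fraction) ** n_samples


def samples_needed(confidence: float = 0.99, top_fraction: float = 0.05) -> int:
    """Smallest n for which prob_hit_top_fraction(n, top_fraction) >= confidence."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - top_fraction))


n = samples_needed(confidence=0.99, top_fraction=0.05)
print(f"Draws needed: {n}")
print(f"Probability of a top-5% draw with {n} samples: {prob_hit_top_fraction(n):.4f}")
```

Under these assumptions the number of draws does not depend on the size of the search space, which is what makes the "good enough" target tractable.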
For systems that are not life-threatening, the focus should be on fast fault detection and mitigation (discover and recover) instead of exhaustively trying out every possible scenario, which will make the system perfect but at such a cost that forward progress turns glacially slow. At that point your quicker, nimbler opponents will run over you with their faster product cycles.
People constantly deceive themselves into thinking that by paying much more they are getting a 100% solution, but in reality you never get a true 100% solution, so it's really a choice between different levels of less than perfect.
*Under independent sampling, the standard error of an estimate decreases as 1/sqrt(n). Each order of magnitude increase in certainty therefore requires 2 orders of magnitude increase in sampling cost. To go from p=0.99 to near certainty (p=0.99999) implies a 1,000,000-fold increase in sampling cost.
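The footnote's scaling can be checked numerically. A small sketch, assuming (as the footnote does) that the residual uncertainty 1 - p shrinks in proportion to 1/sqrt(n), so the sample count scales with the inverse square of that uncertainty; the function name is mine:

```python
def sampling_cost_multiplier(p_from: float, p_to: float) -> float:
    """Factor by which the sample count grows to move from certainty p_from to p_to.

    Assumes residual uncertainty (1 - p) is proportional to 1/sqrt(n),
    so n is proportional to 1 / (1 - p)**2.
    """
    return ((1 - p_from) / (1 - p_to)) ** 2


multiplier = sampling_cost_multiplier(0.99, 0.99999)
print(f"Cost multiplier from p=0.99 to p=0.99999: {multiplier:,.0f}")
```

Three orders of magnitude less uncertainty costs six orders of magnitude more samples.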
Cardinal vs Ordinal work by Dr. Yu-Chi Ho
Can someone explain what Vodafone’s core business is?
Based on a discussion with some friends, I decided to do a very simple model pitting Amazon Web Services (AWS) against colocation in commercial space with owned gear. This model makes a few simplifying assumptions, including that managing AWS takes effort of the same order of magnitude as managing your own gear. As someone put it:
You’d be surprised how much time and effort I’ve seen expended project-managing one’s cloud/hosting provider – it is not that different from the effort required for cooking in-house automation and deployment. It’s not like people are physically installing the OS and app stack off CD-ROM anymore; I’d imagine whether you’re automating AMIs/VMDKs or PXE it’s a similar effort.
People seem to get particularly upset about ratios in SFI requirements. The argument almost always degenerates into bit/miles and “we’ll meet you at your specified points and we’ll cold-potato.” All these miss the salient point: the SF part of SFI is not based on bit/miles and meeting points. People always argue based on cost, when it is really based on value. Ratios in the end are just one data point in the equation.
WordPress has an Android client and it is excellent. Mobile is the new laptop.