We shape our tools and then our tools shape us

December 26, 2017

The title words are by Father John Culkin, SJ. Re-reading them years later made me reflect on my journey here.

Memories of events are anisotropic: memory changes the circumstances, reworks the lessons, and retells the stories, so the UUNET I remember isn’t the UUNET that was, but it’ll do. Google deservedly gets the credit for SRE, but it was at UUNET that I first remember learning from Bill Barns, Mike O’Dell, Louis Mamakos, et al. that to scale distributed systems, you had to build machines that built machines.

Winner take all

October 14, 2012

“That’s how winner-take-all works. You don’t get 100x better results because you were 100x better. You get them because you were 1% better and there was no prize for second place.” -CJ V.

Cloud Computing and Shorting

February 27, 2011

Reed Hastings, the CEO of Netflix, is one of the smartest folks around in my book. His article on why Tilson should cover his Netflix short position strongly reinforces that belief. The entire article is a great lesson on how to think clearly about business, but here I want to focus on the excerpt relevant to cloud computing, quoted below:

We will be working to improve the FCF conversion trend in 2011. On a long term basis, FCF should track net income reasonably closely, as it has in the past, with stock options as an offset against small buildups in PPE and prepaid content. Nearly all of our computing is through Amazon (AMZN) Web Services and CDNs, which are pure opex. [emphasis mine]

The key part is bolded above. Nearly all of Netflix computing is on-demand based, which is pure opex. Is it more expensive than building it in-house, per unit of compute? Almost certainly. However, as Reed mentions in the paragraph above, he is pushing to improve control over Free Cash Flow (FCF) and manage it on a quarter-by-quarter basis. Not having large capital costs is key to that. He specifically calls out that “Management at Netflix largely controls margins, but not growth.”

With minimal capital costs acting as a drag and Netflix computing almost entirely opex based, managing FCF on a quarter-by-quarter basis is a lot more feasible, with the attendant ability to fine-tune margins.

Cloud computing is already here – it’s just unevenly distributed. Reed Hastings is ahead of most.

Some papers on datacenter computing

November 19, 2010

Some papers on the growing trend of warehouse-scale computing, the Internet transformation driven by datacenter applications, and the opportunities and challenges for fiber optic communication technologies to support their growth in the next three to four years.


How many Nines?

November 10, 2010

A cost/benefit analysis of reliability and engineering tradeoffs.

Availability refers to the ability of the user community to access the system to do work. If a user cannot access the system, it is said to be unavailable.

A typical SLA for many services is expressed in nines (9s) of availability. For example, here is the AWS S3 SLA. Before we proceed further, I’d like to note the precise downtime allowed per “9.” The table below is from Wikipedia.

Availability %            Downtime/year    Downtime/month*   Downtime/week
90% (“one nine”)          36.5 days        72 hours          16.8 hours
95%                       18.25 days       36 hours          8.4 hours
98%                       7.30 days        14.4 hours        3.36 hours
99% (“two nines”)         3.65 days        7.20 hours        1.68 hours
99.5%                     1.83 days        3.60 hours        50.4 minutes
99.8%                     17.52 hours      86.23 minutes     20.16 minutes
99.9% (“three nines”)     8.76 hours       43.2 minutes      10.1 minutes
99.95%                    4.38 hours       21.56 minutes     5.04 minutes
99.99% (“four nines”)     52.56 minutes    4.32 minutes      1.01 minutes
99.999% (“five nines”)    5.26 minutes     25.9 seconds      6.05 seconds
99.9999% (“six nines”)    31.5 seconds     2.59 seconds      0.605 seconds

* For monthly calculations, a 30-day month is used.
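The table values fall out of simple arithmetic: allowed downtime is just the unavailable fraction times the length of the period. A minimal sketch (the function name is mine, not from any SLA tooling):

```python
def downtime_seconds(availability_pct, period_seconds):
    """Seconds of allowed downtime in a period at a given availability percentage."""
    return (1 - availability_pct / 100.0) * period_seconds

YEAR = 365 * 24 * 3600   # 365-day year, matching the table
MONTH = 30 * 24 * 3600   # 30-day month, per the footnote
WEEK = 7 * 24 * 3600

for nines in (99.0, 99.9, 99.99, 99.999, 99.9999):
    print(f"{nines}%: {downtime_seconds(nines, YEAR) / 60:.2f} min/year, "
          f"{downtime_seconds(nines, MONTH) / 60:.2f} min/month")
```

At five nines, for instance, this gives 315.36 seconds per year, the table’s 5.26 minutes.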

Let’s say your system is responsible for a gross revenue of 12 billion USD a year, and the revenue processing is equally distributed over time (a safer assumption for a global company; if it does not hold, change the model to account for peak-hours usage). If the system uptime is five 9s (99.999%), should you spend the money to go to six 9s (99.9999%) availability? The answer is no.

Revenue/year is $12B: 12,000,000,000.00.

Revenue/second is $380.27.

Lost revenue at five 9s of availability is about $120,000 a year; at six 9s it is about $12,000. The delta is $108,000, which is less than the loaded cost of one FTE. So there is no reason to go to six nines at that revenue number if it requires more than one FTE’s worth of work. Full model is here.
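The arithmetic above can be sketched as a few lines of Python (a back-of-the-envelope model, not the linked spreadsheet; the function name is mine). Note that the per-second rate cancels out, so lost revenue is simply annual revenue times the unavailable fraction:

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

def lost_revenue(annual_revenue, availability_pct):
    """Annual revenue lost to downtime, assuming revenue is uniform over time."""
    revenue_per_second = annual_revenue / SECONDS_PER_YEAR
    downtime_sec = (1 - availability_pct / 100.0) * SECONDS_PER_YEAR
    return revenue_per_second * downtime_sec  # = annual_revenue * (1 - availability)

five_nines = lost_revenue(12e9, 99.999)    # $120,000
six_nines = lost_revenue(12e9, 99.9999)    # $12,000
print(f"delta: ${five_nines - six_nines:,.0f}")  # $108,000
```

Swap in your own revenue and loaded FTE cost to see where the extra nine stops paying for itself.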

Sean M. Doran (smd) had some interesting insights as well. I’ve added them here with a bit of editing.

It’s an actuarial issue, and you can further trade off an SLA arrangement with business interruption insurance from a third party (which might be cheaper!).

We (FSVO “we”) were exploring an SLA that really was structured as an insurance policy with different charges for different payouts if the SLA weren’t met — cheap if you just want to add “free” days to the duration of the contract, less cheap if you wanted to have money knocked off the next invoice, and insurance market rates for cash payouts in the event of interruption beyond that in the SLA, right up to payouts for maintenance work not scheduled in advance of agreement on the policy. Unfortunately, as has happened from time to time, there was a change of direction elsewhere in the organization, so this never advanced beyond a couple of experimental deals.

A couple of “be evil” observations:

Large buyers are often not really good at doing exactly the sort of analysis you wrote above. You can charge them more as a result, as you hint.

Actuarials on the supplier/insurer side now know the cost of any given outage and can feed that into decisions on how to allocate capital vs operational investment, how free you are to make changes that risk service interruptions, and so forth.

Plus it gives you another pricing knob you can turn fairly dynamically. Unbundling and pricing dynamics is often good for both parties, although many large buyers still prefer costs that are fixed in advance, even if that turns out to be more expensive. Great. Charge them more.

Two “be stupid evil” observations:

— Telcos rarely get their invoices right in the first place; adding knobs meets with resistance as a result

— Telcos far far far prefer people to make claims and throw up barriers to making one successfully. So do many insurers.

Finally: statistical service offerings are great, if you have sales channels that can cope with them. I wanted it out of their hands, with some wording in the contract that allowed for a device like a customer web page with sliders that let customers dynamically adjust bandwidth caps, statistical delay, and statistical drop parameters, with a couple of presets along the lines of “[ ] UUNET quality [ ] AUCS quality”. The idea was to offer just slightly better than the competition’s SLAs and measured/reported performance at a similar price, but much better performance at our almost always higher prices.

Discover and Recover

September 24, 2010

Softening the strict requirement of optimality can make problems tractable. Put another way: it is more important to quickly narrow the search to a “good enough” subset than to calculate the “perfect solution.” Ordinal (which is better) before cardinal (the value of the optimum).
Compare the two scenarios presented below:

  1. Getting the best decision for certain, at cost = $1m
  2. Getting a decision within the top 5% with probability = 0.99*, at cost = $1m/x

In real life, we often settle for such a tradeoff with x = 100 to 10,000.

For systems that are not life-threatening, the focus should be on fast fault detection and mitigation (discover and recover), instead of exhaustively trying out every possible scenario, which might make the system perfect but at such a cost that forward progress turns glacially slow, at which point your quicker, nimbler opponents will run over you with their faster product cycles.

People constantly deceive themselves into thinking that by paying much more they are getting a 100% solution, but in reality you never get a true 100% solution, so it’s really a choice between different levels of less than perfect.

*Under independent sampling, the standard error decreases as 1/sqrt(n). Each order-of-magnitude increase in certainty requires two orders of magnitude more sampling cost. To go from p=0.99 to near-certainty (p=0.99999) implies a 1,000,000-fold increase in sampling cost.
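The 1/sqrt(n) scaling in the footnote is easy to see empirically. A minimal Monte Carlo sketch (my own illustration, not from the cited work): estimate the mean of Uniform(0,1) draws with n samples, and watch the spread of the estimate shrink by roughly 10x when n grows by 100x.

```python
import random
import statistics

random.seed(1)

def std_error_of_mean(n, trials=2000):
    """Empirical standard deviation of the mean of n Uniform(0,1) samples."""
    means = [statistics.fmean(random.random() for _ in range(n))
             for _ in range(trials)]
    return statistics.stdev(means)

# 100x more samples buys only ~10x more precision.
for n in (10, 1000):
    print(n, std_error_of_mean(n))
```

The theoretical values are sqrt(1/12n): about 0.091 for n=10 and 0.0091 for n=1000, which is why squeezing out each additional digit of certainty costs a hundredfold more work.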


Cardinal vs Ordinal work by Dr. Yu-Chi Ho

Satisficing (wikipedia)

Why WordPress and not Blogger part 2

April 21, 2010

WordPress has an Android client, and it is excellent. Mobile is the new laptop.

Management Books

April 11, 2010

I came across “The 12 Simple Secrets of Microsoft Management: How to Think and Act Like a Microsoft Manager and Take Your Company to the Top.” Reading it now in 2010, I can’t help but chuckle at the wide-eyed fanboy writing. Then I saw “The Google Way: How One Company Is Revolutionizing Management as We Know It,” and it cemented my opinion that whenever a book endorses any particular “way” of management with the benefit of hindsight, and claims that all it would take for your company to be similarly successful is to follow the bromides in the book, it is a clear sign that the author has no clue what they are going on about.

This is what people think matters:

Smarts and skill

This is what actually matters:

Luck and skill

Distributed Teams

February 2, 2010

A tip when working with distributed offices and workforces.

If you have a conference call, have all participants dial in from their own desks. Do not put several people in a conference room and have the distributed people dial in. There will be asymmetric information flow, as short-turnaround, high-bandwidth face-to-face interactions drown out contributions from the distributed callers.

A quick observation on cloud economics

November 21, 2009

I just finished reading about a panel on cloud economics and the enterprise. One quote in particular stood out:

“I’m not sure there are any unit-cost advantages that are sustainable among large enterprises”

A few years ago some friends of mine had a startup publishing medical journals online. They started off by getting two fractional DS3 lines from MCI and Sprint to their office building. In the basement were a few racks of servers and storage arrays, and it was off to the races. Today, if someone came up with a plan of that nature, people would look at them funny and say “get a few racks from a colo provider.” In another few years, I think the phrase is going to change to “get the compute and storage in the cloud.” The cost argument assumes today’s practice on tomorrow’s infrastructure. Next-generation business logic jobsets are going to be written for cloud frameworks, services, and primitives, which should be more aligned with cost structures that make cloud computing more efficient per unit cycle of compute or unit bit of storage.