We shape our tools and then our tools shape us

December 26, 2017

The title words are by Father John Culkin, SJ. Re-reading them years later made me reflect on my journey here.

Memories of events are anisotropic: Memory changes the circumstances, reworks the lessons and retells the stories, so the UUNET I remember isn’t the UUNET that was, but it’ll do. Google deservedly gets the credit for SRE, but it was at UUNET that I first remember learning from Bill Barns, Mike O’Dell, and Louis Mamakos et al. that to scale in distributed systems, you had to build machines that built machines.


May 18, 2016

Artifacts are easy to copy. The culture and process that allow you to easily make new artifacts are a lot harder to duplicate. This leads to cargo cults like agile, microservices, containers, open plan offices and saying “but it worked for Google” while crying.

Underinvestment in Infrastructure Orgs

April 21, 2015

Systemic underinvestment and gutting of IT org capabilities lead to charts that look like:

This leads to breathless headlines like this: https://vijaygill.wordpress.com/2015/04/17/before-moving-to-amazon-web-services-it-took-itv-21-days-to-make-a-single-firewall-change/

What you want to do, if you have visionary senior leaders, is to invest in the org so the graph looks something like:


Why Nothing Great Ever Came From Outsourced Teams

April 17, 2015

I am just going to quote Xs’ mail here verbatim; there is nothing that I could possibly add.

From: XX

This is all ancient history at this point, but just to offer some insight on AOL, which may have relevance to the AT&T/Apple issues (and also as a clarification to Brock’s post).

Like many large organizations, AOL (circa 1993) left up to other firms how to build out their own networks to support our needs.   We did all the marketing research, capacity planning, and internal network engineering then supplied requirements to several firms who were actually building modems in the field.  AOL then took aggregated calls on dedicated backhauls into our data centers.  We knew where we going to send disks or CDs in bulk and on what schedule, so could predict with some certainty where modems would be needed and when.  But it was up to someone else to do the heavy lifting and build racks of gear in central offices around the country based on our direction.

When the company was growing at breakneck speed, our people were often working around the clock building internal systems…. lots of 2 AM pizza deliveries, and sometimes sleeping on the floor in the office.  We were ready internally, since we essentially ran the company like a startup.  And we were motivated not just by the thrill of it all but a looming financial incentive in terms of stock options to get it right.

Our contracted modem builders (certainly at 2 of the 3 companies we partnered with) didn’t operate the same way.   Try telling the guys putting together modem banks in Iowa City or Plano that they have to pull a 16 hour shift, and oh by the way, don’t expect a weekend off for the next few months.   It’s a different world when you’re dealing with large bureaucracies, union contracts, and folks that certainly don’t have stock or other financial incentives.

So who knows what happened with the AT&T sign-up fiasco.   But there’s a cultural difference between these two companies, and just because folks at Apple are able to pull months of all-nighters to ship the new phone, doesn’t mean their technical counterparts at AT&T are equally motivated.  There probably aren’t lots of AT&T database programmers who are counting the days until their stock vests and they can retire.

And there’s no telling if Verizon had the exclusive contract that this situation would have been any better.   Of all the big mobile carriers, T-Mobile probably most closely matches Apple in terms of culture.   Perhaps no mobile network would have been able to deal with the dramatic amount of data growth when the original iPhone shipped.  In hindsight, maybe the iPhone should have been available from day one in two variants e.g. both GSM (AT&T/T-Mobile) and CDMA (Verizon).   At the minimum Apple would have had two companies to work with, and hopefully one of them was able to keep up.

You’d expect large companies to handle this kind of growth efficiently, but building these systems can be major engineering challenges.  It doesn’t take much to have the logistics get off track and suddenly you’re behind the power curve trying to catch a moving target.   AT&T management may already be cutting resources if they know full well that Apple is taking their business elsewhere in a year or two.   There’s plenty of blame to go around, but differences in culture, motivation, and economics all probably played a factor in how this whole situation came to be.

(AOL’s network guy, ’93-98)

AMD exits dense microserver business. Ends Seamicro brand

April 17, 2015

I remember a VC dinner in Palo Alto about 5-6 years ago where SeaMicro came up; I think I was seated next to one of the founders of SeaMicro. I argued that SeaMicro had no business model because the magic was in the control and scheduling software, process isolation, allocation and management, and not in cheap dense cores, which were a canard. And so here we are.


Before moving to Amazon Web Services, it took ITV ’21 days to make a single firewall change’

April 17, 2015

That’s more of an indictment of ITV’s technical leadership than an endorsement of AWS. What were the leaders doing? http://www.computing.co.uk/ctg/news/2404193/before-moving-to-amazon-web-services-it-took-itv-21-days-to-make-a-single-firewall-change

Point of View

January 26, 2015

There is a problem with the uni-button trackpad trend on PC laptops. That’s what happens when you don’t have a strongly developed point of view. Blindly emulating Apple’s trackpad for a system with two buttons is a problem because you can’t distinguish between the two buttons by tactile feedback. With a Mac, you can get away with a uni-button trackpad because there is only one click action, with the modifiers being how many fingers are on the pad, not where on the trackpad you click. This is why building systems without a point of view is dangerous: you’ve now caused a regression in UX, because you did not fully grok the why; instead, you just went ahead and did the what. This can be applied to many engineering systems.

Cloud Services and outcome-blind decision making

June 28, 2014

Whenever a cloud outage occurs [1], my social media stream is filled with people questioning the concept of cloud services in general, the competency of a company to run and operate the service, and anecdotes about the on-premise setups that haven’t taken a hit in years. This perfectly illustrates why humans in aggregate are bad at assessing risk and math. This post doesn’t cover cloud failure modes exhaustively; rather, it focuses on risk and on why directly comparing an IT shop to global service providers is not a statistically valid comparison.

The cloud risk assessment framework thought process has to account for the following factors [2]:

  • People exaggerate spectacular but rare risks and downplay common risks
  • People underestimate risks they willingly take and overestimate risks in situations they can’t control
  • People overestimate risks that are being talked about and remain an object of public scrutiny

A cloud service fits into all above categories.

Now, let’s play a game:

I roll two (2) fair, six-sided dice and sum the numbers on the top two faces; the result will range from a minimum of 2 to a maximum of 12. The game continues for 100 rolls. Before each roll, you pick a number. If that number comes up, I pay you 1 dollar. If that number doesn’t come up, you get nothing. What strategy should you follow to maximize the amount of money won?

Try to think about this for some time. If you have some dice, try a couple of rolls before proceeding further.

The number you want to pick is exactly the same for all 100 rolls: Seven.

For each round, you cannot predict with certainty what any particular roll of the dice will produce. However, regardless of what number comes up on any particular roll, you should bet on 7. This is an outcome-blind decision, because statistically with two fair dice, the most probable sum of the top two faces is seven [3]: 6 of the 36 equally likely outcomes sum to 7. A different number coming up on the dice roll doesn’t invalidate the decision to bet on 7. In other words, separate decisions from outcomes. All you need to know is that over 100 rolls, statistically, 7 will show up more often than any other number, and therefore to maximize earnings, bet on 7.
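To make the dice argument concrete, here is a minimal sketch that enumerates all 36 outcomes of two fair dice and computes the expected winnings for each possible pick. The function names (two_dice_distribution, expected_winnings) are made up for illustration; the 100 rolls and the 1 dollar payout come from the game described above.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def two_dice_distribution():
    """Exact probability of each sum (2..12) for two fair six-sided dice."""
    # Enumerate all 36 equally likely (die1, die2) pairs and count each sum.
    counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
    return {total: Fraction(n, 36) for total, n in sorted(counts.items())}


def expected_winnings(pick, rolls=100, payout=1):
    """Expected payout when betting on the same sum for every roll."""
    return rolls * payout * two_dice_distribution()[pick]


if __name__ == "__main__":
    dist = two_dice_distribution()
    for total, p in dist.items():
        # Show the probability of each sum and what betting on it earns on average.
        print(f"{total:2d}  p={p}  expected=${float(expected_winnings(total)):.2f}")
    print("best pick:", max(dist, key=dist.get))
```

Betting on 7 every time has an expected return of 100 × 1/6, or about 16.67 dollars, against about 13.89 dollars for 6 or 8. Any individual roll can still come up with something else, which is exactly why the decision has to be judged separately from the outcome.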

Cloud outages dominate the news cycle, but to conduct a fair experiment, the cloud services must be compared to on-premise installs per account/minutes of availability in aggregate. Given the amount of talent and engineering effort required to run any cloud service at scale, the probability is high that for aggregate account/minutes of availability, cloud services are significantly more available than aggregate on-premise installations. If you are making outcome-blind decisions, they should favor cloud.

Whenever people mention a particular outage and compare it to some in-house implementation that hasn’t had an outage in years, point them to a good book on Poker and send them here.

Edit: people have pointed out that there is a lack of good aggregate data for on-prem. What data there are, are self-reported and noisy. A good proxy is the amount of data loss reported by the big storage systems in cloud – of which there hasn’t been any so far by the big providers. Taking KiB/month as a durability metric vs. data loss by smaller providers is a proxy for general system hygiene and competence [4].

[1] Google, Microsoft, Salesforce, Amazon

[2] Bruce Schneier

[3] Two dice distribution

[4] Data loss report

Startups are hard in big companies and Survivorship Bias

June 21, 2014

Hunter Walk and Steven Sinofsky were articulating reasons why startups are hard inside established companies. They are by and large correct but I think they need to take survivorship bias into account.

What people remember are the ones that made it big, not the ones that died silently. For every WhatsApp, there are many BelugaPods. On the outside, you have 3-4 startups doing similar things, from which you can select the winner. Internally, there is often only one effort. We still can’t articulate with any degree of accuracy which factors make success happen, so if you are taking 4-5x more shots on goal on the outside, you’ll end up with more successful outcomes, simply because you are taking more shots. Buying a successful “disruptive” startup is discovery.
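The shots-on-goal arithmetic can be sketched in a few lines. This is only an illustration under loud assumptions: every attempt is independent and has the same chance of success, and the 10% figure is a hypothetical number picked for the example, not data. The function name p_at_least_one_success is likewise made up.

```python
def p_at_least_one_success(p, shots):
    """Chance that at least one of `shots` independent attempts succeeds.

    Assumes each attempt succeeds with the same probability p (hypothetical).
    """
    if not 0 <= p <= 1:
        raise ValueError("p must be between 0 and 1")
    if shots < 0:
        raise ValueError("shots must be non-negative")
    # Complement rule: 1 minus the chance that every attempt fails.
    return 1 - (1 - p) ** shots


if __name__ == "__main__":
    p = 0.10  # hypothetical per-attempt success rate, for illustration only
    for shots in (1, 4, 5):
        print(f"{shots} shot(s): {p_at_least_one_success(p, shots):.2f}")
```

With these made-up numbers, a single internal effort succeeds about 10% of the time, while four independent outside efforts produce at least one winner about 34% of the time, without any of them being individually better.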

Conversation for more details is linked below.


April 8, 2014

Any effective organization consists of organizational design (how the organization is shaped), the organizational talent, and the operating culture. All three combine to deliver organizational capability (the value the organization delivers to the company). In other words, the culture is the operating system of the organization that allows the people to deliver value. Culture sets expectations, behaviors, operating norms and contracts that allow people to act with efficacy, with role clarity and to refer to a common set of principles that allow for effective delegated decision making and accountability. Setting context and then allowing freedom to execute has proven effective in my experience, given the right organizational design and talent embedded in appropriate cultural norms.