Infrastructure is software

In an earlier post I mentioned that “cloud is software.” Thinking about it some more, I believe the statement can be generalized to “Infrastructure is software.” This is a bit different from how people have traditionally viewed it: Internet infrastructure is seen as pipes, disks, CPUs, and data centers, the physical units that provide transport, storage, and compute, plus the buildings that house them. My thesis is that those are necessary but not sufficient to be considered infrastructure. Those elements, in and of themselves, are just so much sunk capital. To make efficient use of them you need the right provisioning APIs, monitoring, billing, and software primitives that abstract away the underlying systems, decoupling the various technological and business imperatives so that each layer can evolve independently within its own scaling domain (within reason: if you are writing ultra-high-performance code, you will know the difference if you get instantiated on an Opteron rather than a Nehalem cluster).
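
To make the decoupling concrete, here is a minimal sketch in Python, with hypothetical names, of what programming against a provisioning abstraction rather than against the hardware looks like: the higher layers never know whether they landed on an Opteron or a Nehalem box.

```python
# Hypothetical sketch: higher layers program against an interface,
# not against the physical gear underneath.
from abc import ABC, abstractmethod


class ComputeBackend(ABC):
    """Anything that can hand out compute capacity: a rack of Opterons,
    a Nehalem cluster, or a virtualized pool. Callers never know which."""

    @abstractmethod
    def provision(self, cores: int, memory_gb: int) -> str:
        """Return an opaque instance identifier."""


class NehalemCluster(ComputeBackend):
    def provision(self, cores: int, memory_gb: int) -> str:
        # A real system would talk to an inventory/out-of-band layer here.
        return f"nehalem-{cores}c-{memory_gb}g"


def fulfil_order(backend: ComputeBackend, cores: int, memory_gb: int) -> str:
    # Billing and monitoring hooks attach here, against the abstract
    # interface, so each layer can evolve on its own schedule.
    return backend.provision(cores, memory_gb)


print(fulfil_order(NehalemCluster(), cores=8, memory_gb=32))
```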

Let’s make this concrete and think about how the above can inform the building and operation of a global service provider with a large network and datacenters used for a cloud computing business: a large telecommunications company, for example, that wants to provide enterprise cloud computing among a suite of services.

Basic Axioms

All things come down to the fundamental problem of mapping demand onto a set of lower level constraints. For a telecom company, constraints at the lowest level consist of:

  1. Fiber topology (or path/Right of Ways)
  2. Forwarding capacity
  3. Power & Space
  4. Follow The Money (FTM)

Everything else is an abstraction of the above constraints. That is the good news. The bad news: everyone has the same constraints. There are no special routers available to you and not to others, and the speed of light is constant (modulo the refractive index of the fiber in your physical plant). So how do you differentiate yourself? Fortunately, the differentiators are also simple:

  • Latency
  • Cost (note I did not use price for a reason)
  • Open Networks
  • Rich connectivity
  • OSS/NMS

Latency

The impact of latency is well documented. Some excerpts from Velocity 2009:

Eric Schurman (Bing) and Jake Brutlag (Google Search) co-presented results from latency experiments conducted independently on each site. Bing found that a 2 second slowdown changed queries/user by -1.8% and revenue/user by -4.3%. Google Search found that a 400 millisecond delay resulted in a -0.59% change in searches/user. What’s more, even after the delay was removed, these users still had -0.21% fewer searches, indicating that a slower user experience affects long term behavior. (video, slides)

Phil Dixon, from Shopzilla, had the most takeaway statistics about the impact of performance on the bottom line. A year-long performance redesign resulted in a 5 second speed up (from ~7 seconds to ~2 seconds). This resulted in a 25% increase in page views, a 7-12% increase in revenue, and a 50% reduction in hardware. This last point shows the win-win of performance improvements, increasing revenue while driving down operating costs. (video, slides)
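
A back-of-the-envelope sketch of why those percentages matter; the baseline revenue figure below is hypothetical, and only the -4.3% comes from the Bing experiment quoted above.

```python
# Hypothetical baseline; the percentage is the Bing figure quoted above.
annual_revenue = 1_000_000_000        # assumed $1B/year service
bing_revenue_hit_per_2s = 0.043       # -4.3% revenue/user for a 2 s slowdown

loss = annual_revenue * bing_revenue_hit_per_2s
print(f"A sustained 2 s slowdown costs roughly ${loss:,.0f} per year")
```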

If you want to get into the cloud computing business, you will have to build your network and interconnection strategy to minimize latency. Your customers’ bottom line is at stake here, and, by extension, so is your datacenter division’s P&L.

Cost

Sean Doran wrote, “People that survive will be able to build a network at the lowest cost commensurate with their SLA.” He forgot to add: in a competitive market. Assuming you are going up against competition, this should be fairly self-evident: efficiency and razor-thin margins. The killer app is bandwidth, and that means emulating Walmart™: learn to survive on margins of 10% or lower. At those margins, your OSS/NMS are competitive advantages. Every manual touch point in the business, every support call for a delayed order, every failure in provisioning, every salesperson who sells a service that can’t be provisioned properly nibbles at the margin. Software that can provision the network, enable fast turn-up, and provide proper accounting and auditing is the key.
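
A toy illustration, with entirely hypothetical numbers, of how quickly manual touch points eat a 10% margin:

```python
# All figures below are assumptions for illustration only.
price_per_order = 2_000.0       # hypothetical monthly price for a circuit
margin = 0.10                   # "learn to survive on margins of 10% or lower"
cost_per_manual_touch = 75.0    # hypothetical loaded cost of one ops/support touch

profit = price_per_order * margin
touches_to_break_even = profit / cost_per_manual_touch
print(f"Profit per order: ${profit:.0f}; "
      f"gone after ~{touches_to_break_even:.1f} manual touches")
```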

And we react with great caution to suggestions that our poor businesses can be restored to satisfactory profitability by major capital expenditures.  (The projections will be dazzling – the advocates will be sincere – but, in the end, major additional investment in a terrible industry usually is about as rewarding as struggling in quicksand.)
-Warren Buffett

Efficiency also means fewer operational issues. Couple an ever-increasing number of elements with an ever-growing mass of policy and you start to lose any semblance of troubleshooting and operational simplicity. Does the network pass the 3 AM on-call test? More policy means more forwarding complexity, and that means more cost hitting your bottom line. A more insidious effect of intelligent, complex networks is that they inhibit experimentation. The theory of Real Options points out that experimentation is valuable when market uncertainty is high. Designing an architecture that fosters experimentation at the edge therefore creates more potential value than centralized administration, because distributed structures promote innovation and enable experimentation at low cost. Putting the intelligence in the applications rather than in the network is a better use of capital: otherwise, applications that don’t need that robustness end up paying for it, and experimentation becomes expensive.

Open Networks

Open networks strike fear into the hearts of service providers everywhere. If you are in a commodity business, how do you differentiate yourself? How about providing a service that works well, cheaply? But wait a minute! Whatever happened to “climb up the value chain”? The answer is nothing. You have to decide what business you are in. Moving up the value chain and providing ever higher-touch services is in direct conflict with providing low-cost bulk bandwidth. Pick businesses that require either massive horizontal scaling or deep vertical scaling. Picking both leaves you vulnerable to more narrowly focused competitors in each segment. If horizontal scaling is central to one business, trying to run an orthogonal model as a second core business will end up annoying everyone and serving no one well. However, if the software interface to the horizontal business is exposed to the vertical, high-touch side of the business, the two can be decoupled and allowed to scale independently. This means provisioning, SLA reporting, billing, and usage reporting are all exposed via software mechanisms.
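
As a rough sketch, with hypothetical names, the exposed interface might look something like this; the point is that the high-touch side of the business consumes these calls rather than sharing the same org chart and processes.

```python
# Hypothetical sketch of the software boundary between the horizontal
# bulk-bandwidth business and the vertical high-touch business.
from abc import ABC, abstractmethod


class TransportAPI(ABC):
    """What the low-cost bandwidth business exposes; the high-touch side
    builds on these calls instead of on shared manual processes."""

    @abstractmethod
    def provision_circuit(self, a_end: str, z_end: str, mbps: int) -> str:
        """Order a circuit between two on-net locations; returns a circuit id."""

    @abstractmethod
    def usage_report(self, circuit_id: str) -> dict:
        """Billing and usage data for a circuit."""

    @abstractmethod
    def sla_report(self, circuit_id: str) -> dict:
        """Availability and latency against the contracted SLA."""
```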

Rich Connectivity

Let me start off by saying content is not king.

Gaming companies are making the same mistakes as the content guys. They always over-estimate the importance of the content and vastly underestimate the desire of users/people to communicate with each other and share…
-Joi Ito

The Internet is a network of networks. The real value of a network is realized when it connects to other networks; more detail can be found in Metcalfe’s Law and Reed’s Law. Making interconnection with other networks harder than necessary will eventually result in isolation and a drift into irrelevance (in an open market). If people transiting your network to reach another network find that the interconnection between your network and their destination network is chronically congested or adds significant latency, their incentive to interconnect directly with the destination network, or to find another upstream, grows stronger.
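
For a quick sense of the two laws: Metcalfe values a network roughly by the number of possible pairs (about n^2), Reed by the number of possible groups (about 2^n). A small illustration:

```python
# Metcalfe: distinct pairs (~n^2); Reed: subgroups of size >= 2 (~2^n).
for n in (10, 20, 40):
    metcalfe = n * (n - 1) // 2     # distinct pairs of members
    reed = 2 ** n - n - 1           # possible groups of two or more
    print(f"n={n:>3}  pairs={metcalfe:>6}  groups={reed:>15,}")
```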

It ain’t the metal, it ain’t the glass; it’s the wetware.
-Tony Li

OSS/NMS

Make the network database-authoritative. This allows for faster provisioning, consistency, and auditing. You can tell authoritatively whether two buildings across the country or the world are on-net and, more importantly, whether and in what timeframe they can be connected together. This is especially true if you have a few acquisitions with a mixture of assets. Just mashing together the lists of buildings that are now on-net with the merged entity doesn’t actually tell you whether they can be connected easily or only through several different fiber runs, patch panels, and networks. If the provisioning systems were correct, the sales folks could tell prospective customers when services could be delivered, because they’d know whether connecting two buildings involves ordering cross-connects or doing a fiber build. We provision thousands of machines automatically; why treat thousands of routers differently? The systems that automatically provision and scale your network are hard to implement, but they can be built. It only requires the force of will to make it happen.
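
A minimal sketch of the database-authoritative idea, with hypothetical building names and data: if the fiber plant is a graph in a database, “can these two buildings be connected, and how?” becomes a query rather than a spreadsheet mash-up.

```python
# Hypothetical inventory: buildings as nodes, fiber segments as edges.
from collections import deque

# adjacency: building -> list of (neighbor, connection type)
fiber_plant = {
    "NYC-111-8th": [("NWK-165-Halsey", "dark fiber")],
    "NWK-165-Halsey": [("NYC-111-8th", "dark fiber"), ("ASH-DC2", "leased wave")],
    "ASH-DC2": [("NWK-165-Halsey", "leased wave")],
    "SJC-11-Great-Oaks": [],        # on-net, but no path east yet
}


def path_between(a: str, z: str):
    """Return the list of hops connecting a to z, or None if a build is needed."""
    queue = deque([[a]])
    seen = {a}
    while queue:
        path = queue.popleft()
        if path[-1] == z:
            return path
        for neighbor, _ in fiber_plant.get(path[-1], []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


print(path_between("NYC-111-8th", "ASH-DC2"))            # cross-connects suffice
print(path_between("NYC-111-8th", "SJC-11-Great-Oaks"))  # None -> fiber build
```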

All of this gives a better quality of service to the end user and is a competitive advantage, reducing OPEX and SLA payouts due to configuration errors. You can further extend your systems to do things like automatic rollbacks when a change goes wrong.
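
A hedged sketch of what automatic rollback can look like; the functions below are placeholders, not a real NMS API.

```python
# Placeholder functions standing in for a real provisioning system.
def apply_change(router: str, config: str) -> str:
    """Push a config and return the previous config so it can be restored."""
    previous = f"saved-config-of-{router}"   # placeholder
    print(f"pushing change to {router}")
    return previous


def healthy(router: str) -> bool:
    """Placeholder for real checks: BGP sessions, interface errors, probes."""
    return True


def change_with_rollback(router: str, new_config: str) -> bool:
    previous = apply_change(router, new_config)
    if healthy(router):
        return True
    print(f"health check failed, rolling back {router}")
    apply_change(router, previous)
    return False


change_with_rollback("edge-router-1", "new-policy")
```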

Software is the key to any business that deals with the Internet, no matter what that business is, and this will only become more true going forward.


5 Responses to Infrastructure is software

  1. JZP says:

    I find it interesting (and odd. and sad.) that this needs to be said. “Sad” when thinking about what this implies for the state of practical education for packet-switched network design & operation. “Odd” that more of this isn’t just obvious to people involved.

    Perhaps the technical education that targets certification has been pigeonholing folks too much. The 03:00 test, smart edges/dumb network, and focusing on high automation/low manual intervention were all topics straight from the start-up days for most of us, no? It was either 1994 or 1995 when I was pushing the view that my group of devices being configured was just a system being programmed. As such, peer review and other error-finding methods borrowed from software engineering just made sense. The same goes for “fully specifying a maintenance such that the 3rd shift can do it”.

    Why are these approaches not intuitive to people? Do they really think some new whiz-bang [web2.0, cloud, whatever] that relies on a packet-switched infrastructure somehow operates by different rules?

  2. Great post Vijay – this is exactly how we’ve approached the concepts of distributed computing and the front-end latency implications vis-à-vis the new supply chain paradigm of Infrastructure > Platform > Application – the latter being what most folks will actually care about, and that is the business (not just building for an engineer’s sake).

    I’ve started to introduce “close-coupled clouds” as those within the 80 km distance threshold (of, say, VMware’s vMotion), and true distributed cloud infrastructure as that designed to serve an application to its end users with the least cost and the least latency – business decisions inexorably intertwined with technical implications… Do you think there are, besides Google today (and I would love to pick your brain on that), truly distributed clouds that interconnect seamlessly so that end users can actually manage latency requirements today?

  3. director of cloud says:

    One problem with having OSS/NMS running the configuration show is that people do screw it up, badly. It’s a great concept until things turn backwards and you end up having to spend more development effort on systems work.

    For example, in some of the companies that go this route, a new feature, command, or piece of functionality would normally take a few lines on the CLI. Now you have to push that into backend systems to do the work (and to comprehend the changes when they see them in the network). Which, for most reasonable people, is a fair request. The problem is that in monolithic entities, systems work is not in the hands of clueful router folks who have system/scripting/code skills. In these monolithic organizations, systems work is typically farmed out across the globe to someone who has no idea about CLI syntax or how a router works. In addition, these dinosaur companies have long IT release cycles that don’t always match up with what’s going on on the network side of things.

    So, my point is: if you are going to go down the road of OSS/NMS work, keep it tightly coupled with the folks who run the network engineering side of the business. Have a team of developers working side-by-side with the router folks so both can understand each other and you can build applications that actually help, rather than burden, people. If you lump OSS/NMS in with the same organization responsible for writing billing applications and human resources work, prepare to fail.

