On Terminology And Names

November 3, 2009

“It is important to distinguish between the concepts of an object, and the name(s) of that object. This has resulted in widespread confusion between the properties of the name, and those of the object itself.” A great line by J. Noel Chiappa and one very applicable to the process of Settlement Free Interconnection (SFI) or “peering” as it is commonly known.

When people say “peering” what they most often mean is a bi-lateral settlement free interconnection. This is a business term, not a technical term, because the properties of the interconnection are orthogonal to the business relationship that causes that interconnection to exist. What are these properties?

  1. Customer and network infrastructure routes (and only those routes) are exchanged
  2. Transit routes (peer routes, exchange routes) are not exchanged

From #1 and #2 above, only “on-net” routes are exchanged, which means that only “on-net” traffic destined for the network’s customers and infrastructure is exchanged.
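As a minimal sketch of the two properties above, the export decision toward a peer can be written as a filter on where a route was learned. The `Origin` labels, the `Route` type and the `export_to_peer` helper are hypothetical names made up for illustration, and the prefixes are documentation address blocks; real routers typically express this with BGP communities and route policy rather than Python.

```python
from dataclasses import dataclass
from enum import Enum


class Origin(Enum):
    """Where a route was learned (hypothetical labels for this sketch)."""
    CUSTOMER = "customer"
    INFRASTRUCTURE = "infrastructure"
    PEER = "peer"
    TRANSIT = "transit"


# "On-net" means customer routes plus the network's own infrastructure routes.
ON_NET = {Origin.CUSTOMER, Origin.INFRASTRUCTURE}


@dataclass(frozen=True)
class Route:
    prefix: str
    origin: Origin


def export_to_peer(routes):
    """Return only the on-net routes; peer and transit routes are withheld."""
    return [r for r in routes if r.origin in ON_NET]


table = [
    Route("192.0.2.0/24", Origin.CUSTOMER),
    Route("198.51.100.0/24", Origin.INFRASTRUCTURE),
    Route("203.0.113.0/24", Origin.PEER),
    Route("0.0.0.0/0", Origin.TRANSIT),
]
exported = export_to_peer(table)
print([r.prefix for r in exported])
```

Note that nothing in the filter refers to money: the same export rule applies whether the session is SFI or SBI.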

That is the technical property of a “peering” session. The flow of money is orthogonal to the mechanics of interconnection. If there is a contract or some financial relationship between the two networks, then it is termed either Settlement Based Interconnect (SBI) or “paid peering.” The properties of the interconnect remain unchanged.

So to sum up, I would like to use the following terms for interconnection universally:

  • Interconnection with “on-net” routes and no settlement: SFI
  • Interconnection with “on-net” routes and settlements: SBI

Innovation and Outsourcing

October 13, 2009

Risk:

The CEO of Air New Zealand had this to say on their supplier:

“We were left high and dry and this is simply unacceptable. My expectations of IBM were far higher than the amateur results that were delivered yesterday, and I have been left with no option but to ask the IT team to review the full range of options available to us to ensure we have an IT supplier whom we have confidence in and one who understands and is fully committed to our business and the needs of our customers.”

Reward:

Fake Steve Jobs had this to say:

See, those outsourcing deals always sounded so good: Why do you want to run a messy old data center anyway? We can do it for less than it costs you to do it yourself, and you can focus on your real core competence, which is running an airline.
Except, um, no. An airline’s core competence is running computers. I mean, think about it. Duh

Thing is, these guys did think about it. They knew the deal, but they did it anyway. You know why? Because they got to take a bunch of assets off their balance sheet and send a few hundred IT employees to IBM. It was an accounting maneuver, a way to dress up their financial reports, and it was especially appealing to weak companies. IBM takes your data center off your hands — and in some cases even pays you some money — and then sells it back to you as a service over the next decade.

If you are outsourcing, your cost advantage goes away, and worse, there are some things that you are never going to be able to do. One can argue that it would make the most sense for someone like Google to focus on their core competency and not waste time building servers. But not only are they building servers, the fact that they viewed it as a core competency allowed them to make things better by optimizing the whole system, including on-board batteries, which enabled datacenters without centralized UPSs.

People define core competencies far too narrowly. It is not simply that someone chose to view building servers as a core competency, it is that they saw the massive advantage to all their efforts of controlling their infrastructure destiny as an enabler and thus took it as a core competency.

Those leaps of innovation are just not going to happen if you are focusing on your “core competencies” while letting others build your infrastructure. It can be argued that at Google’s scale, servers are a core competency. No one is going to dispute that if you need 1,000 servers, you are better off using a reverse auction. But if you are a global service provider, you are not building 1,000 servers; you are, in fact, working on your core competency, a point that is not as obvious as it may appear. How are you going to avoid being a dumb pipe if you can’t even control your own infrastructure at scale?

Edit: Benjamin Black added clarification


Danger Data Loss

October 12, 2009

I have been following the news of the Microsoft/T-Mobile Danger user data loss and how it puts cloud computing in a bad light. First, I’d like to echo John Bradford: “There but for the grace of God go I.” As an operations guy first and foremost, my thoughts are with the people on the ground working this problem. I’ve slumped heartbroken in my chair more than once over backup tapes that tested fine but won’t restore. Operations are hard; systemic failure is harder and is very difficult to test for. However, there are some basic points I’d like to bring up:

  • Cloud service providers are not all the same. Like car manufacturers, various cloud providers will optimize for different things.
  • Safeguarding user data is a process and a mentality that needs to be deeply ingrained. Keeping users’ data safe should be paramount.
  • If the failure of your cloud provider causes your business to suffer, this is your problem, not your cloud provider’s. You are responsible for your uptime and you have to engineer for it.
  • Cloud storage is a convenience/risk trade-off and almost everyone will pick convenience. Humans are not very good at proper risk assessment.

If everyone in your operations team quits one day, are there procedures and processes in place that allow someone else to step in and operate the service, including the backups? Do you run tests every so often that simulate failure, including restoring data and verification that everything works?  Did you test restoring the system under peak, not steady load?

Backing up the bits is just the start of the entire operations procedure, not the end as most people have it.


Streaming Video

September 27, 2009

After seeing several red envelopes at a friend’s house the other day, I started to wonder what Netflix’s cost structure would look like if streaming video replaced sending DVDs by mail. Mindful of Andrew Tanenbaum’s adage about never underestimating the bandwidth of a station wagon full of tapes hurtling down the highway, I thought it might make sense to do a cost model and see if streaming DVDs would be as cost-effective as shipping them. This is a very simple model that does not take into account several crucial factors such as the First Sale doctrine, licensing for streaming, partnering with studios instead of sourcing DVDs, etc. Leaving those aside and focusing on the technological aspect of the cost modeling is still quite illuminating.

The model is based on the NETFLIX INC (NFLX) 10-Q filed 7/31/2009. The results were quite surprising.

The article was improved thanks to contributions by Ben Black, Randy Epstein and Alex Pilosov


Peering Policy Analysis

September 8, 2009

Peering, or Settlement-free Interconnect (SFI), is a contentious subject as can be seen here and here. Having been involved in a few SFI negotiations and disputes myself, I thought I’d write my analysis using an existing SFI policy as a vehicle for the analysis.

First, what is SFI? Simply put, it is the bilateral exchange of two service providers’ (SPs’) customer routes without payment by either side (settlement-free). A more detailed explanation can be found here.

The technical details and various modes of the peering definition could go on for quite some time, but the question at the heart of the matter is: “will provider X interconnect with me on a settlement-free basis?” Network Service Providers want to connect to other networks on a settlement-free basis because it allows them to exchange traffic for free with them, without having to pay an upstream to carry their traffic. The upstream providers do not want to interconnect on a settlement-free basis because they lose revenue.

Geoff Huston has a very good statement of what settlement-free interconnection really means.

The bottom line is that a true peer relationship is based on the supposition that either party can terminate the interconnection relationship and that the other party does not consider such an action a competitively hostile act. If one party has a high reliance on the interconnection arrangement and the other does not, then the most stable business outcome is that this reliance is expressed in terms of a service contract with the other party, and a provider/client relationship is established.

Like taking margin in the retail industry, SFI will only be granted if the benefits of interconnection outweigh the cost. It really is that simple.

With that in mind, let us take a current SFI Policy and analyze the technical aspects. To ground the discussion in reality, I will use the Comcast SFI Policy as of September 2009. It is a good example of a well-written, modern SFI policy. Comcast policy text is in blue.

Applicant must operate a US-wide IP backbone whose links are primarily 10 Gbps or greater.

This is to ensure that the applicant’s network is similar to Comcast in size and has a similar cost basis. Traffic engineering and management are simplified due to similar bandwidth on the interconnecting backbones as traffic flows tend to be of similar size. There have been people who have interconnected at 10G with the backhaul restricted to STM-1/OC-3 links, causing saturation and a poor user experience.

Applicant must meet Comcast at a minimum of four mutually agreeable geographically diverse points in the US. Interconnection points must include at least one city on the US east coast, one in the central region, and one on the US west coast, and must currently be chosen from Comcast peering points in the following list of metropolitan areas: New York City/Newark NJ, Ashburn, Atlanta, Miami, Chicago, Denver, Dallas, Los Angeles, Palo Alto/San Jose, and Seattle.

This clause ensures that the applicant’s network is similar to Comcast in scope (and has a similar cost basis) and has the same redundancy, size, and diversity of connection, which allows Comcast to easily integrate the interconnection and session management into their traffic engineering and operational procedures.

Applicant’s traffic to/from the Comcast network must be on-net only and must amount to at least 7 Gbps peak in the dominant direction. Interconnection bandwidth must be at least 10 Gbps at each interconnection point.

This requirement ensures that the network is on par with other SFI networks, making traffic engineering and operational management easier. It should be subject to change regularly based on network evolution. The only thing I would change in the requirement is to substitute average for peak. With peak and 95th percentile, a small number of samples dominate the calculation. With average, that is not the case. Peak and 95th percentile are relatively easy to game; not so with average. Any metric that allows dominance of the outcome by a small set of samples is contraindicated in peering calculations, whereas in customer/provider relationships such metrics are preferred by providers. The former situation is optimized for volume and the latter is optimized for rate.
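To illustrate why a small set of samples can dominate peak and 95th percentile but not average, here is a rough sketch. The 5-minute sampling interval, the 30-day window, the 2 Gbps baseline and the burst size are all assumptions chosen for the example, not figures from the Comcast policy; only the 7 Gbps threshold comes from the clause above.

```python
def percentile_95(samples):
    """95th percentile: sort, discard the top 5% of samples, take the highest left."""
    ordered = sorted(samples)
    # Integer ceiling of 0.95 * n, converted to a zero-based index.
    index = (95 * len(ordered) + 99) // 100 - 1
    return ordered[index]


def summarize(samples):
    """Return the three candidate metrics for one measurement window."""
    return {
        "peak": max(samples),
        "p95": percentile_95(samples),
        "average": sum(samples) / len(samples),
    }


SAMPLES_PER_MONTH = 30 * 24 * 12   # 5-minute samples over 30 days (assumed)
BASELINE_GBPS = 2.0                # assumed steady traffic level
BURST_GBPS = 8.0                   # assumed burst level, above the 7 Gbps bar
BURST_SAMPLES = 520                # just over 6% of the window

steady = [BASELINE_GBPS] * SAMPLES_PER_MONTH
gamed = ([BASELINE_GBPS] * (SAMPLES_PER_MONTH - BURST_SAMPLES)
         + [BURST_GBPS] * BURST_SAMPLES)

for name, samples in (("steady", steady), ("gamed", gamed)):
    stats = summarize(samples)
    print(name, {k: round(v, 2) for k, v in stats.items()})
```

In this made-up window, bursting for roughly 6% of the samples lifts both peak and 95th percentile from 2 Gbps to 8 Gbps, while the average only moves from 2 Gbps to about 2.36 Gbps.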

A network (ASN) that is a customer of a Comcast network for any dedicated IP services may not simultaneously be a settlement-free network peer.

This requirement has caused more confusion than any other clause, to my knowledge. Most people interpret this to mean “once a customer, always a customer, with no possibility of getting SFI in the future.” This is quite incorrect. What it actually means is that if you are a customer, you cannot simultaneously interconnect for free for on-net routes. This comes up when customers want to pay only for “off-net” traffic, and it is implemented by the provider by setting up multiple interconnections: customer routes (the on-net traffic) are announced on some interconnections and only peer (or off-net) routes on others. If the provider offers this option, there are many ways to game it. This requirement is self-defense and eliminates operational complexity.

Applicant must have a professionally managed 24×7 NOC and agree to repair or otherwise remedy any problems within a reasonable timeframe. Applicant must also agree to actively cooperate to resolve security incidents, denial of service attacks, and other operational problems.

Applicant must maintain responsive abuse contacts for reporting and dealing with UCE (Unsolicited Commercial Email), technical contact information for capacity planning and provisioning and administrative contacts for all legal notices.

This requirement ensures that there is a good point of contact that is reachable at any time, considerably simplifying technical and policy coordination between networks.

Applicant must agree to participate in joint capacity reviews at pre-set intervals and work towards timely augments as identified.

Traffic forecasting and pre-planning for capital expenditures and for metro and PoP upgrades are essential, as these take time to get deployed in the field.

Applicant must maintain a traffic scale between its network and Comcast that enables a general balance of inbound versus outbound traffic. The network cost burden for carrying traffic between networks shall be similar to justify SFI.

This is another very controversial requirement – the so-called ‘Ratio clause.’ The best way to look at it is via the Geoff Huston definition above; any other way of looking at this is doomed to failure. This requirement serves as another way to ensure that the interconnection applicant has a network of similar scale and scope to Comcast’s, with a similar cost basis as measured by the cost of carriage of a bit/mile.

Applicant must abide by the following routing policy:
Applicant must use the same peering AS at each US interconnection point and must announce a consistent set of routes at each point, unless otherwise mutually agreed.

Consistent route announcements help prevent gaming (see the ratio requirement mentioned earlier) and aid troubleshooting and traffic engineering.

No transit or third party routes are to be announced; all routes exchanged must be Applicant’s and Applicant’s customers’ routes.

If a network starts announcing transit or third party routes, those prefixes will interfere with normal routing and traffic engineering, potentially severely disrupting Internet connectivity for customers. Sending a large number of transit routes can also potentially double or triple the number of paths in the routers, causing them to run out of resources and crash.

Applicant must filter route announcements from their customers by prefix.

Customer routes are preferred in most networks, and are announced to other SFI networks as the best path to reach that customer. If the customer makes an error such as leaking another provider’s upstream routes, it can cause significant disruption, for example by making the customer look like it has the best route to that upstream provider. The wrong information may be propagated to Comcast and their SFI networks, causing traffic to be incorrectly routed.
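A minimal sketch of per-customer prefix filtering, using Python’s standard `ipaddress` module. The registered prefixes are documentation address blocks, and the `max_len` limit on more-specific announcements is an assumption for the example, not something the policy text states. The sketch handles IPv4 only.

```python
import ipaddress


def build_filter(registered_prefixes):
    """Parse the prefixes a customer is allowed to announce."""
    return [ipaddress.ip_network(p) for p in registered_prefixes]


def accept(announcement, allowed_networks, max_len=24):
    """Accept an announcement only if it falls inside a registered prefix.

    max_len is an assumed limit on how specific an IPv4 announcement may be.
    """
    net = ipaddress.ip_network(announcement)
    if net.prefixlen > max_len:
        return False
    return any(
        net.version == allowed.version and net.subnet_of(allowed)
        for allowed in allowed_networks
    )


# Hypothetical customer with two registered prefixes.
customer_filter = build_filter(["192.0.2.0/24", "198.51.100.0/24"])

announcements = [
    "192.0.2.0/24",     # registered prefix
    "203.0.113.0/24",   # leaked route that belongs to someone else
    "192.0.2.128/25",   # inside a registered prefix, but too specific
    "0.0.0.0/0",        # default route
]
for prefix in announcements:
    verdict = "accept" if accept(prefix, customer_filter) else "reject"
    print(prefix, verdict)
```

With a filter like this at the customer edge, a leaked upstream route is dropped before it can be preferred and re-announced to SFI networks.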

Neither party shall abuse the SFI network peering relationship by engaging in activities such as, but not limited to: pointing a default route at the other or otherwise forwarding traffic for destinations not explicitly advertised, resetting next-hop, selling or giving next-hop to others.
Applicant should be willing to enter into an NDA before formal discussions begin.


The abuse requirement simply says do not try to steal service by pointing a default, or faking next-hops.  The NDA requirement is quite standard when entering into negotiations for something as sensitive as SFI.

Applicant should be advised that the SFI processes will start with a 90 day trial.  On successful completion of that trial, a formal interconnect agreement will be processed.  This agreement will renew annually, subject to the then current SFI Policy.  During the year if there is a violation of the policy, the agreement and interconnections may be terminated upon written notice to the contacts specified in the agreement.

A 90-day trial to verify that the traffic, ratio and other technical conditions are satisfied is reasonable. It allows sufficient time to verify the claims for volume and ratio, but is not so long that it starts looking like a revenue generation mechanism.

Applicant shall not be permitted to offer or sell any IP transit services providing only AS7922.

This particular requirement prevents networks that meet the SFI requirements from selling cheap, direct access to the Comcast network to networks who otherwise do not meet Comcast SFI requirements.  This violates the equivalent cost basis argument for SFI.

Applicant must be financially stable.
Comcast requires that Applicants seeking SFI in the United States agree to provide reciprocal SFI arrangement with Comcast in the Applicant’s home market.

Excellent clauses. Comcast is US centric (for now). If they ever expand out to different geographies, there is a ready-made interconnection system in place.

This is a good, rigorous policy that sets out a fair, even-handed system of evaluation for SFI with Comcast. The requirements are clear, well articulated, and make technical sense, and they strike a sensible trade-off between the cost of interconnection and the value to the Comcast customer base.

Article was vastly improved thanks to editing and wordsmithing help from Ben Black.


Femtocells

August 26, 2009

Om Malik wrote an interesting piece on Femtocells and the failures in Fixed Mobile Convergence (FMC).  Quoting from the article:

According to The Wall Street Journal, femtocells aren’t doing terribly well — sales are slow and demand is weak. It’s a classic chicken-and-egg situation. Carriers are waiting for demand to go up, while folks (like me) are waiting for prices — which currently range from $100 to $250 for the device alone, plus a monthly service fee — to come down.

The rest of the article goes into some detail as to what the issues are, but what jumps out is the phrase “plus a monthly service fee.” This encapsulates precisely what I believe is wrong in the telecom world: more focus on small incremental revenues than on what service and value can be provided to make customers happy. The mobile industry is one where 15%-25% of the entire customer base churns out every year. What would it look like if the churn was an order of magnitude less? Let’s see what the benefits of a femtocell are:

  • Remove load from the spectrum allocation and tower backhaul (scarce resources)
  • Improve the customer experience
  • Possibly reduce tower density (and associated cost with rental, power, backhaul)

For all this, you expect the customer to pay you to put a femtocell in their house? How about offering customers a discount for calls made via femtocell?

Now comes the delicate balancing act of figuring out who pays for the femtocell. One option is to have customers buy them outright. Another is to sell a discounted version, but extend the contract. Asking for a monthly payment when the customer who is buying the device is unhappy with the coverage is just adding insult to injury.


Cloud Part 2

August 25, 2009

Joe Weinman wrote an article on cloud computing titled 10 Reasons Why Telcos Will Dominate Enterprise Cloud Computing. My response to that article is here. Today, I was tracerouting to  ATT.com. The results were surprising, so I did some more digging around.

vgill$ dig www.att.com

; <<>> DiG 9.4.3-P1 <<>> www.att.com
;; global options:  printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 4018
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;www.att.com.                   IN      A

;; ANSWER SECTION:
www.att.com.            254     IN      CNAME   www.att.com.edgekey.net.
www.att.com.edgekey.net. 19856  IN      CNAME   e2318.c.akamaiedge.net.
e2318.c.akamaiedge.net. 11      IN      A       96.6.249.145

;; Query time: 18 msec
;; SERVER: 68.87.76.178#53(68.87.76.178)
;; WHEN: Mon Aug 24 22:36:30 2009
;; MSG SIZE  rcvd: 115

ATT.com points to a CDN service provided by Akamai!
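The chain in the answer section can be followed mechanically. Below is a small sketch that parses the three answer records shown above (copied verbatim) and walks from the queried name to the final address; the `resolve_chain` helper is a hypothetical name, and the parser assumes the five-column layout that dig printed here rather than handling DNS output in general.

```python
# Answer section copied from the dig output above.
ANSWER = """\
www.att.com.            254     IN      CNAME   www.att.com.edgekey.net.
www.att.com.edgekey.net. 19856  IN      CNAME   e2318.c.akamaiedge.net.
e2318.c.akamaiedge.net. 11      IN      A       96.6.249.145
"""


def resolve_chain(answer_text, name):
    """Follow CNAME records from `name` until a non-CNAME record is reached."""
    records = {}
    for line in answer_text.splitlines():
        fields = line.split()
        if len(fields) != 5:
            continue  # skip anything that is not owner/ttl/class/type/data
        owner, _ttl, _cls, rtype, data = fields
        records[owner] = (rtype, data)

    chain = [name]
    seen = {name}
    while name in records:
        rtype, data = records[name]
        chain.append(data)
        if rtype != "CNAME" or data in seen:
            break  # reached an address record, or detected a loop
        seen.add(data)
        name = data
    return chain


chain = resolve_chain(ANSWER, "www.att.com.")
print(" -> ".join(chain))
```

The intermediate names in the chain (`edgekey.net`, `akamaiedge.net`) are what identify the CDN.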


Good vs Great: Hiring and Maintaining Engineering Organizations

August 16, 2009

Richard Hamming gave an interesting talk at a seminar in March 1986 called “You and Your Research.” The entire talk is worth reading and rereading, but one particular section is very relevant to my interest as a hiring manager whose goal has been to build the best possible organization. In it, Dr. Hamming poses and answers a question: if it is so easy, why do so many people, with all their talents, fail?

Well, one of the reasons is drive and commitment. The people who do great work with less ability but who are committed to it, get more done than those who have great skill and dabble in it, who work during the day and go home and do other things and come back and work the next day. They don’t have the deep commitment that is apparently necessary for really first-class work. They turn out lots of good work, but we were talking, remember, about first-class work. There is a difference. Good people, very talented people, almost always turn out good work. We’re talking about the outstanding work, the type of work that gets the Nobel Prize and gets recognition.

This ties back to a fundamental thesis that it is necessary but not sufficient to be smart. Getting things done is equally important. When hiring and building a team, you have to hire people who are sharp and can get down to the block and tackle of execution – people who are committed to seeing their work come to fruition. How do you find and hire these people? Ideally, through first-hand observation of their work – folks you have worked with directly. There is simply no interview cycle that can substitute for first-hand experience. Go over a list of people you know, pull out the contact information, call the top performers. Talk to the team, ask them to pull out some names of people they have worked with. Follow up with them. As a hiring manager, recruiting a top-notch team is top of the list. Even if you have no openings right now, keep current with what insanely great people are doing. An unexpected need may come up at any time.

This leads to my second point – for top-level design and engineering work, it is sometimes better to leave the position open than to settle and hire out of desperation, whatever the reason: “My headcount will get taken away” or “Someone is better than no one.” The penalties for bad hires will be worse than the work you have to drop because you did not have a candidate in place. Stack ranking existing work will often prove to be quite illuminating – some of the must-do projects turn out not to be so when examined with a critical eye that dispassionately judges projects on merit, stripped of the egos tied into the work. Bad hires will not only sap morale from the group and lead to people covering for work that is not executed to the expected standard, they will often frustrate your top performers. They will also sap valuable time and energy from you that could be spent more productively on the top performers.

This may be controversial, but I have always felt that the top performers should get the bulk of your attention, because they will flourish and produce disproportionately more. A simple exercise may help clarify the point:

Assume you have 1 unit of attention to spend and you will gain a result that is proportional to the ability of the person you spend it on.  Now, you can choose to spend it on someone who is 5 times better than the average person, or someone who is average. If you spend that attention on the top performer, you will gain 5 times the result you would if you spent it on the lower performer. As Joel has pointed out here, the top performers are sometimes 10x better than the average.  They should get the bulk of your time and attention. Spend the time and treasure to make your best people insanely great.

Spend time working with the lower performing people (in their current positions) to find out what roles would suit them better and then get them into those jobs where they have a better chance of flourishing. Not everyone is going to be successful in all roles; play to their strengths rather than trying to shore up their major weaknesses. This does not mean do not round them out, but do not expect someone weak in a particular area to thrive in a role that has a great deal of emphasis on abilities that are not their core strengths. The best thing for everyone involved is to align strengths with roles.

Assuming you have done the above – hired great people who can get things done, and made sure the organization is firing on all cylinders – now you need to get out of the way. Do not specify methodologies, only directions. Make sure results are measured and known. You’ve hired insanely great people, trust them to do the job. Your job at this point is to make sure they are not distracted, remove obstacles, listen to their griping, and ease the way so they can get stuff done. Oversee, make sure the big picture is communicated consistently, trust your team and keep them in the loop, occasionally nudge them back on the rails, and in cases of logjam, arbitrate fairly. One great technique I’ve picked up here is that if the team cannot decide between different approaches, a decision will be made in a week: I will pick one of the alternatives at random. This has done wonders for the team reaching a consensus.

Note that consensus doesn’t mean unanimity – waiting for unanimity is a good way to make sure that productivity goes to zero. Consensus means that once there is a decision, the entire team rallies around the decision and then focuses on getting things done. Which is what they are good at, so you’re in the clear there.

Make no mistake – this is all very hard work. Hiring, keeping your contacts in place, working with the top performers, arbitrating and keeping the way clear to allow your team to execute will not be easy. However, the reward of watching your team go above and beyond what they thought possible, grow and stretch and work miracles, will be worth every review, every meeting, every phone call at dinner time to a potential candidate in a different time zone.


Infrastructure is software

July 22, 2009

In an earlier post I mentioned that “cloud is software.” Thinking about it some more, I believe the statement can be generalized to “Infrastructure is software.” This is a bit different from how people have traditionally viewed it – Internet infrastructure is viewed as pipes, disks, CPUs, and data centers: the collection of physical units that provide pipe, storage, and compute, and the buildings that house them. My thesis is that those are necessary but not sufficient to be considered infrastructure. Those elements, in and of themselves, are just so much sunk capital. To make efficient use of them you need the correct provisioning APIs, monitoring, billing, and software primitives that abstract away the underlying systems. That abstraction decouples the various technological and business imperatives so that each layer can evolve independently within its own technological scaling domain (within reason – if you are writing ultra-high performance code, you will know the difference if you get instantiated on an Opteron vs. a Nehalem cluster).

Let’s make this concrete and think about how the above can inform the building and operations of a global service provider that has a large network, with datacenters that are used for a cloud computing business: for example, a large telecommunications company that wants to provide enterprise cloud computing among a suite of services.

Basic Axioms

All things come down to the fundamental problem of mapping demand onto a set of lower level constraints. For a telecom company, constraints at the lowest level consist of:

  1. Fiber topology (or path/Right of Ways)
  2. Forwarding capacity
  3. Power & Space
  4. Follow The Money (FTM)

Everything else is an abstraction of the above constraints. That is the good news. The bad news: everyone has the same constraints. There are no special routers available to you and not to others, and the speed of light is constant (modulo the fiber refractive index in your physical plant). So how do you differentiate yourself? Fortunately, the answers are also simple:

  • Latency
  • Cost (note I did not use price for a reason)
  • Open Networks
  • Rich connectivity
  • OSS/NMS

Latency

Latency has been well documented. Some excerpts from Velocity 2009:

Eric Schurman (Bing) and Jake Brutlag (Google Search) co-presented results from latency experiments conducted independently on each site. Bing found that a 2 second slowdown changed queries/user by -1.8% and revenue/user by -4.3%. Google Search found that a 400 millisecond delay resulted in a -0.59% change in searches/user. What’s more, even after the delay was removed, these users still had -0.21% fewer searches, indicating that a slower user experience affects long term behavior. (video, slides)

Phil Dixon, from Shopzilla, had the most takeaway statistics about the impact of performance on the bottom line. A year-long performance redesign resulted in a 5 second speed up (from ~7 seconds to ~2 seconds). This resulted in a 25% increase in page views, a 7-12% increase in revenue, and a 50% reduction in hardware. This last point shows the win-win of performance improvements, increasing revenue while driving down operating costs. (video, slides)

If you want to get into the cloud computing business, you will have to build your network and interconnection strategy to minimize latency. Your customers’ bottom line is at stake here, and by extension, so is your datacenter division’s P&L.

Cost

Sean Doran wrote “People that survive will be able to build a network at the lowest cost commensurate with their SLA.” He forgot to add: in a competitive market. Assuming you are going up against competition, this should be fairly self-evident: efficiency and razor-thin margins. The killer app is bandwidth, and this means people need to emulate Walmart ™. Learn to survive on 10% or lower margins. At those margins, your OSS/NMS are competitive advantages. Every manual touch point in the business, every support call for a delayed order, every failure in provisioning, every salesperson that sells a service that can’t be provisioned properly, nibbles at the margin. Software that can provision the network and enable fast turn-up, proper accounting, and auditing is the key.

And we react with great caution to suggestions that our poor businesses can be restored to satisfactory profitability by major capital expenditures.  (The projections will be dazzling – the advocates will be sincere – but, in the end, major additional investment in a terrible industry usually is about as rewarding as struggling in quicksand.)
-Warren Buffett

Efficiency also means fewer operational issues. Couple an ever increasing number of elements with an ever growing mass of policy and you start to lose any semblance of troubleshooting and operational simplicity. Does the network pass the 3 AM on-call test? More policy means more forwarding complexity, and that means more cost that hits your bottom line. A more insidious effect of intelligent, complex networks is that they inhibit experimentation. The theory of Real Options points out that experimentation is valuable when market uncertainty is high. Therefore, designing an architecture that fosters experimentation at the edge creates potential for greater value than centralized administration, because distributed structures promote innovation and enable experimentation at low cost. This means that putting the intelligence in the applications, rather than in the network, is a better use of capital, because otherwise applications that don’t need that robustness will end up paying for it, and this will end up making experimentation expensive.

Open Networks

Open networks strike fear into the hearts of service providers everywhere. If you are in a commodity business, how do you differentiate yourself? How about providing service that works well, cheaply? But wait a minute! Whatever happened to “climb up the value chain?” The answer is nothing. You have to decide what business you are in. Moving up the value chain and providing ever higher-touch services is in direct conflict with providing low-cost bulk bandwidth. Pick businesses that require either massive horizontal scaling or deep vertical scaling. Picking both leaves you vulnerable to more narrowly focused competitors in each segment. If horizontal scaling is central to one business, trying to fit an orthogonal model alongside it as a core business will end up annoying everyone and serving no one well. However, if the software interface to the horizontal business is exposed to the vertical high-touch side of the business, the two can be decoupled from each other and allowed to scale independently. This means provisioning, SLA reporting, billing and usage reporting are all exposed via software mechanisms.

Rich Connectivity

Let me start off by saying content is not king.

Gaming companies are making the same mistakes as the
content guys. They always over-estimate the importance of
the content and vastly underestimate the desire of users/people
to communicate with each other and share…
-Joi Ito

The Internet is a network of networks. The real value of a network is realized when it connects to other networks; more detail can be found in Metcalfe’s Law and Reed’s Law. Making interconnection with other networks harder than necessary will eventually result in isolation and a drift into irrelevance (in an open market). If people who are transiting your network to get to another network find that the interconnection between your network and their destination network is chronically congested or adds significant latency, the incentive to interconnect directly with the destination network or find another upstream becomes stronger.
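To put rough numbers on the two laws: Metcalfe’s Law ties value to the number of possible pairwise connections among n members, and Reed’s Law ties it to the number of possible subgroups of two or more. The small sketch below just counts both. The function names are mine, and treating “value” as a raw count is a simplification for illustration, not a valuation model.

```python
def metcalfe_pairs(n: int) -> int:
    """Number of distinct pairwise connections among n networks."""
    return n * (n - 1) // 2


def reed_groups(n: int) -> int:
    """Number of possible subgroups of two or more members among n networks."""
    return 2 ** n - n - 1


if __name__ == "__main__":
    # Print both counts side by side for a few network sizes.
    for n in (2, 5, 10, 20):
        print(f"n={n:>2}  pairs={metcalfe_pairs(n):>4}  groups={reed_groups(n):>8}")
```

Either way you count, each network you refuse to interconnect with removes more potential value than the one before it.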

It ain’t the metal, it ain’t the glass; it’s the wetware.
-Tony Li

OSS/NMS

Make the database authoritative for the network. This will allow for faster provisioning, consistency and auditing. You can tell authoritatively whether two buildings across the country or the world are on-net and, more importantly, whether they can be connected together and in what timeframe. This is especially true if you have made a few acquisitions with a mixture of assets. Just mashing together the list of buildings that are now on-net with the merged entity doesn’t actually tell you whether they can be connected together easily or only through several different fiber runs, patch panels, and networks. If the provisioning systems were correct, the sales folks could tell prospective customers when services could be delivered, because they’d know whether connecting two buildings involved ordering cross-connects or doing a fiber build. We provision thousands of machines automatically, so why treat thousands of routers differently? The systems that automatically provision and scale your network are hard to implement, but they can be built. It only requires the force of will to make it happen.
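As a rough sketch of what “database authoritative” buys you, assume the inventory stores every inter-building link along with whether fiber is already in place. The building names, the "lit"/"build" states and the function names below are all hypothetical; a real inventory would also track patch panels, capacity and lead times.

```python
from collections import deque

# Hypothetical inventory records: (building_a, building_b, link_state).
# "lit" means fiber is in place and only cross-connects are needed;
# "build" means new fiber construction would be required.
LINKS = [
    ("NYC-1", "NYC-2", "lit"),
    ("NYC-2", "CHI-1", "lit"),
    ("CHI-1", "SFO-1", "build"),
]


def build_graph(links):
    """Turn the link records into an adjacency map."""
    graph = {}
    for a, b, state in links:
        graph.setdefault(a, []).append((b, state))
        graph.setdefault(b, []).append((a, state))
    return graph


def connectivity(graph, src, dst):
    """Classify a building pair: cross-connect, fiber-build, no-path or off-net."""
    if src not in graph or dst not in graph:
        return "off-net"
    # First pass uses existing fiber only; second pass also allows new builds.
    for allowed, answer in (({"lit"}, "cross-connect"),
                            ({"lit", "build"}, "fiber-build")):
        seen, queue = {src}, deque([src])
        while queue:
            node = queue.popleft()
            if node == dst:
                return answer
            for neighbor, state in graph[node]:
                if state in allowed and neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
    return "no-path"


if __name__ == "__main__":
    graph = build_graph(LINKS)
    print(connectivity(graph, "NYC-1", "CHI-1"))
    print(connectivity(graph, "NYC-1", "SFO-1"))
```

With something like this behind the order form, the sales side gets an answer about cross-connects versus a fiber build before the contract is signed, not after.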

All these things give a better quality of service to the end user and are a competitive advantage in reducing OPEX and SLA payouts due to configuration errors. You can further extend your systems to do things like automatic rollbacks if you make a change and something goes wrong.
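An automatic rollback can be as simple as: snapshot, apply, verify, restore on failure. Here is a minimal sketch under the assumption that a device’s configuration can be captured and restored as a whole. The Device class is a stand-in I made up for illustration; it is not any vendor’s API.

```python
import copy


class Device:
    """Stand-in for a router; a real system would talk to the device itself."""

    def __init__(self, config):
        self.config = dict(config)

    def apply(self, change):
        self.config.update(change)


def change_with_rollback(device, change, health_check):
    """Apply a change, then restore the previous config if the check fails."""
    snapshot = copy.deepcopy(device.config)
    try:
        device.apply(change)
        healthy = bool(health_check(device))
    except Exception:
        # A failed apply or a crashing check counts as unhealthy.
        healthy = False
    if not healthy:
        device.config = snapshot  # automatic rollback
    return healthy


if __name__ == "__main__":
    router = Device({"mtu": 1500})
    ok = change_with_rollback(router, {"mtu": 576},
                              lambda d: d.config["mtu"] >= 1500)
    print(ok, router.config)
```

The health check is where the real work lives: it has to encode what “something goes wrong” means for your network.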

Software is the key no matter what your business is, if it deals with the Internet, and this will be increasingly true going forward.


Cloud

July 15, 2009

Joe Weinman wrote an article on cloud computing titled 10 Reasons Why Telcos Will Dominate Enterprise Cloud Computing. Let’s go over them point by point, but first let me point out that I think Joe is an excellent guy and a friend of mine. However, he is not going to get a free pass on this. Without further ado:

(1) Enterprise sales capability –  … Unlike their consumer or start-up counterparts, enterprise CIOs do not want to go online to initiate and manage a relationship. They want dedicated account teams collaborating closely with them and their teams for the long term, in many cases with a permanent on-site presence….

This is not a competitive advantage. Large dedicated account teams are not an insurmountable barrier to entry and, I would argue, are the easiest thing to build up. Existing relationships can only take you so far, and in the end what is really going to drive business is whether you can offer a better product for less.

(2) Lifecycle service and support — … advanced tooling for service monitoring and management; portals for network and application performance, usage monitoring and configuration and provisioning changes; and even e-bonding between enterprise systems and service provider systems.

I think this might be harder than it appears on the surface. “…portals and application performance, usage monitoring…provisioning changes.” Really? Almost all of the current billing systems, portals and application monitoring and provisioning services are outsourced. If they are outsourced, how is that a barrier for anyone? Let’s take a look at how it worked out for iPhone activation, which is arguably a core competency much closer to the heart of what a telco does than, say, large server farms. So much for e-bonding between Apple and AT&T.

(3) Reliable operations at scale — Rather than offering services that still remain in “Preview Release” or permanent “Beta” purgatory after many years to avoid any implied service reliability or feature stability commitments, service providers go through a comprehensive suite of pre-launch interoperability, certification, and scalability engineering and testing. In fact, telcos are used to engineering services for four or five nines of availability, even as they scale up to tens of millions of customers.

Joe, seriously? See the iPhone provisioning issues above. The cloud isn’t a collection of salespeople and after-sales support. Cloud is software, and you can’t build good software via contracting. Why? Because the telecom companies simply do not have the talent in-house to do what it takes, and body shops aren’t going to have really top-notch guys working for them. Back in the glory days of AT&T labs, when people like David Presotto, Rob Pike, Ken Thompson et al. were at AT&T, the telecom companies had the best software talent in the world. With the current outsourcing trend, their best talent is on the far side of an Infosys contract. And guess what, your competitors can just as easily send a check for the same work, probably to the same people, and get the same level of work. Scratch that competitive advantage. Outsourced software coders are not going to get you software tools like Amazon Dynamo, Facebook’s Hive, Microsoft’s Azure, Google’s BigTable. Any recent compute-related advancements from any telco that compare to Hive or Dynamo?

(4) SLAs with financial penalties — Not only won’t enterprises accept “Well, after all, it’s still in beta” as an excuse for service outages, they demand meaningful SLAs (service level agreements) with clear metrics for evaluating achievement of those SLAs, backed up by monitoring and management systems, and financial penalties such as credits or refunds if service levels aren’t met….

I don’t see the competitive barrier here. SLAs are an actuarial game that anyone can play. Edited to add: Ben Black’s take on SLAs here and here.
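To make “actuarial game” concrete: an SLA credit is an expected cost that gets priced into the fee like any other insurance premium. The sketch below assumes a single breach probability per month and a credit expressed as a fraction of the monthly fee. Every number in it is invented for illustration; none of it comes from a real contract.

```python
def expected_sla_cost(monthly_fee, breach_probability, credit_fraction):
    """Expected monthly credit paid out per customer."""
    return monthly_fee * breach_probability * credit_fraction


def break_even_fee(base_cost, breach_probability, credit_fraction):
    """Fee at which revenue net of expected credits equals the base cost."""
    return base_cost / (1 - breach_probability * credit_fraction)


if __name__ == "__main__":
    # Hypothetical: $1000/month service, 2% chance of a breach in any month,
    # 25% of the monthly fee credited back when a breach happens.
    print(expected_sla_cost(1000, 0.02, 0.25))
    print(break_even_fee(995, 0.02, 0.25))
```

Anyone who can estimate the breach probability for their own infrastructure can play; nothing about the arithmetic is specific to a telco.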

(5) Full enterprise solutions portfolio — …. Related services such as network access and transport, MPLS VPNs for backhauling to the enterprise datacenter, application management, global load balancing, asymmetric Web acceleration, network-based firewalls and other network-based security services, content delivery, Voice over IP, Video over IP, managed messaging, Web conferencing and remote access can offer synergies when combined with cloud computing and storage.

Global load balancing? Do we see a telecom company solution in this space? No. We see Amazon CloudFront, Brocade GSLB, Akamai, Limelight et al. Web acceleration? Quick, name the top 10 web properties in the US. They appear to be doing OK without telecom expertise in GSLB, inconceivable as it may appear. Video over IP? Hulu, YouTube. Managed messaging, web conferencing? Sametime, and WebEx by Cisco. I don’t understand this point; maybe someone can help me here.

(6) Integrated hosting and network services — … It generates cost advantages in a number of ways. First, having hosting facilities on net — that is, in the same locations as core network backbone switching and routing facilities — eliminates expenses associated with building additional access facilities to reach a third-party datacenter. Integrated providers also can access network facilities at cost, rather than at market prices. And larger providers should be able to achieve more compelling economies of scale. Having hosting facilities on net also means better performance by reducing router hops and associated physical propagation delays.

Building really large datacenters at scale for commercially competitive businesses means there are certain restrictions on where you can build them. Namely, putting a facility of a few hundred megawatts in the heart of Manhattan is going to be non-competitive. Let’s take a look at what Microsoft has been up to with their Chicago facility. To quote Mike Manos from the article:

“If I’m going to go spend $500 million on a data center and 82 percent of the cost is wrapped up in my power bill, I want to make sure I get every dollar of my 82 percent. The concrete and land are not significant compared to the cost of power.”

This means that “[eliminating] expenses associated with building additional access facilities to reach a third-party datacenter” is a non-starter compared to the power savings you get from building a large facility in the right place. Electricity is the determining factor in the placement of a facility, not core routers.

(7) Vendor independence — Service providers tend to be software and hardware vendor-agnostic. The reason for this is that their broad customer bases have wide ranges of requirements and preferences, and service providers are strategically intent on reaching as wide a market as possible. Consequently, lock-in to a specific storage, server, operating system, hypervisor, middleware, database or application vendor would be self-defeating by limiting market penetration. This contrasts with some of the existing players, who mostly seem to have at least some proprietary elements to their platforms.

Hedging all bets means you are going to revert to mediocrity. Again, Cloud is software, at scale. If you are all things to all people, you aren’t going to solve anyone’s problem to their full satisfaction. If you are going to build individual blocks for each customer, you are not going to be able to deliver the application-level scaling that you get from building a few core primitives and then scaling horizontally to the maximum. You are then in the services and solutions business, and the only leverage you are going to get is multiplexing your SG&A across a body of customers.

(8) Global footprint — It’s not news that today’s enterprises have gone global. Whether it’s a global base of employees, customers, supply chain partners, offshore contact centers or skill base for innovation, reach and footprint are critical. Large, integrated global service providers have the capability to provide services locally and consistently virtually anywhere in the world to support today’s increasingly interactive applications with proximate infrastructures that reduce response time — and with the sales and support resources to directly engage with regional or local leadership, or corporate executives headquartered anywhere from Shanghai to Dubai, Bangalore to Brussels, or Sydney to Sao Paulo.

Proximate infrastructures are a solved problem. See Akamai. I don’t buy the sales argument. Those are replicable.

(9) Financial stability and market commitment — In today’s tumultuous economic environment, enterprises are more focused than ever on the financial stability, brand and business viability of service providers providing key parts of their infrastructures. Commitment to hosting and cloud computing as part of their provider’s core business is important, as opposed to cloud services being a potentially temporary excursion from different core businesses such as online retailing or advertising. Over the last few years, high and rising stock prices have permitted some new economy players substantial flexibility in capital investments, but recent drops of fifty or sixty percent may slow such adventurism for the foreseeable future.

Online retailing and advertising are enabled by large server farms, not the other way around. Very large compute at scale and the associated software to manage that vast infrastructure is actually the core competency of those companies. Copper plant maintenance is the telecom core competency.

(10) Technologies are easier to replicate than relationships and operations — Don’t the famously highly paid developers at the new economy companies have an edge in creating new technologies such as automated provisioning that enable cloud services to rapidly scale up and down? If they do — which is arguable — it isn’t sustainable. Such technologies have been around for years from companies as small as BladeLogic and as large as IBM (e.g., Tivoli Provisioning Manager), with variations such as VMware’s vCenter and VMotion fitting into the mix. For every highly paid developer at an online bookseller, there is a highly motivated developer at a start-up or large global software firm, developing software tools for others, like integrated service providers, to incorporate into their tooling and management platforms…. Much harder to replicate are global networks that have been built for literally hundreds of billions of dollars of investment, and the experienced skill base, long-term enterprise customer relationships, management tools, support organizations, service culture, and local access and regulatory relationships that enable services to be delivered successfully at scale.

If it was just that easy, where is the pudding? Jonathan Heiliger from Facebook said, “I am not sure whether to be embarrassed or pleased for the OEM and system vendors in the audience, but you guys just don’t get it.” I’ve never heard anyone from a telecom come out and say something similar, and that is because they just don’t get it. On one side we have the iPhone fiasco. On the other side we have EC2 and S3 and Gmail and Bing and Facebook. Apparently those highly motivated developers at startups or large global software firms aren’t quite delivering for someone.