The Rise of Now

November 21, 2009

The ATLAS Internet Observatory report has some excellent observations, some of which are quoted below:

  • As category, CDNs represent close to 10% of Internet traffic
  • Web (and video over HTTP) largest and faster growing
  • Followed by P2P (which is also fastest shrinking)
  • P2P increasingly eclipsed by streaming

The salient points are that streaming is the fastest growing traffic by volume, CDNs now represent about 10% of all Internet traffic, and P2P is declining. Some of the decline of P2P can be attributed to the fact that P2P clients are now better at masquerading, encryption and localization, but the overall trend as measured by DPI techniques is still downwards. P2P has been declining while CDN and streaming media traffic, most of which is micropayment driven – either directly with money like Apple’s iTunes Store, or with attention like Pandora and YouTube – has been rising. I think this makes it clear that “now” is more important than “free.”

Trying to set up P2P is a non-trivial exercise involving firewall and home gateway modifications for port forwarding, tracker location, content hunting and so on. Centralized media delivery, on the other hand, is merely a click away – and if I have to pay 99 cents for a song, that doesn’t seem all that expensive.

In his thought-provoking presentation at Infocom 2009, Dr. Andrew Odlyzko makes a key point: The function of data networks is to satisfy human impatience, and human impatience is infinite. Andrew also points out some classic telecom dogmas (slide 6):

  • Carriers can develop innovative new services
  • Content is king
  • Voice is passe
  • Streaming real-time multimedia traffic will dominate
  • There is an urgent need for new “killer apps”

I’ve been beating that drum for a while, especially the fact that carriers can’t develop innovative new services – they don’t have the DNA to do so (see my posts here and here) – and that content is not king – connectivity is the real value. The great success story of the past few years in the telecom space has been wireless and, again, Andrew points out that the telecoms learned the wrong lessons. Paraphrasing Andrew: “The prevailing industry view is that profits resulted from tight control of wireless while losses resulted from the wild and uncontrolled Internet, while in reality success came from providing mobility for voice and simple text messaging.” I believe success was a direct outgrowth of convenience and “now.” I can get much cheaper rates on a per-minute basis using a land line, but having the ability to connect to people where and when I want is the true value of mobile. I am, once again, trading money for convenience. Micropayments for convenient minutes, to put it another way.

Ideally, things would be available now and for free, but given the trade-off space, I am willing to trade some money for immediacy. Therein lies a lesson.


A quick observation on cloud economics

November 21, 2009

I just finished reading about a panel on cloud economics and the enterprise. One quote in particular stood out:

“I’m not sure there are any unit-cost advantages that are sustainable among large enterprises”

A few years ago some friends of mine had a startup publishing medical journals online. They started off by getting two fractional DS3 lines from MCI and Sprint to their office building. In the basement were a few racks of servers and storage arrays, and it was off to the races. Today if someone came up with a plan of that nature, people would look at them funny and say “get a few racks from a colo provider.” In another few years, I think the phrase is going to change to “get the compute and storage in the cloud.” The cost argument assumes today’s practice on tomorrow’s infrastructure. Next-generation business logic jobsets are going to be written for cloud frameworks, services and primitives, which should be more aligned with cost structures that make cloud computing more efficient per unit cycle of compute or unit bit of storage.

 


On Terminology And Names

November 3, 2009

“It is important to distinguish between the concepts of an object, and the name(s) of that object. This has resulted in widespread confusion between the properties of the name, and those of the object itself.” A great line by J. Noel Chiappa and one very applicable to the process of Settlement Free Interconnection (SFI) or “peering” as it is commonly known.

When people say “peering” what they most often mean is a bi-lateral settlement free interconnection. This is a business term, not a technical term, because the properties of the interconnection are orthogonal to the business relationship that causes that interconnection to exist. What are these properties?

  1. Customer and network infrastructure routes (and only those routes) are exchanged
  2. Transit routes (peer routes, exchange routes) are not exchanged

From #1 and #2 above, only “on-net” routes are exchanged, which means that only “on-net” traffic destined for the network’s customers and infrastructure is exchanged.

That is the technical property of a “peering” session. The flow of money is orthogonal to the mechanics of interconnection. If there is a contract or some financial relationship between the two networks, then it is termed either Settlement Based Interconnect (SBI) or “paid peering.” The properties of the interconnect remain unchanged.
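The two properties above amount to an export filter on the session. The following is a minimal sketch under stated assumptions: the `Origin` labels, the `Route` type and the `export_to_peer` function are hypothetical names invented for illustration, and the prefixes are documentation ranges rather than real announcements.

```python
from dataclasses import dataclass
from enum import Enum


class Origin(Enum):
    """Where a route was learned (hypothetical labels for this sketch)."""
    CUSTOMER = "customer"
    INFRASTRUCTURE = "infrastructure"
    PEER = "peer"
    TRANSIT = "transit"


@dataclass(frozen=True)
class Route:
    prefix: str
    origin: Origin


# Only "on-net" routes are sent over a peering session (properties 1 and 2).
ON_NET = {Origin.CUSTOMER, Origin.INFRASTRUCTURE}


def export_to_peer(routes):
    """Return the routes that may be announced on a peering session."""
    return [r for r in routes if r.origin in ON_NET]


table = [
    Route("192.0.2.0/24", Origin.CUSTOMER),
    Route("198.51.100.0/24", Origin.INFRASTRUCTURE),
    Route("203.0.113.0/24", Origin.PEER),
    Route("0.0.0.0/0", Origin.TRANSIT),
]
announced = export_to_peer(table)
```

Nothing in the filter refers to money: the same function would serve an SFI session and an SBI session, which is the sense in which settlement is orthogonal to the mechanics.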

So to sum up, I would like to use the following terms for interconnection universally:

  • Interconnection with “on-net” routes and no settlement: SFI
  • Interconnection with “on-net” routes and settlements: SBI

Innovation and Outsourcing

October 13, 2009

Risk:

The CEO of Air New Zealand had this to say on their supplier:

“We were left high and dry and this is simply unacceptable. My expectations of IBM were far higher than the amateur results that were delivered yesterday, and I have been left with no option but to ask the IT team to review the full range of options available to us to ensure we have an IT supplier whom we have confidence in and one who understands and is fully committed to our business and the needs of our customers.”

Reward:

Fake Steve Jobs had this to say:

See, those outsourcing deals always sounded so good: Why do you want to run a messy old data center anyway? We can do it for less than it costs you to do it yourself, and you can focus on your real core competence, which is running an airline.
Except, um, no. An airline’s core competence is running computers. I mean, think about it. Duh

Thing is, these guys did think about it. They knew the deal, but they did it anyway. You know why? Because they got to take a bunch of assets off their balance sheet and send a few hundred IT employees to IBM. It was an accounting maneuver, a way to dress up their financial reports, and it was especially appealing to weak companies. IBM takes your data center off your hands — and in some cases even pays you some money — and then sells it back to you as a service over the next decade.

If you are outsourcing, your cost advantage is lost, and not only is your cost advantage going to go away, there are some things that you are never going to be able to do. One can argue that it would make the most sense for someone like Google to focus on their core competency, not waste time building servers. But not only are they building servers, the fact that they viewed it as a core competency allowed them to make things better by optimizing the system, including on-board batteries which enabled datacenters without centralized UPSs.

People define core competencies far too narrowly. It is not simply that someone chose to view building servers as a core competency, it is that they saw the massive advantage to all their efforts of controlling their infrastructure destiny as an enabler and thus took it as a core competency.

Those leaps of innovation are just not going to happen if you are focusing on your “core competencies” while letting others build your infrastructure. It can be argued that at Google’s scale, servers are a core competency. No one is going to dispute that if you need 1,000 servers you are better off using a reverse auction, but if you are a global service provider you are not building 1,000 servers; you are, in fact, working on your core competency, a point which does not seem as clear as it perhaps may appear. How are you going to avoid being a dumb pipe if you can’t even control your own infrastructure at scale?

Edit: Benjamin Black added clarification


Danger Data Loss

October 12, 2009

I have been following the news of the Microsoft/T-Mobile Danger user data loss and how it puts cloud computing in a bad light. First, I’d like to echo John Bradford: “There but for the grace of God go I.” As an operations guy first and foremost, my thoughts are with the people on the ground working this problem. I’ve slumped heartbroken in my chair more than once over backup tapes that tested fine but won’t restore. Operations are hard; systemic failure is harder and is very difficult to test for. However, there are some basic points I’d like to bring up:

  • Cloud service providers are not all the same. Like car manufacturers, various cloud providers will optimize for different things.
  • Safeguarding of the user data is a process and mentality that needs to be deeply ingrained. Keeping users’ data safe should be paramount.
  • If the failure of your cloud provider causes your business to suffer, this is your problem, not your cloud provider’s. You are responsible for your uptime and you have to engineer for it.
  • Cloud storage is a convenience/risk trade-off and almost everyone will pick convenience. Humans are not very good at proper risk assessment.

If everyone in your operations team quits one day, are there procedures and processes in place that allow someone else to step in and operate the service, including the backups? Do you run tests every so often that simulate failure, including restoring data and verification that everything works?  Did you test restoring the system under peak, not steady load?
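One way to make "restoring data and verification" concrete is a restore drill that unpacks a backup into a scratch location and compares content digests against the live data. This is a minimal sketch, assuming backups are plain gzip-compressed tar archives of a directory; the function names `digest_tree`, `backup` and `restore_matches` are hypothetical, and a real drill would also exercise the service on top of the restored data and do so under load.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def digest_tree(root):
    """Map each file's relative path to the SHA-256 of its contents."""
    root = Path(root)
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def backup(source, archive):
    """Write a gzip-compressed tar archive of the source directory."""
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(source, arcname=".")


def restore_matches(source, archive):
    """Restore the archive into a scratch directory and compare digests."""
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(scratch)
        return digest_tree(scratch) == digest_tree(source)
```

A backup job that reports success only tells you the bits were written; a check like `restore_matches` is the smallest step towards knowing they can be read back.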

Backing up the bits is just the start of the entire operations procedure, not the end as most people have it.


Streaming Video

September 27, 2009

After seeing several red envelopes at a friend’s house the other day I started to wonder what Netflix’s cost structure would look like if streaming video replaced sending DVDs by mail. Mindful of Andrew Tanenbaum’s adage about never underestimating the bandwidth of a station wagon full of tapes hurtling down the highway, I thought it might make sense to do a cost model and see if streaming DVDs would be as cost effective as shipping them. This is a very simple model that does not take into account several crucial factors such as the First Sale doctrine, licensing for streaming, partnering with studios instead of sourcing DVDs etc. Leaving those aside and focusing on the technological aspect of the cost modeling is still quite illuminating.

The model is based on the NETFLIX INC (NFLX) 10-Q filed 7/31/2009. The results were quite surprising.
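As a back-of-the-envelope aside on Tanenbaum’s adage (this is separate from the cost model itself), the effective bandwidth of a mailed disc can be estimated in a few lines. The sketch assumes a nominal 4.7 GB single-layer DVD; the one-day transit time and the function name `mail_bandwidth_bps` are illustrative assumptions, not figures from the 10-Q.

```python
DVD_BYTES = 4.7e9          # nominal single-layer DVD capacity
SECONDS_PER_DAY = 86_400


def mail_bandwidth_bps(discs, transit_days, bytes_per_disc=DVD_BYTES):
    """Average bits per second delivered by mailing `discs` discs."""
    return discs * bytes_per_disc * 8 / (transit_days * SECONDS_PER_DAY)


# One disc with an assumed one-day transit: about 435 kbps sustained.
one_disc = mail_bandwidth_bps(discs=1, transit_days=1)
```

The figure scales linearly with the number of discs in flight, which is why the envelope holds up better than intuition suggests once millions of discs are moving at the same time.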

The article was improved thanks to contributions by Ben Black, Randy Epstein and Alex Pilosov


Peering Policy Analysis

September 8, 2009

Peering, or Settlement-free Interconnect (SFI), is a contentious subject as can be seen here and here. Having been involved in a few SFI negotiations and disputes myself, I thought I’d write my analysis using an existing SFI policy as a vehicle for the analysis.

First, what is SFI? Simply put, it is the bilateral exchange of two service providers’ (SPs’) customer routes without payment by either side (settlement-free). A more detailed explanation can be found here.

The technical details and various modes of the peering definition could go on for quite some time, but the question at the heart of the matter is: “will provider X interconnect with me on a settlement-free basis?” Network Service Providers want to connect to other networks on a settlement-free basis because it allows them to exchange traffic for free with them, without having to pay an upstream to carry their traffic. The upstream providers do not want to interconnect on a settlement-free basis because they lose revenue.

Geoff Huston has a very good statement of what settlement-free interconnection really means.

The bottom line is that a true peer relationship is based on the supposition that either party can terminate the interconnection relationship and that the other party does not consider such an action a competitively hostile act. If one party has a high reliance on the interconnection arrangement and the other does not, then the most stable business outcome is that this reliance is expressed in terms of a service contract with the other party, and a provider/client relationship is established.

Like taking margin in the retail industry, SFI will only be granted if the benefits of interconnection outweigh the cost. It really is that simple.

With that in mind, let us take a current SFI Policy and analyze the technical aspects. To ground the discussion in reality, I will use the Comcast SFI Policy as of September 2009. It is a good example of a well-written, modern SFI policy. Each excerpt of Comcast policy text below is followed by my commentary.

Applicant must operate a US-wide IP backbone whose links are primarily 10 Gbps or greater.

This is to ensure that the applicant’s network is similar to Comcast in size and has a similar cost basis. Traffic engineering and management are simplified due to similar bandwidth on the interconnecting backbones as traffic flows tend to be of similar size. There have been people who have interconnected at 10G with the backhaul restricted to STM-1/OC-3 links, causing saturation and a poor user experience.

Applicant must meet Comcast at a minimum of four mutually agreeable geographically diverse points in the US. Interconnection points must include at least one city on the US east coast, one in the central region, and one on the US west coast, and must currently be chosen from Comcast peering points in the following list of metropolitan areas: New York City/Newark NJ, Ashburn, Atlanta, Miami, Chicago, Denver, Dallas, Los Angeles, Palo Alto/San Jose, and Seattle.

This clause ensures that the applicant’s network is similar to Comcast in scope (and has a similar cost basis) and has the same redundancy, size, and diversity of connection that allows Comcast to easily integrate the interconnection and session management into their traffic engineering and operational procedures.

Applicant’s traffic to/from the Comcast network must be on-net only and must amount to at least 7 Gbps peak in the dominant direction. Interconnection bandwidth must be at least 10 Gbps at each interconnection point.

This requirement ensures that the network is at par with other SFI networks, making traffic engineering and operational management easier.  It should be subject to change regularly based on network evolution. The only thing I would change in the requirement is to substitute average for peak. With peak and 95th percentile a small number of samples dominate the calculation.  With average, that is not the case. Peak and 95th percentile are relatively easy to game, not so with average. Any metric that allows dominance of the outcome by a small set of samples is contraindicated in peering calculations, whereas in customer/provider relationships they are preferred by providers. The former situation is optimized for volume and the latter is optimized for rate.
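The difference between the three metrics is easy to see on a toy series. This is a minimal sketch with made-up numbers: 100 hypothetical 5-minute samples in Gbps, a nearest-rank definition of the 95th percentile (billing systems differ in the exact definition), and helper names chosen for illustration.

```python
def peak(samples):
    """Largest single sample."""
    return max(samples)


def percentile_95(samples):
    """Nearest-rank 95th percentile: the value at rank ceil(0.95 * n)."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceiling of 0.95 * n
    return ordered[rank - 1]


def average(samples):
    """Arithmetic mean, which reflects total volume carried."""
    return sum(samples) / len(samples)


# Steady 1 Gbps for every sample.
baseline = [1.0] * 100
# Same series, but 10 of the 100 samples are pushed up to 8 Gbps.
gamed = [1.0] * 90 + [8.0] * 10

summary = {
    "peak": (peak(baseline), peak(gamed)),
    "p95": (percentile_95(baseline), percentile_95(gamed)),
    "average": (average(baseline), average(gamed)),
}
```

In this example, bursting during a tenth of the samples lifts both peak and 95th percentile from 1 Gbps to 8 Gbps, while the average only moves from 1 Gbps to 1.7 Gbps. A single burst sample would be enough to move the peak.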

A network (ASN) that is a customer of a Comcast network for any dedicated IP services may not simultaneously be a settlement-free network peer.

This requirement has caused more confusion than any other clause to my knowledge. Most people interpret this to mean “once a customer, always a customer, with no possibility of getting SFI in the future.” This is quite incorrect. What it actually means is that if you are a customer, you cannot simultaneously interconnect for free for on-net routes. This comes up when customers want to pay only for “off-net” traffic, which the provider would implement by setting up multiple interconnections: announcing customer routes (the on-net traffic) on some interconnections and only peer (or off-net) routes on others. If the provider offers this option there are many ways to game it. This requirement is self-defense and eliminates operational complexity.

Applicant must have a professionally managed 24×7 NOC and agree to repair or otherwise remedy any problems within a reasonable timeframe. Applicant must also agree to actively cooperate to resolve security incidents, denial of service attacks, and other operational problems.

Applicant must maintain responsive abuse contacts for reporting and dealing with UCE (Unsolicited Commercial Email), technical contact information for capacity planning and provisioning and administrative contacts for all legal notices.

This requirement ensures that there is a good point of contact that is reachable at any time, considerably simplifying technical and policy coordination between networks.

Applicant must agree to participate in joint capacity reviews at pre-set intervals and work towards timely augments as identified.

Traffic forecasting and pre-planning for capital expenditures and for metro and PoP upgrades are essential, as these take time to get deployed in the field.

Applicant must maintain a traffic scale between its network and Comcast that enables a general balance of inbound versus outbound traffic. The network cost burden for carrying traffic between networks shall be similar to justify SFI.

This is another very controversial requirement – the so-called ‘Ratio clause.’ The best way to look at it is via the Geoff Huston definition above; any other way of looking at this is doomed to failure. This requirement serves as another way to ensure that the interconnection applicant has a similar scale and scope network as Comcast, with a similar cost basis as measured by the cost of carriage of a bit/mile.

Applicant must abide by the following routing policy:
Applicant must use the same peering AS at each US interconnection point and must announce a consistent set of routes at each point, unless otherwise mutually agreed.

Consistent route announcements are useful to prevent gaming (see ratio requirement mentioned earlier), help in troubleshooting and traffic engineering.

No transit or third party routes are to be announced; all routes exchanged must be Applicant’s and Applicant’s customers’ routes.

If a network starts announcing transit or third party routes, those prefixes will interfere with normal routing and traffic engineering, potentially severely disrupting Internet connectivity for customers. Sending a large amount of transit routes can also potentially double or triple the number of paths in the routers, causing them to run out of resources and crash.

Applicant must filter route announcements from their customers by prefix.

Customer routes are preferred in most networks, and are announced to other SFI networks as the best path to reach that customer. If the customer makes an error such as leaking another provider’s upstream routes, it can cause significant disruption, for example by making the customer look like it has the best route to that upstream provider. The wrong information may then be propagated to Comcast and their SFI networks, causing traffic to be incorrectly routed.
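Filtering customer announcements by prefix can be sketched with Python’s standard `ipaddress` module. This is an illustration under assumptions, not how any particular router implements it: the function names are hypothetical, the prefixes are documentation ranges, and accepting any subnet of a registered block is a simplification (production filters usually also bound the allowed prefix length).

```python
import ipaddress


def build_filter(registered):
    """Parse the prefixes a customer is registered to announce."""
    return [ipaddress.ip_network(p) for p in registered]


def accept(announcement, allowed):
    """Accept an announced prefix only if it falls inside a registered block."""
    net = ipaddress.ip_network(announcement)
    return any(
        net.version == block.version and net.subnet_of(block)
        for block in allowed
    )


allowed = build_filter(["192.0.2.0/24"])
results = {
    "192.0.2.0/25": accept("192.0.2.0/25", allowed),        # inside the block
    "198.51.100.0/24": accept("198.51.100.0/24", allowed),  # leaked route
    "0.0.0.0/0": accept("0.0.0.0/0", allowed),              # leaked default
}
```

With a filter like this at the customer edge, a leaked upstream route is dropped before it can be preferred and re-announced to SFI networks.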

Neither party shall abuse the SFI network peering relationship by engaging in activities such as, but not limited to: pointing a default route at the other or otherwise forwarding traffic for destinations not explicitly advertised, resetting next-hop, selling or giving next-hop to others.
Applicant should be willing to enter into an NDA before formal discussions begin.


The abuse requirement simply says do not try to steal service by pointing a default, or faking next-hops.  The NDA requirement is quite standard when entering into negotiations for something as sensitive as SFI.
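The "destinations not explicitly advertised" part of the clause can be expressed as a simple membership test. A minimal sketch, assuming the set of prefixes advertised to the peer is available as a list of strings; `is_advertised` is a hypothetical helper and the addresses are documentation ranges.

```python
import ipaddress


def is_advertised(destination, advertised_prefixes):
    """True if the destination address is covered by an advertised prefix."""
    addr = ipaddress.ip_address(destination)
    return any(
        addr.version == net.version and addr in net
        for net in map(ipaddress.ip_network, advertised_prefixes)
    )


advertised = ["192.0.2.0/24", "198.51.100.0/24"]
legitimate = is_advertised("192.0.2.10", advertised)
default_abuse = is_advertised("203.0.113.5", advertised)
```

Traffic for which this test fails is what arrives when a peer points a default route at you: you are being asked to carry packets towards destinations you never offered to reach for them.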

Applicant should be advised that the SFI processes will start with a 90 day trial.  On successful completion of that trial, a formal interconnect agreement will be processed.  This agreement will renew annually, subject to the then current SFI Policy.  During the year if there is a violation of the policy, the agreement and interconnections may be terminated upon written notice to the contacts specified in the agreement.

A 90-day trial to verify that the traffic, ratio and other technical conditions are satisfied is reasonable. It allows for sufficient time to verify the claims for volume and ratio, but is not so long that it starts looking like a revenue generation mechanism.

Applicant shall not be permitted to offer or sell any IP transit services providing only AS7922.

This particular requirement prevents networks that meet the SFI requirements from selling cheap, direct access to the Comcast network to networks who otherwise do not meet Comcast SFI requirements.  This violates the equivalent cost basis argument for SFI.

Applicant must be financially stable.
Comcast requires that Applicants seeking SFI in the United States agree to provide reciprocal SFI arrangement with Comcast in the Applicant’s home market.

Excellent clauses. Comcast is US centric (for now). If they ever expand out to different geographies, there is a ready-made interconnection system in place.

This is a good, rigorous policy that sets out a fair, even-handed system of evaluation for SFI with Comcast. The requirements are clear, well articulated and make technical sense, and the policy makes a sensible trade-off between the cost of interconnection and the value to the Comcast customer base.

Article was vastly improved thanks to editing and wordsmithing help from Ben Black.


Femtocells

August 26, 2009

Om Malik wrote an interesting piece on Femtocells and the failures in Fixed Mobile Convergence (FMC).  Quoting from the article:

According to The Wall Street Journal, femtocells aren’t doing terribly well — sales are slow and demand is weak. It’s a classic chicken-and-egg situation. Carriers are waiting for demand to go up, while folks (like me) are waiting for prices — which currently range from $100 to $250 for the device alone, plus a monthly service fee — to come down.

The rest of the article goes into some details as to what the issues are, but what jumps out is the phrase “plus a monthly service fee.” This encapsulates precisely what I believe is wrong in the telecom world: more focus on small incremental revenues instead of looking at what service and value can be provided to make the customers happy. The mobile industry is one of the industries where 15%-25% of the entire customer base churns out every year. What would it look like if the churn was an order of magnitude less? Let’s see what the benefits of a femtocell are:

  • Remove load from the spectrum allocation and tower backhaul (scarce resources)
  • Improve the customer experience
  • Possibly reduce tower density (and associated cost with rental, power, backhaul)

For all this, you expect the customer to pay you to put a femtocell in their house? How about offering customers a discount for calls made via femtocell?

Now comes the delicate balancing act of figuring out who pays for the femtocell. One option is to have customers buy them outright. Another one is to sell a discounted version, but extend the contract. Asking for a monthly payment when the customer who is buying the device is unhappy with the coverage is just adding insult to injury.


Cloud Part 2

August 25, 2009

Joe Weinman wrote an article on cloud computing titled 10 Reasons Why Telcos Will Dominate Enterprise Cloud Computing. My response to that article is here. Today, I was tracerouting to ATT.com. The results were surprising, so I did some more digging around.

vgill$ dig www.att.com

; <<>> DiG 9.4.3-P1 <<>> www.att.com
;; global options:  printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 4018
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;www.att.com.                   IN      A

;; ANSWER SECTION:
www.att.com.            254     IN      CNAME   www.att.com.edgekey.net.
www.att.com.edgekey.net. 19856  IN      CNAME   e2318.c.akamaiedge.net.
e2318.c.akamaiedge.net. 11      IN      A       96.6.249.145

;; Query time: 18 msec
;; SERVER: 68.87.76.178#53(68.87.76.178)
;; WHEN: Mon Aug 24 22:36:30 2009
;; MSG SIZE  rcvd: 115

ATT.com points to a CDN service provided by Akamai!
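The chain of aliases in the answer section above can be followed mechanically. This sketch parses the three answer records copied from the dig output rather than querying DNS, so it runs offline; `parse_records` and `resolve_chain` are hypothetical helper names, and the parser assumes the simple five-column layout shown.

```python
ANSWER_SECTION = """\
www.att.com.            254     IN      CNAME   www.att.com.edgekey.net.
www.att.com.edgekey.net. 19856  IN      CNAME   e2318.c.akamaiedge.net.
e2318.c.akamaiedge.net. 11      IN      A       96.6.249.145
"""


def parse_records(text):
    """Map each owner name to its (record type, value) pair."""
    records = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 5:
            continue  # skip anything that is not name/ttl/class/type/value
        name, _ttl, _cls, rtype, value = fields
        records[name] = (rtype, value)
    return records


def resolve_chain(name, records):
    """Follow CNAME records from `name` until a non-CNAME record is reached."""
    chain = [name]
    seen = {name}
    while name in records:
        rtype, value = records[name]
        chain.append(value)
        if rtype != "CNAME" or value in seen:
            break  # reached an address record, or a CNAME loop
        seen.add(value)
        name = value
    return chain


chain = resolve_chain("www.att.com.", parse_records(ANSWER_SECTION))
```

The intermediate names are the telling part: the request leaves the att.com zone at the first hop and ends at an address under akamaiedge.net.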


Good vs Great: Hiring and Maintaining Engineering Organizations

August 16, 2009

Richard Hamming gave an interesting talk at a seminar in March 1986 called “You and Your Research.” The entire talk is worth reading and rereading, but one particular section is very relevant to my interest as a hiring manager whose goal has been to build the best possible organization. In it, Dr. Hamming poses and answers a question: If it is so easy, why do so many people, with all their talents, fail?

Well, one of the reasons is drive and commitment. The people who do great work with less ability but who are committed to it, get more done than those who have great skill and dabble in it, who work during the day and go home and do other things and come back and work the next day. They don’t have the deep commitment that is apparently necessary for really first-class work. They turn out lots of good work, but we were talking, remember, about first-class work. There is a difference. Good people, very talented people, almost always turn out good work. We’re talking about the outstanding work, the type of work that gets the Nobel Prize and gets recognition.

This ties back to a fundamental thesis that it is necessary but not sufficient to be smart. Getting things done is equally important. When hiring and building a team, you have to hire people who are sharp and can get down to the block and tackle of execution – people who are committed to seeing their work come to fruition. How do you find and hire these people? Ideally, through first-hand observation of their work – folks you have worked directly with. There is simply no interview cycle that can substitute for first-hand experience. Go over a list of people you know, pull out the contact information, call the top performers. Talk to the team, ask them to pull out some names of people they have worked with. Follow up with them. As a hiring manager, recruiting a top notch team is top of the list. Even if you have no openings right now, keep current with what insanely great people are doing. An unexpected need may come up at any time. This leads to my second point – for top level design and engineering work, it is sometimes better to leave the position open than to settle and hire out of desperation for any reason, such as “My headcount will get taken away” or “Someone is better than no one.” The penalties for bad hires will be worse than the work you have to drop because you did not have a candidate in place. Stack ranking existing work will often prove to be quite illuminating – some of the must-do projects turn out not to be so when examined with the critical eye that dispassionately judges projects on merit, stripped of egos tied into work. Bad hires will not only sap morale from the group, lead to people covering for work not being executed to the standards expected, and often frustrate your top performers; they will also sap valuable time and energy from you that could be spent more productively on the top performers.

This may be controversial, but I have always felt that the top performers should get the bulk of your attention, because they will flourish and produce disproportionately more. A simple exercise may help clarify the point:

Assume you have 1 unit of attention to spend and you will gain a result that is proportional to the ability of the person you spend it on.  Now, you can choose to spend it on someone who is 5 times better than the average person, or someone who is average. If you spend that attention on the top performer, you will gain 5 times the result you would if you spent it on the lower performer. As Joel has pointed out here, the top performers are sometimes 10x better than the average.  They should get the bulk of your time and attention. Spend the time and treasure to make your best people insanely great.

Spend time working with the lower performing people (in their current positions) to find out what roles would suit them better and then get them into those jobs where they have a better chance of flourishing. Not everyone is going to be successful in all roles; play to their strengths rather than trying to shore up their major weaknesses. This does not mean do not round them out, but do not expect someone weak in a particular area to thrive in a role that has a great deal of emphasis on abilities that are not their core strengths. The best thing for everyone involved is to align strengths with roles.

Assuming you have done the above – hired great people who can get things done, and made sure the organization is firing on all cylinders – now you need to get out of the way. Do not specify methodologies, only directions. Make sure results are measured and known. You’ve hired insanely great people, trust them to do the job. Your job at this point is to make sure they are not distracted, remove obstacles, listen to their griping, and ease the way so they can get stuff done. Oversee, make sure the big picture is communicated consistently, trust and keep your team in the loop and occasionally nudge them back on the rails and in cases of logjam – arbitrate fairly. One great technique I’ve picked up here is that if the team cannot decide between different approaches, a decision will be made in a week: I will pick one of the alternatives at random. This has done wonders for the team reaching a consensus.

Note that consensus doesn’t mean unanimity – waiting for unanimity is a good way to make sure that productivity goes to zero. Consensus means that once there is a decision, the entire team rallies around the decision and then focuses on getting things done. Which is what they are good at, so you’re in the clear there.

Make no mistake – this is all very hard work. Hiring, keeping your contacts in place, working with the top performers, arbitrating and keeping the way clear to allow your team to execute will not be easy. However, the reward of watching your team go above and beyond what they thought possible, grow and stretch and work miracles will be worth every review, every meeting, every phone call at dinner time to a potential candidate in a different time zone.