Danger Data Loss

I have been following the news of the Microsoft/T-Mobile Danger user data loss and how it puts cloud computing in a bad light. First, I’d like to echo John Bradford: “There but for the grace of God go I.” As an operations guy first and foremost, my thoughts are with the people on the ground working this problem. I’ve slumped heartbroken in my chair more than once over backup tapes that tested fine but wouldn’t restore. Operations is hard; systemic failure is harder and very difficult to test for. However, there are some basic points I’d like to bring up:

  • Cloud service providers are not all the same. Like car manufacturers, different providers optimize for different things.
  • Safeguarding user data is a process and a mentality that needs to be deeply ingrained. Keeping users’ data safe should be paramount.
  • If the failure of your cloud provider causes your business to suffer, that is your problem, not your cloud provider’s. You are responsible for your uptime, and you have to engineer for it.
  • Cloud storage is a convenience/risk trade-off and almost everyone will pick convenience. Humans are not very good at proper risk assessment.

If everyone on your operations team quit one day, are there procedures and processes in place that allow someone else to step in and operate the service, including the backups? Do you run periodic tests that simulate failure, including restoring data and verifying that everything works? Did you test restoring the system under peak load, not steady-state load?
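
To make that fire drill concrete, here is a minimal sketch of a scheduled restore test in Python: it restores the most recent backup into a scratch area and verifies file checksums against a manifest. The restore-tool command, the paths, and the manifest format are hypothetical stand-ins for whatever your backup tooling actually provides; the point is that the drill only passes once the restored data has been verified, not when the backup job reports success.

    #!/usr/bin/env python3
    """Backup fire drill sketch: restore the latest backup into a scratch area
    and verify checksums against a manifest. All names and paths below are
    hypothetical placeholders for your real tooling."""

    import hashlib
    import subprocess
    from pathlib import Path

    BACKUP_SET = "users-db-latest"                    # hypothetical backup identifier
    SCRATCH = Path("/restore-test/scratch")           # never the production path
    MANIFEST = Path("/restore-test/manifest.sha256")  # lines of "<sha256>  <relative path>"

    def restore_backup():
        # Placeholder for the real restore command (tar, pg_restore, a vendor CLI, ...).
        subprocess.run(
            ["restore-tool", "--set", BACKUP_SET, "--target", str(SCRATCH)],
            check=True,
        )

    def sha256(path):
        # Stream the file so large restored objects don't blow out memory.
        h = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def verify():
        """Return the files whose restored checksum does not match the manifest."""
        failures = []
        for line in MANIFEST.read_text().splitlines():
            expected, rel = line.split(maxsplit=1)
            restored = SCRATCH / rel
            if not restored.exists() or sha256(restored) != expected:
                failures.append(rel)
        return failures

    if __name__ == "__main__":
        restore_backup()
        bad = verify()
        if bad:
            raise SystemExit(f"FIRE DRILL FAILED: {len(bad)} files did not verify")
        print("Fire drill passed: backup restored and verified.")

Run something like this from your scheduler against production-sized data, and treat a failed drill with the same urgency as a failed backup.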

Backing up the bits is just the start of the operations procedure, not the end, though that is where most people stop.

7 Responses to Danger Data Loss

  1. Director of Front says:

    What does the Danger outage have to do with the cloud? Were they purchasing any “cloud” services from any vendors, save in the loosest sense of the term, some network-attached storage? This seems like a weak link.

    It’s easy to kick them while they’re down, and backups are always important, but I’d suggest we wait to hear from the Microsoft folk who actually have firsthand knowledge of the situation before we further speculate.

  2. tsoul says:

    the other thing is testing relentlessly. a lot of people think hard about their procedures and take the necessary precautions, but completely fail at creating valid tests that give real assurance those procedures work. the lack of effective “fire drills” in operations is a huge vulnerability across the community.

  3. But are we at a new and major step-function in data volume that we don’t quite know how to deal with? If we are storing hundreds of gigabytes and even petabytes in distributed file systems (GFS, DFS) and in BigTable/HBase-like environments, how do we do backup? Maybe Google knows how, or maybe they just have enough machines to throw at it.

    But as we know from RAID, redundancy isn’t backup. If you let loose a program that accidentally corrupts or erases the data, it’s going to do so across all that redundancy (see the sketch after these responses).

    I’m wondering if there is a best practice for this yet?

  4. Director of Plump says:

    I agree with the statement that this is not a cloud issue.

    One must wonder what was going through the minds of the operations folks involved, both during the incident and in any risk assessment meetings held before whatever work was underway. Were there steps that were considered excessive and therefore skipped? In any operations environment there is always a point where people shrug and say “nah, that shouldn’t REALLY happen”. The problem is that the unexpected does occasionally happen.

    For example, replacing a failed power supply could end with someone yanking out the working one, yet many would categorize a power supply swap as a low-risk operation.

    We don’t know all the facts, but this should serve as a reminder to keep all your bases covered and your operations staff vigilant. Never rely on any single person, and never be left in a situation you cannot easily back out of.

  5. MMC says:

    I think Vijay’s point is that Danger is being used AS a cloud by crack-berry owners and that they should be backing up their data themselves.

  6. rob rodgers says:

    I would think that a distributed, replicating copy-on-write filesystem deals well enough with the rogue program issue; it does not help with updates to the fs code itself, though.
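
To put the point from responses 3 and 6 in concrete terms: replication protects against losing a machine, not against a bad write, because the bad write is replicated everywhere too. Only a point-in-time copy that later writes cannot reach (a snapshot or an offline backup) survives it. The toy sketch below is purely an in-memory illustration, not any real storage system:

    import copy

    class ReplicatedStore:
        """Toy in-memory stand-in for a replicated data store (not a real system)."""

        def __init__(self, replicas=3):
            self.replicas = [dict() for _ in range(replicas)]
            self.snapshots = []

        def write(self, key, value):
            # Replication: every write, good or bad, lands on all replicas at once.
            for replica in self.replicas:
                replica[key] = value

        def snapshot(self):
            # Backup: a point-in-time copy that later writes cannot touch.
            self.snapshots.append(copy.deepcopy(self.replicas[0]))

    store = ReplicatedStore()
    store.write("alice", "contacts-v1")
    store.snapshot()                     # the nightly backup, taken before the bug

    store.write("alice", None)           # a rogue program "corrupts" the record

    assert all(r["alice"] is None for r in store.replicas)   # every replica is now bad
    assert store.snapshots[-1]["alice"] == "contacts-v1"     # the snapshot still has the data
    print("replicas corrupted, snapshot intact")

A distributed copy-on-write filesystem that retains old snapshots gives you the same property at the storage layer, which is presumably why it helps with the rogue-program case rob describes, while still doing nothing for bugs in the filesystem code itself.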
