I have been following the news of the Microsoft/T-Mobile Danger user data loss and how it puts cloud computing in a bad light. First, I’d like to echo John Bradford: “There but for the grace of God go I.” As an operations guy first and foremost, my thoughts are with the people on the ground working this problem. I’ve slumped heartbroken in my chair more than once over backup tapes that tested fine but wouldn’t restore. Operations are hard; systemic failure is harder still, and very difficult to test for. However, there are some basic points I’d like to bring up:
- Cloud service providers are not all the same. Like car manufacturers, various cloud providers will optimize for different things.
- Safeguarding user data is a process and a mentality that needs to be deeply ingrained. Keeping users’ data safe should be paramount.
- If the failure of your cloud provider causes your business to suffer, this is your problem, not your cloud provider’s. You are responsible for your uptime and you have to engineer for it.
- Cloud storage is a convenience/risk trade-off and almost everyone will pick convenience. Humans are not very good at proper risk assessment.
If everyone on your operations team quit one day, are there procedures and processes in place that would allow someone else to step in and operate the service, including the backups? Do you run tests every so often that simulate failure, including restoring data and verifying that everything works? Did you test restoring the system under peak load, not just steady load?
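To make the restore question concrete, here is a minimal sketch of a restore drill, assuming a plain directory backed up to a gzip-compressed tar archive. The function names (`checksum_tree`, `backup`, `restore_and_verify`) are made up for illustration, and a real service would need far more than this (databases, permissions, load, off-site copies). The idea is simply to record checksums when the backup is taken, then restore into a scratch directory and compare.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def checksum_tree(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = path.relative_to(root).as_posix()
            digests[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def backup(source: Path, archive: Path) -> dict[str, str]:
    """Archive the contents of source and return the manifest taken at backup time."""
    manifest = checksum_tree(source)
    with tarfile.open(archive, "w:gz") as tar:
        for child in sorted(source.iterdir()):
            tar.add(child, arcname=child.name)
    return manifest


def restore_and_verify(archive: Path, manifest: dict[str, str]) -> list[str]:
    """Restore into a scratch directory; return paths that are missing, extra, or altered."""
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(scratch)
        restored = checksum_tree(Path(scratch))
    every_path = manifest.keys() | restored.keys()
    return sorted(p for p in every_path if manifest.get(p) != restored.get(p))
```

An empty list from `restore_and_verify` means every file came back byte-for-byte. That only proves the bits survived; it says nothing about whether the service actually runs on the restored data, which still has to be tested separately.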
Backing up the bits is just the start of the entire operations procedure, not the end, as most people treat it.