Dragons in the Cloud: What Can We Learn from Amazon S3?


On February 28, 2017, Amazon Web Services’s servers decided to celebrate Mardi Gras in an unusual way. A problem in one of AWS’s massive data centers in northern Virginia led to widespread outages, with far-reaching consequences. Amazon’s web hosting services are among the most widely used; when those servers go down, a lot goes with them. When AWS’s Amazon S3 cloud storage suffered its outage, many websites felt the earth tremble.

As a result of Amazon S3’s troubles, some sites, such as Trello and Quora, went offline entirely, as did websites created through Wix; others, like Hubspot and GroupMe, experienced performance issues. Ironically, Isitdownrightnow.com, a website that checks if other websites are down, was also briefly shut down due to the outage.

The Internet of Things also felt the sting of the Amazon S3 outage. Users of Nest and IoT-enabled devices, including everything from smart lightbulbs and thermostats to computer mice, ran into performance issues and took to Twitter to document their surreal experiences.

The Amazon S3 outage was so bad that AWS couldn’t even update its own dashboard for its end users. Because its red warning icons were stored on the affected part of the cloud, Amazon couldn’t display the proper status icons for AWS users and defaulted to the “all clear” green status icons. After five bizarre and panic-filled hours, though, Amazon was able to reset the downed servers, and Amazon S3 fully recovered. The web was back in business.

Part of the reason the outage was so shocking (other than the fact that such a large portion of the web was affected) was that Amazon Web Services has a strong track record of avoiding incidents like this. The last outage anywhere near this magnitude happened in August 2015. Events like these are rare, and outside of them, AWS has had nearly 100% annual availability in most cases, which made February 28’s Amazon S3 outage all the more of a surprise to the thousands of sites relying on it.

No matter how reliable a service is, accidents happen, even if you’re Amazon. Unfortunately, while a sub-par cloud service provider might get you accustomed enough to outages to make redundancy plans, a service that works perfectly 99.999% of the time can lull you into a false sense of security. And so when the stuff finally does hit the fan, it catches you with your pants down. The Amazon S3 outage gives MSPs and IT consultants a timely opportunity to talk to their clients about redundancy, disaster preparedness, and the dangers of “leaving it all to the Cloud”.

What Exactly Happened to Amazon S3, Anyway?

This particular disaster was caused (like so many in cyberspace) by—of all things—a simple typo.

That morning, Amazon S3’s debug team needed to take a few servers offline to run bug checks. They mistyped the command, though, and ended up taking down more than just a few servers. In fact, they accidentally took down an entire subsystem. This subsystem happened to be the one that allowed the whole S3 cloud to perform basic data storage and retrieval tasks. It took five hours to bring all of the affected servers back online.

Whoops.

What Users Should Learn from Amazon S3

Everything we consider “the Cloud” lives somewhere on someone else’s computer. It’s not some nebulous creation of cyberspace, despite what its name may imply, though your client might not realize that from the way cloud service providers sometimes talk about what they offer. There’s hardware behind it, and lots of it. And you can take it from us here at Gillware: all hardware fails eventually.

The Cloud Is Not “Someone Else’s Problem”

In Douglas Adams’ satirical science-fiction novel Life, The Universe, and Everything, the character Slartibartfast hides his starship from prying eyes using an “SEP field generator”. In the book, rather than standing for some scientifically-dubious technobabble, “SEP” stands for “Someone Else’s Problem”. Unlike a Romulan cloaking device or Harry Potter’s invisibility cloak, the SEP field makes anybody who sees the ship simply dismiss it as “someone else’s problem” and go about their business.

The danger of the Cloud is that often it seems to be protected by an SEP field of its own. People throw their files onto Cloud-based sharing and backup services, host their websites on the Cloud, entrust their businesses to the Cloud… and it becomes “someone else’s problem”.

But the Cloud isn’t someone else’s problem. Rather, it’s someone else’s data center (not quite someone else’s computer, despite what the memes may claim).

Randall Munroe’s XKCD provides a humorous take on “the Cloud”

Of course, people expect whoever’s running that data center to take care of whatever comes up. But accidents happen. Murphy’s Law is always in effect. Something going wrong is a matter of when, not if.

People who leave everything to the Cloud aren’t just putting all their eggs in one basket. They’re putting all their eggs in someone else’s basket and trusting them not to drop it.

Out of Sight, Out of Mind, Out of Luck

Unfortunately, the expectation that “someone else will take care of it” is not unique to cloud-based services. “Out of sight, out of mind” attitudes are sadly all too common. We here at Gillware see the same story all the time with small business owners who entrust their data to consumer-grade NAS devices, for example. They buy a five-bay RAID-5 NAS device, shove it in a closet, and power it on. Their brand new Z: drive pops up, they forget about the device, and eventually it breaks down without warning, because the owner thought it would work just fine “out of the box” and never thought to check on the device or configure it to alert them when it started to fail. The next thing they know, their data’s safety rests in the hands of our NAS server data recovery specialists (thankfully, our engineers are very good at their jobs).
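To make the idea of an alert concrete, here is a minimal Python sketch of the kind of check a monitored RAID-5 device performs. It relies on one well-known property: a RAID-5 array survives the loss of a single disk but not two. The function names and the list of per-disk health flags are hypothetical, chosen for illustration; a real NAS reports disk health through its own vendor tools.

```python
from typing import List


def raid5_state(disks_healthy: List[bool]) -> str:
    """Classify a RAID-5 array from per-disk health flags (hypothetical input)."""
    if len(disks_healthy) < 3:
        raise ValueError("RAID-5 needs at least three disks")
    failed = disks_healthy.count(False)
    if failed == 0:
        return "healthy"
    if failed == 1:
        # Data is still readable, but one more failure loses the array.
        return "degraded"
    return "failed"


def should_alert(disks_healthy: List[bool]) -> bool:
    """Alert on anything other than a fully healthy array."""
    return raid5_state(disks_healthy) != "healthy"
```

For the five-bay device in the story, `raid5_state([True, True, False, True, True])` returns `"degraded"`. That is the moment the owner needs to hear about, because the array still works and nothing looks wrong from the Z: drive.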

What MSPs Should Take Away From the Amazon S3 Outage

“Our weaknesses are always evident, both to ourselves and others. But our strengths are hidden until we choose to reveal them–and that is when we are truly tested. When all that we have within is exposed, and we may no longer blame our inadequacies for our failure, but must instead depend upon our strengths to succeed … that is when the measure of a man is taken, my boy.” – James A. Owen, Here, There Be Dragons

Of course, as an MSP, you can’t do everything. You’re a consultant, after all, not a panacea offering every service your client needs like some IT version of the multi-armed Hindu god Vishnu (despite what your clients may sometimes seem to expect). You have to delegate and point your clients toward certain services based on your professional experience, whether that means pointing them toward a backup service or a good VPN, or making recommendations on a choice of web hosts.

Your client won’t be relying on you for everything. But you’re still the one consulting your client. And when things go wrong, your client will come to you and expect you to help mitigate the damage. What they care about is what you can do as a managed service provider and consultant to help them.

The Amazon S3 outage, much like the Cloudbleed scare from earlier in February, serves as another reminder to MSPs that no matter who’s at fault, helping your client get back on their feet and reducing the harm are part of your responsibilities as a managed service provider.

Remind Your Clients: Redundancy is Important

AWS has, all things considered, provided an incredibly reliable service outside of a handful of mistakes like this. To boot, the February 28 Amazon S3 outage has convinced them to put in more redundant systems to prevent, or at the very least mitigate, downtime (and potential loss of revenue) for their legion of clients and protect them from the effects of data center errors.

But however much redundancy Amazon Web Services adds in the future, you can count on Murphy’s Law to rear its ugly head sooner or later.

Any other cloud-based service carries the same risks (with some considerably less robust than Amazon). Make sure your clients are aware of the risks of leaving everything up to the Cloud and making their data “Someone Else’s Problem”. Talk to them about building more redundancy into their organization. This could mean using more than one cloud storage provider, or combining in-house data backup with a cloud backup system, for example.
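As a rough illustration of what “more than one provider” means in practice, here is a minimal Python sketch that writes each file to every available store and reads it back from the first one that responds. Everything in it is an assumption made for illustration: `InMemoryStore` is a hypothetical stand-in for a cloud bucket or an in-house backup target, and the names `put_everywhere` and `get_from_any` are not any provider’s API.

```python
from typing import Dict, List


class StoreUnavailable(Exception):
    """Raised when a store cannot serve requests right now."""


class InMemoryStore:
    """Hypothetical stand-in for a cloud bucket or an in-house backup target."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.online = True
        self._objects: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        if not self.online:
            raise StoreUnavailable(self.name)
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        if not self.online:
            raise StoreUnavailable(self.name)
        return self._objects[key]


def put_everywhere(stores: List[InMemoryStore], key: str, data: bytes) -> List[str]:
    """Write to every store; return the names of the stores that accepted it."""
    written = []
    for store in stores:
        try:
            store.put(key, data)
            written.append(store.name)
        except StoreUnavailable:
            continue  # keep going so the other copies still get written
    if not written:
        raise StoreUnavailable("no store accepted the write")
    return written


def get_from_any(stores: List[InMemoryStore], key: str) -> bytes:
    """Read from the first store that is online and holds the key."""
    for store in stores:
        try:
            return store.get(key)
        except (StoreUnavailable, KeyError):
            continue  # fall through to the next copy
    raise StoreUnavailable(f"no store could serve {key!r}")


cloud = InMemoryStore("cloud-primary")
in_house = InMemoryStore("in-house-backup")
put_everywhere([cloud, in_house], "invoice.pdf", b"invoice data")

cloud.online = False  # simulate a provider outage
recovered = get_from_any([cloud, in_house], "invoice.pdf")
```

With the cloud store marked offline, the read falls through to the in-house copy and the client keeps working. The trade-off is the one discussed here: every extra copy costs storage and upkeep, so the number of stores is a budget decision as much as a technical one.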

With proper redundancy in place, the next time a cloud-based service provider has some trouble, your client might not need to come running to you for help.

Of course, proper redundancy can be expensive. Perfect redundancy certainly is. Often, the ideal redundancy measures just aren’t in your client’s budget. But as the saying goes, “perfect” is the enemy of “good”. At the very least, you can help your client have adequate protection against service outages.

When these things happen, you can help your client get back on their feet. In the aftermath, you can help prepare your client for the next time something like February’s Amazon S3 outage happens to hopefully mitigate the damage and downtime. Talk to your client and help them ensure that they aren’t keeping too many eggs in too few baskets.