Multi-tenant cloud providers might promise greater resiliency, ‘five nines’ uptime and better security than some in-house managed infrastructure, but organisations would be wise not to assume the provider has covered all bases.
US movie streaming service Netflix, which began migrating its data centre to Amazon’s EC2 cloud in 2009, has gone well beyond Amazon’s dashboard to better understand the risks it faces.
Wanting to discover what would happen in the event of various disasters, the company has created a dozen automation tools it calls Monkeys to simulate chaos in the cloud and show what would happen to variously dependent systems in the event of “once in a blue moon” failures.
Latency Monkey, for example, simulates service degradation, Conformity Monkey finds and ousts sub-optimal instances, and Janitor Monkey hunts for wasted resources, while Security Monkey checks that SSL and DRM certificates are valid and looks for security violations or vulnerabilities.
The biggest 'monkey' is of course Chaos Gorilla, a larger-scale take on its predecessor, Chaos Monkey. As the gorilla name suggests, it simulates an outage of an entire Amazon availability zone to test whether Netflix can shift resources to another functioning zone without disrupting services.
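The idea behind a zone-outage drill can be sketched in a few lines of Python. This is only an illustration, not Netflix's tooling: the instance inventory, the zone names and the function `services_lost_if_zone_fails` are all hypothetical, and the "outage" is simulated on a list rather than on real infrastructure.

```python
from collections import defaultdict

# Hypothetical inventory: one (service, zone) pair per running instance.
INSTANCES = [
    ("api", "zone-a"), ("api", "zone-b"),
    ("streaming", "zone-a"), ("streaming", "zone-c"),
    ("billing", "zone-a"),
]

def services_lost_if_zone_fails(instances, failed_zone):
    """Return the services left with no instance if failed_zone went dark."""
    surviving = defaultdict(int)
    all_services = set()
    for service, zone in instances:
        all_services.add(service)
        if zone != failed_zone:
            surviving[service] += 1
    return sorted(s for s in all_services if surviving[s] == 0)

# Simulate losing each zone in turn and report what would break.
for zone in ("zone-a", "zone-b", "zone-c"):
    print(zone, services_lost_if_zone_fails(INSTANCES, zone))
```

In this made-up inventory, only the loss of zone-a takes a service down, because billing runs nowhere else. A real drill would go further and check that traffic actually shifts to the surviving zones.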
The company claims that the Monkeys gave it an “almost free” set of tools to automate resilience and security testing, but its efforts highlight some of the additional investments that could be required by moving infrastructure to the cloud.
And its efforts still could not prevent a two-hour disruption of services this week. Netflix advised customers between August 9 and 10 that it was experiencing problems with its streaming service, a day after an Amazon EC2 zone suffered “connectivity issues” in North America.
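To put a two-hour disruption in context against a "five nines" promise, the arithmetic is simple enough to sketch. The function name `allowed_downtime_minutes` is illustrative, and the calculation assumes downtime is measured over a 365-day year, which a given provider's service agreement may or may not do.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years

def allowed_downtime_minutes(nines):
    """Yearly downtime budget for an availability target with `nines` nines."""
    unavailability = 10 ** -nines  # e.g. five nines -> 0.00001
    return MINUTES_PER_YEAR * unavailability

for n in range(2, 6):
    target = 100 * (1 - 10 ** -n)
    print(f"{target:.3f}% -> {allowed_downtime_minutes(n):.2f} minutes per year")
```

On these assumptions, five nines leaves roughly 5.26 minutes of downtime a year and four nines roughly 52.6 minutes, so a single two-hour outage (120 minutes) would exhaust either budget on its own.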
Carlo Minassian, chief executive officer of Australian network security specialist Earthwave, was impressed with Netflix’s automation tools, since they allowed the company to take AWS cloud performance measurement into its own hands and challenge assumptions about cloud provider reliability.
“Most organisations will assume their cloud provider has security covered,” he told CSO.com.au.
“After all, doesn’t the five 9’s mean close to no downtime at all? Doesn’t that mean next to no hardware problems and no security breaches? Does your cloud provider define how they measure uptime or availability?”
Although uptime and availability mean different things for the customer, vendors often "carelessly" interchange them.
"Uptime is a measure of whether the service is actually running; availability is a measure of whether the service is running and accessible," explained Minassian.
“There are a few among us who may have suffered an outage or two on the services offered by their cloud providers.”
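Minassian's distinction between uptime and availability can be made concrete with a small sketch. The probe data below is invented, and `uptime_and_availability` is a hypothetical helper: it assumes each monitoring sample records both whether the service was running and whether a customer could actually reach it.

```python
def uptime_and_availability(samples):
    """samples: list of (service_running, reachable_by_customer) booleans."""
    total = len(samples)
    if total == 0:
        raise ValueError("no samples")
    # Uptime counts every sample where the service was running.
    running = sum(1 for is_running, _ in samples if is_running)
    # Availability counts only samples where it was running AND reachable.
    accessible = sum(
        1 for is_running, reachable in samples if is_running and reachable
    )
    return running / total, accessible / total

# Invented data: 97 good probes, 2 where the service ran but could not
# be reached (a network fault, say), and 1 where it was down outright.
samples = [(True, True)] * 97 + [(True, False)] * 2 + [(False, False)] * 1
uptime, availability = uptime_and_availability(samples)
print(f"uptime {uptime:.1%}, availability {availability:.1%}")
```

With this data the service shows 99 per cent uptime but only 97 per cent availability, which is the gap a customer would want a provider's definition to spell out.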