Web accelerator and content distribution network (CDN) provider Cloudflare has become such a critical component of the web that when it goes down, so do many websites, or at least those that don’t have a backup for the services Cloudflare provides.
This is the second major outage Cloudflare has experienced in the past week, and it affected major websites and millions of users across the US and Europe. While it only lasted 30 minutes, the incident is notable because of how many sites were affected.
The first, multi-hour outage was caused by Verizon and one of its customers due to an avoidable internet-routing blunder. Network monitoring firm ThousandEyes has a good write-up on what happened then.
This one on Tuesday, which lasted a long half an hour for some websites, was to an extent self-inflicted, but it nonetheless caused multiple websites to display “502 error” messages when users visited them.
Outage detection service Downdetector saw user reports about Cloudflare spike from none to several thousand within an hour today.
Cloudflare CTO John Graham-Cumming rushed out a short blog post giving a preliminary explanation of the problem while the company works towards a full account.
According to him, a “bad software deploy” caused a “massive spike in CPU utilisation” that briefly knocked out some of Cloudflare’s data centres, bringing down Cloudflare-dependent websites with them.
The global outage occurred at 14:57 UTC, so most Australians would not have noticed any issues, but for European and US users the problems would have been more apparent.
The company has rolled out a temporary fix and says on its status page that a “major outage impacted all Cloudflare services globally”, listing basically every country in which it has CDN infrastructure. The issue affected both primary and backup systems.
“This CPU spike was caused by a bad software deploy that was rolled back. Once rolled back the service returned to normal operation and all domains using Cloudflare returned to normal traffic levels,” explained Graham-Cumming.
The CTO also dismissed rumours that Cloudflare was under a distributed denial of service (DDoS) attack. The company provides DDoS attack protection to websites around the world but has itself been hit with massive 400Gbps DDoS attacks in the past.
While some apparent Cloudflare users, likely website operators, questioned the firm’s software update testing procedures, others pointed out that no business should be entirely dependent on a single provider.
“[Cloudflare is] a service provider, just like anyone else - You shouldn't have your business rely upon one single provider,” wrote Matthew Scully in response to Graham-Cumming’s blog.
Scully suggested that complaining customers should not “put all your eggs one basket” and should use a competing CDN service as well.
Graham-Cumming plans to publish a full post-mortem of what went wrong today and why a single bad software update could impact Cloudflare’s data centres across the whole of Asia Pacific, Europe, South America and North America.