Yesterday’s wide-scale internet outage was triggered when a single Fastly customer changed their settings, it has emerged.
The problem took place on Tuesday June 8, when Fastly, a cloud computing services company, experienced a bug on its content delivery network (CDN). This led to several major websites, including Amazon, Reddit, The Guardian and The New York Times, being forced offline for 30-40 minutes from around 11am. Additionally, specific sections of other services were affected by the failure.
The problem was resolved relatively quickly, with Fastly revealing in a tweet that it had disabled a “service configuration that triggered disruptions across our POPs globally.”
In a post on the company’s website earlier today, Nick Rockwell, senior vice president of engineering and infrastructure at Fastly, revealed that the problem occurred when one of its customers changed their settings. This exposed a bug, introduced in a software update the company issued on May 12, “that could be triggered by a specific customer configuration under specific circumstances.”
Fastly has since created a permanent fix for the bug, which was deployed at 17.25 UTC on June 8.
Rockwell acknowledged that Fastly should have anticipated the outage and said the company is currently “conducting a complete post mortem of the processes and practices we followed during this incident.”
Apologizing for the impact caused, he added: “This outage was broad and severe, and we’re truly sorry for the impact to our customers and everyone who relies on them.”
The incident has raised concerns about the resilience of the internet and, in particular, the reliance on a handful of companies to run its vast infrastructure. Tim Mackey, principal security strategist at the Synopsys CyRC, commented: “All software has bugs, and it’s not always realistic to test all deployment configurations prior to deploying a new software version. Due to the scalability present in most cloud solutions, businesses have grown accustomed to the resiliency of cloud platforms. So when a bug meets up with an untested deployment configuration in a cloud solution, you can end up with precisely the scenario that Fastly customers found themselves with – a major outage.”
However, Mackey did praise the cloud service provider’s response to the incident so far. “To their credit, the Fastly team quickly identified the issue and created a patch, but not before a number of high-profile web properties were impacted,” he outlined. “The Fastly team indicate that they will be performing a review of their release practices to determine how the bug was able to escape remediation prior to the outage. Such reviews are common within teams following the blameless review cyber-incident process used by DevOps teams. Should that review identify a weakness in development practices commonly found within DevOps teams, I would hope the Fastly team take this opportunity to highlight how other large scale organizations might improve their operations by learning from the Fastly experience.”