The Day Facebook Stood Still

Posted by Martin Bosshardt on 05 October, 2021

How SCION could have averted the 2021 Facebook, Instagram, and WhatsApp outage

On Monday, October 4th, 2021, major social media services including Facebook, Instagram, and WhatsApp were hit by a massive outage, impacting potentially tens of millions of users and businesses. The services were completely inaccessible for approximately six hours, causing disarray for businesses and for people who simply wanted to communicate with friends and family.

How Facebook disappeared from the Internet

The outage affected millions of people around the world. Outage tracking website Downdetector.com received 10.6 million problem reports from countries ranging from the United States and Europe to Colombia and Singapore, with trouble first appearing at about 15:45 GMT. Service was not restored until several hours later, in what Downdetector described as “the largest outage we’ve ever seen”.

The outage came a day after whistleblower Frances Haugen, a former Facebook product manager, exposed thousands of internal documents that she said showed the social media giant had failed to protect users. The timing led many to believe the outage was caused by a malicious attack. However, Facebook engineers have confirmed that it was caused by configuration changes on the backbone routers that coordinate network traffic between their data centers. Either way, the disruption caused a cascading failure that prevented both digital and physical fail-safe measures from coming into play. For instance, people couldn't get back into Facebook’s buildings, as the door locks needed the Internet to verify their credentials. In addition, administrators could no longer log into the data centers, because the data centers were unreachable. As a result, both physical and online access routes failed.

The effects of the outage

The outage wiped nearly $7 billion off Facebook founder Mark Zuckerberg's personal wealth in a few hours, but the effects on critical infrastructure and small businesses are still being recorded. According to American news network CBS, the outage caused critical financial losses for the many small businesses that rely on Facebook and Instagram.

The outage also had a knock-on effect on larger companies. According to The Verge, Twitter, Is It Down Right Now, and cellular providers like T-Mobile and AT&T were affected: some were overloaded with traffic, while others were falsely reported as malfunctioning. The situation also led Cloudflare to mobilize extra resources to keep up with the traffic from people trying to load Facebook (or Instagram or WhatsApp) over and over.

In total, over 3 billion Facebook users, 2 billion WhatsApp users, and 1 billion Instagram users were unable to access the services, leading many to question how reliant we are, as users and businesses, on these services and how vulnerable we are to digital disruptions. Yet there may have been a way to avoid this situation completely.

SCION - a better approach to connectivity

According to Facebook, the key flaw that caused the outage to be so severe was an unfortunate handling of an error condition. An automated system for verifying configuration values ended up causing much more damage than it fixed.

Adrian Perrig, SCION co-founder and professor at ETH Zurich, where he leads the Network Security Group, explained the scenario in more detail: “The reason why Facebook was down for over 6 hours seems to have originated from their internal BGP route management system. After a configuration change, their automated system noticed a problem and started withdrawing routes to reach their DNS servers, bringing their servers offline. This caused a cascading failure, also preventing many other systems from functioning.”

Professor Perrig, who has close to seventy thousand citations to his name and was recently recognized as a digital shaper and Swiss cybersecurity guard, has been working on Internet security research for over two decades. He and his team have been developing an alternative to the traditional Internet in the form of a SCION-based Internet architecture. He explains that such an architecture would have helped avert the Facebook outage, because SCION does not depend on BGP route management systems.

His research indicates that, with SCION, entire classes of BGP faults and errors are impossible by design. This eliminates convergence processes, inconsistent routing tables, prefix hijacking attacks, and route leaks.

“The reason why this outage would not have occurred in a SCION-based Internet is because SCION does not have the equivalent of a route withdrawal function. During a SCION network change, a new path is simply announced in addition, so all old paths and the new path can be used simultaneously. If the new path is not working, then any of the old working paths can still be used instantaneously. This fundamental difference provides much-improved availability; it also allows propagation of smaller incremental network updates instead of a high-impact change,” explains Professor Perrig.
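
The difference Professor Perrig describes can be illustrated with a minimal Python sketch. It is a toy model under stated assumptions, not the real BGP or SCION control plane: the class names WithdrawalStyleTable and AdditivePathStore, the path labels, and the is_working health check are all hypothetical and chosen only for illustration. The first class keeps a single route per destination that can be replaced or withdrawn; the second only ever adds paths, so older working paths remain available as a fallback.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class WithdrawalStyleTable:
    """Toy BGP-like table (hypothetical): one route per destination."""
    routes: Dict[str, str] = field(default_factory=dict)

    def announce(self, destination: str, path: str) -> None:
        # A new announcement replaces the previously installed route.
        self.routes[destination] = path

    def withdraw(self, destination: str) -> None:
        # After a withdrawal there is no route left for the destination.
        self.routes.pop(destination, None)

    def usable_path(self, destination: str,
                    is_working: Callable[[str], bool]) -> Optional[str]:
        path = self.routes.get(destination)
        return path if path is not None and is_working(path) else None


@dataclass
class AdditivePathStore:
    """Toy SCION-like store (hypothetical): paths are only ever added."""
    paths: Dict[str, List[str]] = field(default_factory=dict)

    def announce(self, destination: str, path: str) -> None:
        # The new path is stored alongside all previously announced paths.
        self.paths.setdefault(destination, []).append(path)

    def usable_path(self, destination: str,
                    is_working: Callable[[str], bool]) -> Optional[str]:
        # Try the newest path first, then fall back to older ones.
        for path in reversed(self.paths.get(destination, [])):
            if is_working(path):
                return path
        return None


def is_working(path: str) -> bool:
    # Assumption for this demo: the newly announced path is misconfigured.
    return path != "path-new"


# Withdrawal-style behavior: the faulty change replaces the old route,
# an automated check notices the problem and withdraws the route.
bgp_like = WithdrawalStyleTable()
bgp_like.announce("dns", "path-old")
bgp_like.announce("dns", "path-new")
if bgp_like.usable_path("dns", is_working) is None:
    bgp_like.withdraw("dns")
bgp_result = bgp_like.usable_path("dns", is_working)

# Additive behavior: the faulty path is added, but the old one still works.
scion_like = AdditivePathStore()
scion_like.announce("dns", "path-old")
scion_like.announce("dns", "path-new")
scion_result = scion_like.usable_path("dns", is_working)

print("withdrawal-style:", bgp_result)
print("additive:", scion_result)
```

In this sketch the withdrawal-style table ends up with no route to the destination, while the additive store still returns the old path. Real deployments are more involved (for example, this model ignores multiple peers, policies, and path lifetimes), so it should be read only as an illustration of the quoted argument.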

Although there are already numerous examples of BGP's shortcomings for critical uses, the Facebook outage once again exposes the weaknesses of this outdated protocol. To achieve highly reliable communication, experts like Professor Perrig and his team have been addressing the root cause of the Internet’s fundamental weakness for over a decade. Professor Perrig’s recent Tweet highlights what could well be on the minds of many cyber-experts, businesses, governments, and users alike: if BGP can cause large-scale outages even for highly redundant services like Facebook, wouldn't it be nice to have a reliable Internet infrastructure?

 

