Beyond stating the bleeding obvious, that the problem was caused by a BGP misconfiguration, something world+dog has known for more than a few hours, Facebook's Engineering and Infrastructure vice-president Santosh Janardhan said little apart from offering an apology.
The outage, the second longest in the company's history, occurred from about 2.30am AEDT to about 8.20am AEDT on Tuesday.
While Facebook, a trillion-dollar company, could not offer a decent explanation, a much smaller firm, Web infrastructure and website security company Cloudflare, published a detailed blog post two hours after the outage was resolved.
"This disruption to network traffic had a cascading effect on the way our data centres communicate, bringing our services to a halt."
October is now BGP Awareness Month.— Eva (@evacide) October 4, 2021
And then, he offered some platitudes: "Our services are now back online and we’re actively working [as opposed to passively working?] to fully return them to regular operations.
"We want to make clear at this time we believe the root cause of this outage was a faulty configuration change. We also have no evidence that user data was compromised as a result of this downtime."
Facebook's last post on Twitter was 14 hours ago. Its media blog and its so-called newsroom haven't changed from what iTWire reported earlier; the latest item on its newsroom, however, does have a very fetching picture of a woman in sunglasses!
Today is your reminder that the internet runs on BGP and DNS, two protocols older than the spice girls.— EvilMog (@Evil_Mog) October 4, 2021
If you wanna be my router, you gotta BGP Peer with my friends, nothing lasts forever, that's just the way it is.
But in the absence of any details from Facebook, here is a little more from Cloudflare's post, which one can only describe as excellent:
"BGP stands for Border Gateway Protocol. It's a mechanism to exchange routing information between autonomous systems (AS) on the Internet. The big routers that make the Internet work have huge, constantly updated lists of the possible routes that can be used to deliver every network packet to their final destinations. Without BGP, the Internet routers wouldn't know what to do, and the Internet wouldn't work.
"The Internet is literally a network of networks, and it’s bound together by BGP. BGP allows one network (say Facebook) to advertise its presence to other networks that form the Internet. As we write Facebook is not advertising its presence, ISPs and other networks can’t find Facebook’s network and so it is unavailable.
"The individual networks each have an ASN: an Autonomous System Number. An Autonomous System (AS) is an individual network with a unified internal routing policy. An AS can originate prefixes (say that they control a group of IP addresses), as well as transit prefixes (say they know how to reach specific groups of IP addresses).
"Cloudflare's ASN is AS13335. Every ASN needs to announce its prefix routes to the Internet using BGP; otherwise, no one will know how to connect and where to find us."
Someone on the Facebook recovery effort has explained that a routine BGP update went wrong, which in turn locked out those with remote access who could reverse the mistake. Those who do have physical access do not have authorization on the servers. Catch-22.— Steve Gibson (@SGgrc) October 4, 2021
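To make the announce-and-withdraw mechanics in Cloudflare's explanation concrete, here is a minimal Python sketch. It is a toy model, not real BGP and not a description of how Facebook's or Cloudflare's routers behave: the ToyRoutingTable class, the documentation-only prefixes and the reserved example AS numbers 64496 and 64497 are all assumptions made for illustration.

```python
import ipaddress


class ToyRoutingTable:
    """Toy stand-in for a router's view of announced prefixes (not real BGP)."""

    def __init__(self):
        # Maps an announced network to the AS number that originated it.
        self.routes = {}

    def announce(self, prefix, origin_asn):
        # An AS tells the rest of the network it can deliver traffic for this prefix.
        self.routes[ipaddress.ip_network(prefix)] = origin_asn

    def withdraw(self, prefix):
        # The AS stops advertising the prefix; other networks forget the route.
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        # Pick the most specific (longest) matching prefix, or None if unreachable.
        addr = ipaddress.ip_address(address)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None
        best = max(matches, key=lambda net: net.prefixlen)
        return self.routes[best]


table = ToyRoutingTable()
# Hypothetical prefixes and AS numbers drawn from ranges reserved for documentation.
table.announce("192.0.2.0/24", 64496)
table.announce("198.51.100.0/24", 64497)

before = table.lookup("192.0.2.10")   # route known: returns the origin AS
table.withdraw("192.0.2.0/24")        # the prefix is no longer advertised
after = table.lookup("192.0.2.10")    # no route left: returns None

print(before, after)
```

Once the prefix is withdrawn the lookup comes back empty, which is roughly the position other networks were in when Facebook stopped advertising its presence.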
But then one should not be surprised by Cloudflare providing such a lucid explanation when it was sorely needed; the company's chief executive, Matthew Prince, has a history of being open about routing errors that cause pain to ordinary users, even when they are caused by Cloudflare itself.
There have been a number of snafus involving BGP over the last few years: Google suffered in November 2018; Telstra took down a good part of the Internet in Australia the same month; a Verizon route leak took place in June 2019; a routing error by a Russian provider hit many big sites in April last year; and Telstra proved that one good turn deserves another by stuffing up again in October 2020.