Cloudflare makes changes to avoid repeat of 1.1.1.1 DNS outage
Internet infrastructure provider Cloudflare is making changes to avoid a repeat of a service outage that took down its popular 1.1.1.1 domain name system (DNS) resolver, affecting users globally.
The company wrote in a post-mortem that a seemingly innocuous configuration change, which went unnoticed for a month after it had been made, was behind the outage.
Cloudflare engineers had been preparing a configuration for a future Data Localisation Suite (DLS) service, aimed at meeting compliance requirements around where data traffic is routed, and accidentally included the network prefixes advertised via the Border Gateway Protocol (BGP) for the 1.1.1.1 resolver alongside those intended for the new service.
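Cloudflare has not published its internal configuration format, but as a purely hypothetical sketch, an error of this kind amounts to one service's topology entry listing another service's address prefixes. Everything below other than the public 1.1.1.0/24 and 1.0.0.0/24 resolver ranges is invented for illustration.

```python
# Hypothetical service-topology entries -- illustrative only, not Cloudflare's
# actual configuration format.
RESOLVER_PREFIXES = ["1.1.1.0/24", "1.0.0.0/24"]   # public 1.1.1.1 resolver ranges
DLS_PREFIXES = ["203.0.113.0/24"]                   # placeholder (TEST-NET-3) range

service_topology = {
    "public-resolver": {"prefixes": RESOLVER_PREFIXES, "status": "production"},
    # The pre-production DLS entry mistakenly lists the resolver's prefixes too,
    # so any later refresh of this service will also touch 1.1.1.0/24 and 1.0.0.0/24.
    "data-localisation-suite": {
        "prefixes": DLS_PREFIXES + RESOLVER_PREFIXES,   # <- the dormant error
        "status": "pre-production",
    },
}
```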
The outage was blamed on legacy systems that do not use a progressive, staged worldwide rollout and instead push configuration changes to every Cloudflare data centre at once.
Cloudflare is now implementing a plan to deprecate the legacy systems and their riskier deployment methodology.
The content delivery network and security provider will move to newer systems that use a gradual, staged deployment methodology instead.
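Cloudflare has not detailed the replacement pipeline, but a staged rollout of this kind typically applies a change to a small set of data centres first, checks health, and only then widens the deployment. The Python outline below is an illustrative sketch under those assumptions; the data centre names, stage sizes and health check are invented.

```python
import time

# Illustrative staged rollout; not Cloudflare's actual deployment tooling.
DATACENTRES = [f"dc-{i:03d}" for i in range(1, 301)]
STAGES = [0.01, 0.05, 0.25, 1.0]   # fraction of the fleet covered at each stage

def healthy(dc: str) -> bool:
    """Placeholder health probe; a real pipeline would watch query success rates."""
    return True

def staged_rollout(change: str) -> None:
    deployed: set[str] = set()
    for fraction in STAGES:
        target = DATACENTRES[: max(1, int(len(DATACENTRES) * fraction))]
        batch = [dc for dc in target if dc not in deployed]
        deployed.update(batch)
        print(f"stage {fraction:.0%}: {change!r} now on {len(deployed)} data centres")
        if not all(healthy(dc) for dc in deployed):
            print("health check failed -- halting before the change goes global")
            return
        time.sleep(1)  # soak time between stages (illustrative)
    print("rollout complete")

staged_rollout("dls-topology-update")
```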
Error lay dormant for a month
The misconfiguration caused an outage that lasted 62 minutes, from 7:52am to 8:54am AEST, affecting the majority of 1.1.1.1 users worldwide and causing intermittent degradation of Cloudflare’s Gateway DNS service.
BGP serves as the internet’s routing system, allowing networks to advertise which Internet Protocol addresses they can reach.
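As a rough illustration of that idea, the toy model below (not a real BGP implementation) keeps a table of advertised prefixes and answers a reachability query with a longest-prefix match, the rule routers use to pick the most specific route.

```python
import ipaddress

# Toy model of BGP-style reachability: networks "advertise" prefixes, and a
# lookup picks the most specific advertised prefix covering an address.
advertised = {
    "1.1.1.0/24": "AS13335",   # Cloudflare's public resolver prefix
    "1.0.0.0/24": "AS13335",
    "8.8.8.0/24": "AS15169",   # another public resolver, for contrast
}

def route_for(address: str):
    ip = ipaddress.ip_address(address)
    matches = [
        (ipaddress.ip_network(prefix), origin)
        for prefix, origin in advertised.items()
        if ip in ipaddress.ip_network(prefix)
    ]
    if not matches:
        return None
    # Longest-prefix match: the most specific route wins.
    return max(matches, key=lambda m: m[0].prefixlen)

print(route_for("1.1.1.1"))   # -> (IPv4Network('1.1.1.0/24'), 'AS13335')
```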
With the DLS service not yet in production, no immediate impact or alerts appeared from the misconfiguration.
However, the dormant misconfiguration was triggered in July when engineers made another change to the same DLS service, adding an offline test location to the service topology.
That change triggered a global refresh of network configuration, which inadvertently included the 1.1.1.1 resolver prefixes due to the earlier error.
Cloudflare’s systems began withdrawing the 1.1.1.1 resolver prefixes from production data centres globally, which had the effect of making the service unreachable via BGP routing, causing outages for users.
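Continuing the same toy model, withdrawing a prefix from the advertised set is enough to leave every address inside it without a route, which is effectively what happened to the resolver's prefixes.

```python
import ipaddress

# Toy routing table: withdrawing the resolver prefixes leaves 1.1.1.1 unroutable.
advertised = {"1.1.1.0/24", "1.0.0.0/24", "8.8.8.0/24"}

def reachable(address: str) -> bool:
    ip = ipaddress.ip_address(address)
    return any(ip in ipaddress.ip_network(p) for p in advertised)

print(reachable("1.1.1.1"))          # True while the prefix is advertised

# A configuration refresh that withdraws the resolver prefixes everywhere:
for prefix in ("1.1.1.0/24", "1.0.0.0/24"):
    advertised.discard(prefix)

print(reachable("1.1.1.1"))          # False -- queries to the resolver now fail
```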
The company said the July 1.1.1.1 outage was not caused by BGP hijacking, which occurs when a network wrongly or maliciously advertises routes to IP addresses it does not control.
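For contrast, a hijack in the same toy model would look like a rogue network advertising a more specific prefix, which wins the longest-prefix match and diverts traffic; the origin names below are illustrative.

```python
import ipaddress

# Toy hijack: a rogue origin advertises a more specific prefix, so the
# longest-prefix match sends 1.1.1.1 traffic to the wrong network.
advertised = {
    "1.1.1.0/24": "AS13335",    # legitimate Cloudflare advertisement
    "1.1.1.0/25": "AS-ROGUE",   # hypothetical hijacker's more-specific route
}

def origin_for(address: str) -> str:
    ip = ipaddress.ip_address(address)
    covering = [
        (ipaddress.ip_network(prefix), asn)
        for prefix, asn in advertised.items()
        if ip in ipaddress.ip_network(prefix)
    ]
    return max(covering, key=lambda m: m[0].prefixlen)[1]

print(origin_for("1.1.1.1"))   # -> 'AS-ROGUE': traffic is diverted
```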
Cloudflare’s 1.1.1.1 resolver was launched in 2018 and handles over a trillion queries from more than 250 economies.
In June last year, Cloudflare’s 1.1.1.1 resolver became unreachable for 300 networks in 80 countries due to a mix of a BGP hijack and a route leak.