Cloudflare blames recent outage on BGP hijacking incident


Internet giant Cloudflare reports that its DNS resolver service, 1.1.1.1, was recently unreachable or degraded for some of its customers because of a combination of Border Gateway Protocol (BGP) hijacking and a route leak.

The incident occurred last week and affected 300 networks in 70 countries. Despite these numbers, the company says that the impact was “quite low” and in some countries users did not even notice it.

Incident details

Cloudflare says that at 18:51 UTC on June 27, Eletronet S.A. (AS267613) began announcing the 1.1.1.1/32 IP address to its peers and upstream providers.

[Image: Hijack. Source: Cloudflare]

This incorrect announcement was accepted by multiple networks, including a Tier 1 provider, which treated it as a Remote Triggered Blackhole (RTBH) route.

The hijack occurred because BGP routing favors the most specific route. AS267613’s announcement of 1.1.1.1/32 was more specific than Cloudflare’s 1.1.1.0/24, leading networks to incorrectly route traffic to AS267613.
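The "most specific route wins" rule can be illustrated with a small sketch. The two-entry table and the labels attached to each prefix are assumptions made for illustration; a real router holds far more state and applies many other tie-breakers.

```python
import ipaddress

# Hypothetical two-entry table: prefix -> who announced it (labels are illustrative).
routes = {
    ipaddress.ip_network("1.1.1.0/24"): "Cloudflare",
    ipaddress.ip_network("1.1.1.1/32"): "AS267613",
}


def best_route(destination, table):
    """Return (prefix, announcer) for the longest prefix covering destination, or None."""
    addr = ipaddress.ip_address(destination)
    # Keep only the prefixes that actually contain the destination address.
    matches = [net for net in table if addr in net]
    if not matches:
        return None
    # Longest-prefix match: the entry with the largest prefix length wins.
    best = max(matches, key=lambda net: net.prefixlen)
    return best, table[best]


print(best_route("1.1.1.1", routes))  # the /32 entry is selected
print(best_route("1.1.1.2", routes))  # only the /24 covers this address
```

In this sketch only the exact address 1.1.1.1 is pulled toward the /32; neighboring addresses in the /24 still follow the legitimate route.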

Consequently, traffic intended for Cloudflare’s 1.1.1.1 DNS resolver was blackholed or rejected, making the service unavailable for some users.

One minute later, at 18:52 UTC, Nova Rede de Telecomunicações Ltda (AS262504) erroneously leaked 1.1.1.0/24 upstream to AS1031, which propagated it further, affecting global routing.

[Image: Leak. Source: Cloudflare]

This leak altered the normal BGP routing paths, causing traffic destined for 1.1.1.1 to be misrouted, compounding the hijacking problem and causing additional reachability and latency problems.

Cloudflare identified the problems at around 20:00 UTC and resolved the hijack roughly two hours later. The route leak was resolved at 02:28 UTC on June 28.

Remediation effort

Cloudflare’s first line of response was to engage with the networks involved in the incident while also disabling peering sessions with all problematic networks to mitigate the impact and prevent further propagation of incorrect routes.

The company explains that the incorrect announcements didn’t affect its internal network routing because it had adopted Resource Public Key Infrastructure (RPKI), which caused the invalid routes to be rejected automatically.
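As a rough illustration of how RPKI-based origin validation can reject such an announcement, here is a minimal sketch of the valid / invalid / not-found decision in the spirit of RFC 6811. The ROA entry, including the origin AS number 13335 and the maximum length of 24, is an illustrative assumption and was not taken from the article or from live RPKI data. The `ROA` type and `validate_origin` function are hypothetical names.

```python
import ipaddress
from typing import NamedTuple


class ROA(NamedTuple):
    """Simplified Route Origin Authorization (hypothetical structure)."""
    prefix: ipaddress.IPv4Network
    max_length: int
    asn: int


# Illustrative ROA set; values are assumptions, not real RPKI data.
roas = [ROA(ipaddress.ip_network("1.1.1.0/24"), 24, 13335)]


def validate_origin(prefix, origin_asn, roa_set):
    """Classify an announcement as 'valid', 'invalid', or 'not-found'."""
    route = ipaddress.ip_network(prefix)
    # A ROA covers the route if the route's prefix sits inside the ROA prefix.
    covering = [
        r for r in roa_set
        if r.prefix.version == route.version and route.subnet_of(r.prefix)
    ]
    if not covering:
        return "not-found"
    for r in covering:
        # Valid only if the origin AS matches and the prefix is not too specific.
        if r.asn == origin_asn and route.prefixlen <= r.max_length:
            return "valid"
    return "invalid"


print(validate_origin("1.1.1.1/32", 267613, roas))  # invalid
print(validate_origin("1.1.1.0/24", 13335, roas))   # valid
```

Note that this check looks only at the origin AS and the prefix length. It says nothing about the rest of the AS path, which is the gap that path-validation proposals such as ASPA aim to close.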

Long-term solutions Cloudflare presented in its postmortem write-up include:

  • Enhance route leak detection systems by incorporating more data sources and integrating real-time data points.
  • Promote the adoption of Resource Public Key Infrastructure (RPKI) for Route Origin Validation (ROV).
  • Promote the adoption of the Mutually Agreed Norms for Routing Security (MANRS) principles, which include rejecting invalid prefix lengths and implementing robust filtering mechanisms.
  • Encourage networks to reject IPv4 prefixes longer than /24 in the Default-Free Zone (DFZ).
  • Advocate for deploying ASPA objects (currently drafted by the IETF), which are used to validate the AS path in BGP announcements.
  • Explore the potential of implementing RFC 9234 and Discard Origin Authorization (DOA).
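The prefix-length recommendation in the list above is simple enough to sketch. The following is a minimal, hypothetical import filter; the function name and the constant are assumptions for illustration, and production routers express this kind of policy in their own configuration language rather than in Python.

```python
import ipaddress

# Longest IPv4 prefix length accepted in this sketch, per the /24 recommendation.
MAX_IPV4_PREFIX_LEN = 24


def accept_ipv4_prefix(prefix):
    """Return True if an IPv4 prefix is no more specific than /24."""
    # IPv4Network raises an error for malformed or non-IPv4 input.
    net = ipaddress.IPv4Network(prefix)
    return net.prefixlen <= MAX_IPV4_PREFIX_LEN


print(accept_ipv4_prefix("1.1.1.0/24"))  # True
print(accept_ipv4_prefix("1.1.1.1/32"))  # False
```

Under such a filter, a /32 announcement like the one in this incident would be dropped on import regardless of who originated it.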


