Massive Microsoft 365 outage caused by WAN router IP change


Microsoft says this week’s five-hour-long worldwide Microsoft 365 outage was caused by a router IP address change that led to packet forwarding issues across the other routers in its Wide Area Network (WAN).

Redmond said at the time that the outage resulted from DNS and WAN networking configuration issues caused by a WAN update and that users across all regions serviced by the impacted infrastructure were having problems accessing the affected Microsoft 365 services.

The issue led to service impact in waves, peaking approximately every 30 minutes as shared on the Microsoft Azure service status page (this status page was also affected as it intermittently displayed “504 Gateway Time-out” errors).

The list of services impacted by the outage included Microsoft Teams, Exchange Online, Outlook, SharePoint Online, OneDrive for Business, Power BI, Microsoft 365 Admin Center, Microsoft Graph, Microsoft Intune, Microsoft Defender for Cloud Apps, and Microsoft Defender for Identity.

In all, it took Redmond over five hours to address the issue, from 07:05 UTC, when it started investigating, until 12:43 UTC, when service was restored.

“Between 07:05 UTC and 12:43 UTC on 25 January 2023, customers experienced issues with networking connectivity, manifesting as long network latency and/or timeouts when attempting to connect to resources hosted in Azure regions, as well as other Microsoft services including Microsoft 365 and Power Platform,” Microsoft said in a preliminary post-incident report published today.

“While most regions and services had recovered by 09:00 UTC, intermittent packet loss issues were fully mitigated by 12:43 UTC. This incident also impacted Azure Government cloud services that were dependent on Azure public cloud.”

Microsoft has now also revealed that the issue was triggered when changing the IP address of a WAN router using a command that had not been thoroughly vetted and that behaves differently on different network devices.

“As part of a planned change to update the IP address on a WAN router, a command given to the router caused it to send messages to all other routers in the WAN, which resulted in all of them recomputing their adjacency and forwarding tables,” Microsoft said.

“During this re-computation process, the routers were unable to correctly forward packets traversing them.”

While the network began recovering on its own starting at 08:10 UTC, the automated systems responsible for maintaining the health of the wide area network (WAN) paused due to the impact on the network. 

These systems included those for identifying and eliminating unhealthy devices as well as traffic engineering systems for optimizing data flow across the network. 

As a result of the pause, some network paths continued experiencing increased packet loss from 09:35 UTC until the systems were manually restarted, returning the WAN to optimal operating conditions and completing the recovery process at 12:43 UTC.
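
To put the timeline in perspective, the short Python sketch below computes how long each phase lasted using only the timestamps quoted in Microsoft's report. The phase labels and the helper names (`utc`, `elapsed_minutes`) are illustrative assumptions, not part of the report.

```python
from datetime import datetime, timezone

def utc(hhmm: str) -> datetime:
    """Build a UTC timestamp on 25 January 2023 from an 'HH:MM' string."""
    hour, minute = map(int, hhmm.split(":"))
    return datetime(2023, 1, 25, hour, minute, tzinfo=timezone.utc)

def elapsed_minutes(start: str, end: str) -> int:
    """Whole minutes between two 'HH:MM' UTC times on the day of the incident."""
    return int((utc(end) - utc(start)).total_seconds() // 60)

# Phase labels are this sketch's own wording; the times come from the report.
phases = {
    "start of impact to first self-recovery": ("07:05", "08:10"),
    "start of impact to most services recovered": ("07:05", "09:00"),
    "residual packet loss until manual restart": ("09:35", "12:43"),
    "total incident window": ("07:05", "12:43"),
}

for label, (start, end) in phases.items():
    minutes = elapsed_minutes(start, end)
    print(f"{label}: {minutes // 60}h {minutes % 60:02d}m")
```

By this count the full incident window spans 5 hours and 38 minutes, of which a little over three hours fall in the period when the paused automation left some paths with elevated packet loss.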

Following this incident, Microsoft says that it’s now blocking highly impactful commands from being executed and that it will also require all command execution to follow guidelines for safe configuration changes.
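
Microsoft has not published how this blocking is implemented. Purely as an illustration, a pre-execution check of the kind described could look like the minimal Python sketch below. The blocked prefix, the `Decision` type, and the `review_command` function are all hypothetical, since the report does not name the command involved or the tooling used.

```python
from dataclasses import dataclass

# Hypothetical placeholder; the report does not name the actual command.
BLOCKED_PREFIXES = ("example-high-impact",)

@dataclass
class Decision:
    allowed: bool
    reason: str

def review_command(command: str, vetted_on_device_type: bool) -> Decision:
    """Decide whether a configuration command may run on a device.

    A command is rejected if it matches a blocked high-impact prefix, or if
    it has not been vetted on the target device type (the report notes that
    the command behaved differently on different network devices).
    """
    # Normalize case and whitespace so trivial variations do not slip through.
    normalized = " ".join(command.lower().split())
    if normalized.startswith(BLOCKED_PREFIXES):
        return Decision(False, "blocked: high-impact command")
    if not vetted_on_device_type:
        return Decision(False, "blocked: not vetted on this device type")
    return Decision(True, "allowed")

print(review_command("EXAMPLE-HIGH-IMPACT  reset", vetted_on_device_type=True))
print(review_command("show status", vetted_on_device_type=False))
print(review_command("show status", vetted_on_device_type=True))
```

The sketch checks the block list first, so a high-impact command is refused even when it has been marked as vetted.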




