On July 30, 2024, Microsoft experienced a significant global outage affecting its Azure cloud services and Microsoft 365 products. The incident, which lasted roughly eight hours by Microsoft's own timeline, was triggered by a Distributed Denial-of-Service (DDoS) attack and impacted users worldwide.
The outage began at approximately 11:45 UTC and was resolved by 19:43 UTC. During this period, users reported difficulties accessing various Microsoft services, including Azure App Services, Application Insights, Azure IoT Central, Azure Log Search Alerts, Azure Policy, the Azure portal, and several Microsoft 365 and Microsoft Purview services.
Microsoft confirmed that the initial trigger was a DDoS attack, which caused an unexpected usage spike. This surge overwhelmed Azure Front Door (AFD) components and Azure Content Delivery Network (CDN), leading to intermittent errors, timeouts, and latency spikes.
A flaw in Microsoft’s own defenses made the attack’s impact worse. The company stated, “While the initial trigger event was a Distributed Denial-of-Service (DDoS) attack, initial investigations suggest that an error in the implementation of our defenses amplified the impact of the attack rather than mitigating it.”
Microsoft’s response included implementing networking configuration changes and performing failovers to alternate networking paths. The initial mitigation efforts successfully addressed the majority of the impact by 14:10 UTC. However, some customers continued to experience less than 100% availability until around 18:00 UTC.
The tech giant then proceeded with an updated mitigation approach, rolling it out first across regions in Asia Pacific and Europe, followed by the Americas. Failure rates returned to pre-incident levels by 19:43 UTC, with full mitigation declared at 20:48 UTC.
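For readers who want to check the timeline, the elapsed times implied by the reported timestamps can be worked out with a short script. This is a minimal sketch that uses only the times quoted in this article; the helper names (`utc`, `minutes_since_start`, `format_elapsed`) and the milestone labels are illustrative, not Microsoft's terminology.

```python
from datetime import datetime, timezone


def utc(hhmm: str) -> datetime:
    """Parse an HH:MM string as a UTC time on July 30, 2024."""
    hour, minute = map(int, hhmm.split(":"))
    return datetime(2024, 7, 30, hour, minute, tzinfo=timezone.utc)


# Milestones as reported in the article (all UTC). Labels are hypothetical.
MILESTONES = {
    "impact_start": utc("11:45"),
    "initial_mitigation": utc("14:10"),
    "residual_impact_until": utc("18:00"),
    "failure_rates_normal": utc("19:43"),
    "full_mitigation_declared": utc("20:48"),
}


def minutes_since_start(name: str) -> int:
    """Whole minutes between the start of impact and the named milestone."""
    delta = MILESTONES[name] - MILESTONES["impact_start"]
    return int(delta.total_seconds() // 60)


def format_elapsed(minutes: int) -> str:
    """Render a minute count as hours and minutes, e.g. 145 as '2h 25m'."""
    hours, mins = divmod(minutes, 60)
    return f"{hours}h {mins:02d}m"


if __name__ == "__main__":
    for name in MILESTONES:
        print(f"{name}: +{format_elapsed(minutes_since_start(name))}")
```

By this arithmetic, the initial mitigation landed about two and a half hours after impact began, failure rates normalized just under eight hours in, and full mitigation was declared about nine hours after the start.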
This incident follows a series of recent outages affecting Microsoft’s services. Less than two weeks earlier, a faulty update to CrowdStrike’s Falcon agent caused Windows virtual machines to crash with Blue Screen of Death (BSOD) errors. These recurring issues have raised concerns about cloud infrastructure resilience and the risks of relying on centralized services.
The outage had widespread effects, impacting various businesses globally. For instance, Starbucks in the US had to disable its mobile ordering system for several hours due to the Azure issues.
Microsoft has committed to conducting an internal retrospective to understand the incident better. The company plans to publish a Preliminary Post-Incident Review within 72 hours, followed by a Final Post-Incident Review within 14 days, providing additional details and lessons learned from the event.