On January 8, 2025, Microsoft Azure began experiencing a significant networking outage in the East US 2 region. This outage has impacted a broad range of Azure services (including Azure App Service, VMs, and more!), causing disruptions for businesses relying on these services for critical workloads. As of the time I’m writing this, the issue remains unresolved, and services continue to experience intermittent or complete downtime. This article provides an overview of the incident, its impacts, and the steps organizations can take to mitigate similar risks in the future.
Outage Mitigated January 11, 2025 04:30 UTC: Microsoft has reported that this issue has been fully mitigated, and all Azure services within the East US 2 region are reported as “good”.
Root Cause and Nature of the Azure Networking Outage
According to Microsoft’s outage status updates, the outage began at approximately 22:00 UTC on January 8, 2025, when a configuration change in a regional networking service caused an inconsistent service state in the East US 2 region. This triggered widespread connectivity issues, failures in resource allocation, and disruptions in communication between services. The issue is localized to a single zone within the East US 2 region, but its ripple effects have significantly impacted a range of Azure services.
Microsoft identified the root cause as a network configuration issue that rendered three storage partitions in the region unhealthy. While mitigation efforts, such as rerouting traffic and rehydrating the impacted storage partitions, have been partially successful, some services and customers remain affected. The company has advised affected customers to execute Disaster Recovery (DR) procedures to minimize downtime.
What happened, according to Microsoft:
“Between 22:00 UTC on 08 Jan 2025 and 04:30 UTC on 11 Jan 2025, a networking issue in a single zone in East US 2 resulted in impact to multiple Azure Services in the region. This may have resulted in intermittent Virtual Machine connectivity issues, failures in allocating resources or communicating with resources in the hosted region. The services impacted include but were not limited to, Azure Databricks, Azure Container Apps, Azure Function Apps, Azure App Service, Azure Logic Apps, SQL Managed Instances, Azure Databricks, Azure Synapse, Azure Data Factory, Azure Container Instances, API Management, Azure NetApp Files, DevOps, Azure Stream Analytics, PowerBI, VMSS, PostgreSQL flexible servers, and Azure RedHat Openshift. Customers using resources with Private Endpoint Network Security Groups communicating with other services may have also been impacted.
The impact was limited to a single zone in East US 2 region. No other regions were impacted by this issue.”
History of the Status from Microsoft
- January 8, 2025 – Outage Begins (22:00 UTC)
- Service monitoring systems first detected a networking issue in East US 2 that was impacting multiple services.
- Initial investigations identified a network configuration issue in one of the zones, which caused three Storage partitions to become unhealthy.
- January 9, 2025 – Traffic Rerouted
- Microsoft rerouted traffic away from the affected zone to alleviate the impact on non-zonal services and facilitate new resource allocations.
- Services relying on zonal requests to the impacted zone remained unhealthy.
- Some impacted services initiated their own Disaster Recovery options to mitigate the effects.
- January 10, 2025 – Partial Recovery (14:36 UTC)
- Two of the three affected storage partitions were restored. Some services regained partial functionality.
- Zonal requests to the impacted zone and services reliant on these partitions remained unhealthy.
- A patch was applied for Private Link, restoring functionality for some dependent services.
- January 10, 2025 – Full Partition Recovery (16:39 UTC)
- All three affected Storage partitions were brought back online following additional workstreams to rehydrate the impacted zone.
- Significant signs of recovery were observed; however, individual services were still in the process of recovery.
- Some customers continued to experience intermittent errors and degraded performance as services regained full functionality.
- Customers were advised to execute Disaster Recovery options and to avoid failing back to the affected region until the incident was fully mitigated.
- January 11, 2025 – Traffic Rebalancing (00:30 UTC)
- Microsoft began a phased approach to rebalance traffic across all zones and confirm that networking traffic was flowing as expected in the region.
- January 11, 2025 – Fully Mitigated (04:30 UTC)
- After monitoring service health, Microsoft determined that the incident was fully mitigated.
The latest update, as of January 11, 2025 04:30 UTC, indicates the issue is fully mitigated, and all Azure services are reported to be “good.”
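To put the timeline in perspective, the start and end times Microsoft reported can be turned into a total incident window. The sketch below is illustrative only: the function names are hypothetical, and the availability figure assumes a workload that was completely unavailable for the entire window, which is a worst case rather than something Microsoft's description (intermittent impact in a single zone) states.

```python
from datetime import datetime, timezone

# Start and end of the incident window as reported in Microsoft's status updates.
OUTAGE_START = datetime(2025, 1, 8, 22, 0, tzinfo=timezone.utc)
OUTAGE_END = datetime(2025, 1, 11, 4, 30, tzinfo=timezone.utc)


def outage_hours(start: datetime, end: datetime) -> float:
    """Length of the incident window in hours."""
    return (end - start).total_seconds() / 3600


def worst_case_monthly_availability(downtime_hours: float, days_in_month: int = 31) -> float:
    """Availability for the month IF the workload was down for the whole window.

    This is an assumption for illustration: many workloads saw only
    intermittent errors, so their real availability would be higher.
    """
    return 1 - downtime_hours / (days_in_month * 24)


hours = outage_hours(OUTAGE_START, OUTAGE_END)
availability = worst_case_monthly_availability(hours)
print(f"Incident window: {hours} hours")
print(f"Worst-case January availability: {availability:.2%}")
```

Comparing a number like this against the availability target of a single-region workload is one way to frame how much downtime a multi-region design would need to absorb.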
Services Impacted
The outage affected the following Azure services in the East US 2 region:
- App Service
- Azure Container Apps
- Azure Container Instances
- Azure Container Services
- Azure Data Explorer
- Azure Data Factory
- Azure Database for PostgreSQL flexible servers
- Azure Databricks
- Azure Functions
- Azure NetApp Files
- Azure SQL Managed Instance
- Azure Synapse Analytics
- Network Infrastructure
- Virtual Machine Scale Sets
- Virtual Machines
Organizations using private endpoints and Network Security Groups (NSGs) for communication between services have also reported issues.
Strategies for Ensuring Resilience Through Multi-Region Deployments
This incident highlights the critical importance of designing cloud workloads with resilience in mind. Businesses can protect against regional outages by leveraging multi-region deployments, which provide the following benefits:
- Redundancy: Services deployed across multiple regions can fail over to an unaffected region in the event of an outage.
- Improved Disaster Recovery: DR plans can be executed seamlessly if workloads are distributed across regions.
- Minimized Downtime: Geographic distribution ensures continued service availability even when a single region faces issues.
To implement multi-region deployments effectively, organizations should consider the following:
- Strategic Service Placement: Place critical resources in geographically dispersed regions to reduce the risk of a single point of failure.
- Global Traffic Management: Use services like Azure Front Door or Azure Traffic Manager to distribute traffic intelligently across regions based on availability and performance.
- Data Replication: Utilize features like Geo-Replication for Azure Storage or Global Database replication with Azure Cosmos DB to ensure data is accessible in alternate regions.
- Automated Failover: Design systems to detect outages and trigger automatic failover mechanisms to alternate regions without requiring manual intervention.
- Regular Testing: Conduct regular disaster recovery drills to validate that failover and redundancy mechanisms are working as expected.
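The automated failover idea above can be sketched in a few lines. This is a minimal illustration of priority-based failover driven by health probes, not how Azure Front Door or Azure Traffic Manager is implemented. The `RegionEndpoint` and `FailoverRouter` names, the probe callback, and the threshold of three consecutive failures are all assumptions made for the example, and the region names are used only as plain labels.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class RegionEndpoint:
    """A hypothetical regional deployment; a lower priority number is preferred."""
    name: str
    priority: int
    consecutive_failures: int = 0


class FailoverRouter:
    """Route to the most preferred region that is not currently marked down.

    A region is marked down after `failure_threshold` consecutive failed
    probes and is considered healthy again after one successful probe.
    """

    def __init__(
        self,
        regions: List[RegionEndpoint],
        probe: Callable[[str], bool],
        failure_threshold: int = 3,
    ) -> None:
        self.regions = sorted(regions, key=lambda r: r.priority)
        self.probe = probe  # returns True when the named region responds
        self.failure_threshold = failure_threshold

    def run_probes(self) -> None:
        """Probe every region once and update its failure counter."""
        for region in self.regions:
            if self.probe(region.name):
                region.consecutive_failures = 0
            else:
                region.consecutive_failures += 1

    def is_healthy(self, region: RegionEndpoint) -> bool:
        return region.consecutive_failures < self.failure_threshold

    def active_region(self) -> Optional[str]:
        """Name of the region that should receive traffic, or None if all are down."""
        for region in self.regions:
            if self.is_healthy(region):
                return region.name
        return None


# Simulated probe results (assumption: a real probe would call a health endpoint).
status: Dict[str, bool] = {"eastus2": False, "centralus": True}
router = FailoverRouter(
    [RegionEndpoint("eastus2", priority=1), RegionEndpoint("centralus", priority=2)],
    probe=lambda name: status[name],
)
for _ in range(3):
    router.run_probes()
print("Traffic is routed to:", router.active_region())
```

Requiring several consecutive failures avoids failing over on a single dropped probe. Note that this sketch fails back as soon as the primary answers one probe; given the guidance during this incident to avoid failing back until the issue was fully mitigated, a production design would likely gate failback behind a longer recovery window or a manual approval.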
By adopting these practices, businesses can enhance their resilience, reduce downtime, and ensure a seamless experience for end-users even during unexpected outages.
Summary
The networking outage in Azure’s East US 2 region disrupted a wide array of services and affected many organizations. Microsoft has since reported the issue as fully mitigated, but the incident underscores the vulnerability of single-region deployments. Businesses should use this as a wake-up call to prioritize resilience in their cloud architectures by adopting multi-region strategies.
Organizations should review and refine their Disaster Recovery and failover plans to ensure preparedness for any similar outages in the future.
Original Article Source: Major Azure Networking Outage in East US 2 Affecting VMs, App Service, and More! (January 8 – 11, 2025) by Chris Pietschmann (If you’re reading this somewhere other than Build5Nines.com, it was republished without permission.)