It begins with a familiar tremor in the digital world: a sudden spike in alerts, a cascade of red on monitoring dashboards, and the unnerving silence where a bustling website once stood. On October 20, 2025, this scenario became a reality for thousands of companies globally as Amazon Web Services (AWS), the backbone of a significant portion of the internet, experienced a prolonged and complex outage.[1] This was not a minor hiccup; it was a systemic failure that rippled across industries, silencing services from financial platforms like Venmo and Coinbase to workplace tools like Slack and Zoom, and even grounding airline operations.[1]
For us at chiragganguli.com, it was a real-time,
high-stakes test of our infrastructure, our planning, and our
resilience. This report is a firsthand account of navigating that storm.
But it aims to be more than just a war story. It is a detailed
post-mortem of the AWS outage and a comprehensive, “zero to hero” guide
for architects, engineers, and decision-makers. We will dissect the
anatomy of the failure, walk through our step-by-step recovery process,
and then zoom out to provide a blueprint for building robust,
fault-tolerant systems that can not only survive the next cloud outage
but emerge stronger from it. This is the story of how a crisis became a
powerful lesson in engineering for failure.
To understand how to build resilient systems, one must first understand how they break. The October 2025 outage was a masterclass in the complex, interconnected nature of modern cloud infrastructure, where a single fault can trigger a catastrophic chain reaction.
The disruption originated in AWS’s us-east-1 region,
located in Northern Virginia.[1] As AWS’s oldest and largest data center
hub, us-east-1 is foundational to many global services, but
its very importance makes it a frequent hotspot for high-impact
outages.[5]
The initial trigger, identified by Amazon at 12:26 AM PDT on October 20, was a Domain Name System (DNS) resolution issue affecting the API endpoint for DynamoDB, a core NoSQL database service.[7] DNS acts as the internet’s phone book; when it fails, applications lose their ability to find and communicate with critical services.[1] As one IT security officer noted, it’s a common trigger for major outages, leading to the engineering adage, “It’s always DNS”.[1] This seemingly minor infrastructure component was the first domino to fall.
Resolving the initial DNS issue was not the end of the crisis. The event revealed a series of deeply coupled dependencies within the AWS ecosystem, creating a cascading failure.
DynamoDB to EC2: Even after AWS resolved the DynamoDB DNS issue at 2:24 AM, a subsequent impairment crippled an internal subsystem of the Elastic Compute Cloud (EC2) service responsible for launching new virtual server instances. This happened because the EC2 subsystem itself had a critical dependency on the very DynamoDB service that was initially affected.[8]
EC2 to Network Services: The problems with EC2 instance launches then led to a third wave of failures. The internal system that monitors the health of Network Load Balancers (NLBs) became impaired. This, in turn, caused widespread network connectivity issues for a host of other services that rely on NLBs, including AWS Lambda, Simple Queue Service (SQS), and the CloudWatch monitoring service.[8]
The timeline shows a protracted recovery. The initial disruption began around 12:00 AM PDT, but full resolution across all services was not declared until 3:01 PM PDT, more than 15 hours later.[1] This long tail of recovery highlights that the problem was not a single fault but a series of interconnected system failures. A failure in a primitive service like DNS did not just impact that service; it set off a chain reaction that destabilized the entire regional stack.
The blast radius of the us-east-1 failure was immense.
Outage-tracking site Downdetector recorded over 6.5 million user
reports.[1] High-profile companies across every sector were affected,
including Duolingo, Roblox, Fortnite, WhatsApp, Delta Air Lines, United
Airlines, and major UK banks.[1]
The financial cost of such downtime is staggering. According to one analysis, major websites can lose millions per hour during an outage. For this specific event, estimated hourly losses included approximately $72.8 million for Amazon’s own retail site, $611,986 for Snapchat, and $532,580 for Zoom.[9] These figures underscore that cloud resilience is not just a technical concern but a critical business imperative.
While the world watched the AWS status page, our team was executing a pre-planned disaster recovery (DR) strategy. Here is a look at how we weathered the storm.
Our journey began not with a news report, but with an automated alert from Amazon CloudWatch signaling an anomalous rate of HTTP 5xx server errors. Our standard incident response protocol kicked in immediately. The initial check of our application logs and server health showed no internal issues, which immediately shifted our focus outward. A quick look at the AWS Health Dashboard [10] and third-party sites like Downdetector [1] confirmed our suspicion: this was not our problem alone, but a regional AWS event.
Our resilience strategy for chiragganguli.com is built
on a Warm Standby model. This is a well-established
disaster recovery pattern where a scaled-down but fully functional copy
of the primary production environment is kept running in a separate,
independent AWS region.[11] Our primary site runs in
us-east-1, while our standby environment resides in
us-west-2.
With confirmation of a regional failure, we made the call to initiate a failover. The process was methodical and followed a practiced runbook:
DNS Switch: The first and most critical step was
to redirect user traffic. We use Amazon Route 53 for DNS management,
which is configured with a failover routing policy.[13] This policy
continuously monitors the health of our primary endpoint in
us-east-1. We manually triggered the failover, updating the
primary DNS A (alias) record so that it no longer pointed to the
us-east-1 load balancer but to the load balancer in our
us-west-2 standby region. Within minutes, traffic began
flowing to our healthy backup site.[14]
Database Promotion: Our application is stateful,
relying on an Amazon RDS database. In our primary region, we have a main
database that handles all reads and writes. In the
us-west-2 region, we maintain a cross-region read replica
that asynchronously copies data from the primary.[15] To complete the
failover, we promoted this read replica to become the new, standalone
primary database, capable of handling write operations.[15]
Scaling Up: The “warm” in Warm Standby means our
DR site runs with minimal resources to save costs. The final step was to
trigger our auto-scaling rules in us-west-2, rapidly
launching additional EC2 instances to handle the full production traffic
load that was now being directed to it.
This sequence of actions allowed chiragganguli.com to be
fully operational from our secondary region while the primary region was
still in turmoil.
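For teams that want to codify this kind of runbook, the three steps can be scripted against the AWS APIs. Below is a minimal boto3 sketch under stated assumptions: the health check ID, database identifier, and Auto Scaling group name are hypothetical placeholders, and the DNS step assumes a Route 53 failover routing policy like ours, where inverting the primary health check is one way to force traffic onto the secondary record.

```python
"""Minimal warm-standby failover runbook sketch (all resource names are placeholders)."""
import boto3

# Step 1: DNS switch. With a Route 53 failover routing policy in place, inverting the
# primary endpoint's health check marks it unhealthy, so Route 53 starts answering
# with the SECONDARY record that points at the us-west-2 load balancer.
route53 = boto3.client("route53")
route53.update_health_check(
    HealthCheckId="11111111-2222-3333-4444-555555555555",  # placeholder health check ID
    Inverted=True,
)

# Step 2: Database promotion. Promote the cross-region read replica in us-west-2 to a
# standalone primary so it can accept writes.
rds = boto3.client("rds", region_name="us-west-2")
rds.promote_read_replica(DBInstanceIdentifier="site-db-replica")
rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier="site-db-replica")

# Step 3: Scale up. Raise the warm standby's Auto Scaling group to full production capacity.
autoscaling = boto3.client("autoscaling", region_name="us-west-2")
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="site-web-asg",  # placeholder Auto Scaling group name
    MinSize=2,
    DesiredCapacity=6,
    MaxSize=10,
)
```

Failback is roughly the mirror image: restore the health check (Inverted=False), re-establish replication toward the primary region, and scale the standby back down.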
We continued to monitor the AWS Health Dashboard [10] for the
official all-clear. Once AWS confirmed that all services in
us-east-1 had returned to normal operations, we initiated a
controlled failback. This process was essentially the reverse of our
failover procedure: data written to us-west-2 during the
failover was reconciled back to us-east-1, traffic was routed back to
us-east-1 via Route 53, and the us-west-2 database was
reconfigured as a cross-region read replica of the restored primary. The real-world event
served as the ultimate validation of our DR plan, confirming that a
well-architected Warm Standby approach is a practical and effective
defense against regional cloud failures.
To build resilient systems, it’s essential to understand the fundamental concepts and terminology that underpin cloud architecture. The outage highlighted the critical difference between designing for high availability and planning for disaster recovery.
The AWS cloud is built on a hierarchy of infrastructure [16]:
Regions: A Region is a distinct physical,
geographic location in the world, like Northern Virginia
(us-east-1) or Oregon (us-west-2). Regions are
designed to be completely isolated from each other.[17]
Availability Zones (AZs): Each Region consists of multiple, isolated data centers known as Availability Zones. Each AZ has independent power, cooling, and networking, and they are physically separated by a meaningful distance.[16]
An effective analogy is to think of a Region as a city and its AZs as separate suburbs, each with its own independent power grid. A power outage in one suburb (an AZ failure) should not take down the entire city (a regional failure). This distinction is crucial:
Multi-AZ architecture is for High Availability (HA). It protects your application from local failures, such as a single data center outage.
Multi-Region architecture is for Disaster Recovery (DR). It protects your application from large-scale events, like the October 2025 outage, that affect an entire region.
These terms are often used interchangeably, but they describe different levels of resilience [20]:
High Availability (HA): Aims to minimize downtime through rapid recovery. Think of a car with a spare tire. If you get a flat, you’ll experience a short period of downtime while you change it, but you’ll be back on the road quickly.
Fault Tolerance (FT): Ensures uninterrupted operation even when a component fails. This is like a plane with multiple engines. If one engine fails, the others keep the plane flying with zero interruption and no downtime.
Disaster Recovery (DR): Is the plan for when a catastrophic failure occurs that overwhelms HA and FT measures. If your primary car is totaled in an accident, DR is the plan to get your backup car from the garage and continue your journey.
The choice of a resilience strategy is not a purely technical decision; it is driven by two key business metrics:
Recovery Time Objective (RTO): This is the maximum acceptable downtime for your application. It answers the business question, “How long can we afford to be offline?”.[23]
Recovery Point Objective (RPO): This is the maximum acceptable amount of data loss, measured in time. It answers the business question, “How much data (e.g., how many minutes of transactions) can we afford to lose?”.[23]
For example, a critical e-commerce website might have an RTO of 15 minutes and an RPO of 1 minute. This means the business cannot tolerate being down for more than 15 minutes and cannot afford to lose more than one minute’s worth of customer orders. These business requirements directly dictate the necessary technical architecture. A low RTO/RPO demands a more complex and expensive solution, while a business that can tolerate an RTO of several hours can opt for a simpler, cheaper strategy.
With the foundational concepts in place, we can now explore the practical strategies and AWS services used to build resilient applications. The right choice depends entirely on the RTO, RPO, and budget defined by the business.
AWS supports four primary DR strategies, each offering a different balance of cost, complexity, and recovery speed.[11]
| Strategy | Typical RTO | Typical RPO | Relative Cost | Implementation Complexity |
|---|---|---|---|---|
| Backup and Restore | Hours to Days | Hours | Lowest | Low |
| Pilot Light | Tens of Minutes to Hours | Minutes to Hours | Low | Medium |
| Warm Standby | Minutes to Tens of Minutes | Seconds to Minutes | Medium | Medium |
| Multi-Region Active/Active | Near-Zero | Near-Zero | Highest | High |
Implementing these strategies involves a combination of key AWS services:
Route 53 is the linchpin of any multi-region strategy. By configuring health checks and a failover routing policy, you can enable Route 53 to automatically detect when your primary application endpoint is unhealthy and reroute all user traffic to your standby region.[13] This provides a fast and reliable mechanism for redirecting users during an outage.[14]
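To make that concrete, here is a hedged boto3 sketch of a health check plus a pair of failover records. The domain name, hosted zone ID, and load balancer values are illustrative placeholders, not a recommendation of exact settings.

```python
import boto3

route53 = boto3.client("route53")

# Health check that probes the primary endpoint in us-east-1.
health = route53.create_health_check(
    CallerReference="primary-endpoint-check-001",  # any unique string
    HealthCheckConfig={
        "Type": "HTTPS",
        "FullyQualifiedDomainName": "primary.example.com",  # placeholder endpoint
        "ResourcePath": "/health",
        "Port": 443,
        "RequestInterval": 30,
        "FailureThreshold": 3,
    },
)

def failover_record(role, alb_dns, alb_zone_id, health_check_id=None):
    """Build a PRIMARY or SECONDARY alias record for the same DNS name."""
    record = {
        "Name": "www.example.com.",
        "Type": "A",
        "SetIdentifier": role.lower(),
        "Failover": role,  # "PRIMARY" or "SECONDARY"
        "AliasTarget": {
            "HostedZoneId": alb_zone_id,  # the load balancer's canonical hosted zone ID
            "DNSName": alb_dns,
            "EvaluateTargetHealth": True,
        },
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}

route53.change_resource_record_sets(
    HostedZoneId="Z0EXAMPLE",  # placeholder hosted zone ID
    ChangeBatch={"Changes": [
        failover_record("PRIMARY", "primary-alb.us-east-1.elb.amazonaws.com.",
                        "Z35SXDOTRQ7X7K", health["HealthCheck"]["Id"]),
        failover_record("SECONDARY", "standby-alb.us-west-2.elb.amazonaws.com.",
                        "Z1H1FL5HABSF5"),
    ]},
)
```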
For stateful applications, the database is often the most complex component to make resilient. The key is to understand the difference between synchronous and asynchronous replication, as this directly maps to the difference between HA and DR.
For High Availability (Multi-AZ): Amazon RDS Multi-AZ deployments create a standby database instance in a different Availability Zone within the same region. Data is replicated synchronously, meaning a transaction is not complete until it is written to both the primary and standby instances.[15] This guarantees zero data loss (an RPO of zero) for an AZ failure, but it cannot protect against a full regional outage.
For Disaster Recovery (Multi-Region): Cross-Region Read Replicas create a copy of your database in a different AWS Region. Data is replicated asynchronously, meaning there is a small delay (replication lag) between when data is written to the primary and when it appears on the replica.[15] This enables recovery from a regional disaster but means there is a potential for minimal data loss, resulting in a non-zero RPO.
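A hedged boto3 sketch of the two patterns, using placeholder instance identifiers and a placeholder source ARN, might look like this:

```python
import boto3

# High availability: a Multi-AZ standby in the same region (synchronous replication, RPO of zero).
rds_east = boto3.client("rds", region_name="us-east-1")
rds_east.modify_db_instance(
    DBInstanceIdentifier="site-db",  # placeholder primary instance
    MultiAZ=True,
    ApplyImmediately=True,
)

# Disaster recovery: a cross-region read replica in us-west-2 (asynchronous replication, non-zero RPO).
rds_west = boto3.client("rds", region_name="us-west-2")
rds_west.create_db_instance_read_replica(
    DBInstanceIdentifier="site-db-replica",
    SourceDBInstanceIdentifier="arn:aws:rds:us-east-1:123456789012:db:site-db",  # placeholder ARN
    DBInstanceClass="db.t3.medium",
)

# During a regional disaster, the replica is promoted to a standalone, writable primary:
# rds_west.promote_read_replica(DBInstanceIdentifier="site-db-replica")
```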
Infrastructure as Code is the practice of managing and provisioning infrastructure using definition files, rather than manual configuration.[28] Tools like AWS CloudFormation and Terraform allow you to define your entire application stack—servers, load balancers, databases, and network settings—in a template.[30] In a disaster recovery scenario, IaC is invaluable. It enables you to reliably and repeatedly deploy a consistent copy of your infrastructure in the recovery region, dramatically reducing recovery time and eliminating the risk of human error. This automation is the engine that makes strategies like Pilot Light and Warm Standby feasible.[11]
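As a small illustration of the idea (not our actual templates), a hedged AWS CDK sketch in Python can define one stack and instantiate it in both regions, so the recovery environment is provisioned from exactly the same code as the primary; the stack contents and names here are placeholders.

```python
from aws_cdk import App, Environment, Stack
from aws_cdk import aws_ec2 as ec2
from constructs import Construct

class SiteStack(Stack):
    """One definition of the application's network, reused in every region."""

    def __init__(self, scope: Construct, stack_id: str, **kwargs) -> None:
        super().__init__(scope, stack_id, **kwargs)
        # Placeholder resource: a VPC spanning two Availability Zones.
        ec2.Vpc(self, "SiteVpc", max_azs=2)

app = App()
# The same stack, deployed identically to the primary and recovery regions.
SiteStack(app, "site-primary", env=Environment(region="us-east-1"))
SiteStack(app, "site-standby", env=Environment(region="us-west-2"))
app.synth()
```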
A truly resilient organization does not just build a strong defense and wait for an attack. It proactively tests its systems, seeks out weaknesses, and continuously improves its posture.
Effective monitoring is the first line of defense. The goal is to move beyond basic metrics like CPU utilization and create actionable alarms in Amazon CloudWatch based on what truly matters to your users: application latency, error rates, and other key performance indicators.[32]
A common anti-pattern is “alarm fatigue,” where teams are inundated
with so many notifications that they begin to ignore them.[34] The best
practice is to only create an alarm if it is tied to a specific,
pre-defined action in a runbook. For example, you can create a
CloudWatch Alarm that monitors the
HTTPCode_Target_5XX_Count metric from your Application Load
Balancer. If this count exceeds a certain threshold for five minutes,
the alarm can trigger a notification to an Amazon Simple Notification
Service (SNS) topic, which in turn can alert your on-call team via email
or a paging service.[35]
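As an illustration, the following boto3 sketch creates such an alarm; the load balancer dimension value, threshold, and SNS topic ARN are placeholders you would replace with your own.

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Alarm: more than 50 target 5xx responses per minute for five consecutive minutes.
cloudwatch.put_metric_alarm(
    AlarmName="alb-target-5xx-spike",
    Namespace="AWS/ApplicationELB",
    MetricName="HTTPCode_Target_5XX_Count",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/site-alb/0123456789abcdef"}],  # placeholder
    Statistic="Sum",
    Period=60,                # one-minute periods
    EvaluationPeriods=5,      # five periods in a row = five minutes
    Threshold=50,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-alerts"],  # placeholder SNS topic
    AlarmDescription="Runbook: check the AWS Health Dashboard, then evaluate failover.",
)
```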
Disaster recovery plans are meaningless if they are not tested. Chaos Engineering is the discipline of running controlled experiments to proactively identify failures before they become outages.[38] It’s not about breaking things randomly; it’s about injecting precise, measured faults into a system to verify that it behaves as expected. The practice is often compared to a vaccine: injecting a small, controlled amount of harm to build immunity in the system.[40]
AWS Fault Injection Simulator (FIS) is a managed
service designed for safely running these experiments.[42] For example,
you can create an FIS experiment template that targets one of your
production EC2 instances and runs the
aws:ec2:stop-instances action.[43] The hypothesis might be:
“If one web server stops, the load balancer health checks will fail, the
instance will be removed from the pool, and user traffic will not be
affected.” Running this experiment validates your entire resiliency
chain, from the automated technical response to the monitoring and
alerting that notifies your team.
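Here is a hedged boto3 sketch of what that experiment might look like through the API; the IAM role ARN, guardrail alarm ARN, and instance tag are hypothetical, and your template will likely differ.

```python
import boto3

fis = boto3.client("fis", region_name="us-east-1")

# Experiment template: stop one tagged web server, restart it after 10 minutes,
# and halt the experiment if a guardrail CloudWatch alarm fires.
template = fis.create_experiment_template(
    clientToken="stop-one-web-server-001",
    description="Verify the load balancer removes a stopped instance without user impact",
    roleArn="arn:aws:iam::123456789012:role/fis-experiment-role",  # placeholder role
    targets={
        "web-servers": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"Role": "web"},  # placeholder tag
            "selectionMode": "COUNT(1)",      # pick exactly one instance
        }
    },
    actions={
        "stop-one-instance": {
            "actionId": "aws:ec2:stop-instances",
            "parameters": {"startInstancesAfterDuration": "PT10M"},
            "targets": {"Instances": "web-servers"},
        }
    },
    stopConditions=[{
        "source": "aws:cloudwatch:alarm",
        "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:alb-target-5xx-spike",  # placeholder
    }],
)

# Run the experiment; afterwards, compare the observed behavior with the hypothesis.
fis.start_experiment(
    clientToken="stop-one-web-server-run-001",
    experimentTemplateId=template["experimentTemplate"]["id"],
)
```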
This approach creates a virtuous cycle of continuous improvement: you monitor your system’s steady state, create alerts for deviations, test your assumptions with Chaos Engineering, and use the findings to improve your architecture. This is how an organization transitions from a reactive, firefighting culture to a proactive, resilient one.
The October 2025 AWS outage was a stark reminder of a fundamental principle of modern systems design, famously articulated by AWS CTO Werner Vogels: “Everything fails, all the time.” The cascading nature of the failure demonstrated that even the most sophisticated cloud platforms are not immune to complex, unpredictable disruptions.
Our experience at chiragganguli.com validated the
immense business value of investing in a well-architected and regularly
tested disaster recovery plan. Our Warm Standby strategy performed
exactly as designed, allowing us to restore service from a secondary
region and mitigate the impact on our users.
The key takeaways are clear. Resilience is not an accident; it is a deliberate engineering practice. It begins with understanding the business needs by defining RTO and RPO. It is implemented through a layered strategy of high availability and disaster recovery, using tools like Route 53, multi-region database replication, and Infrastructure as Code. Finally, it is hardened and validated through proactive practices like intelligent monitoring and Chaos Engineering. The goal is not to prevent 100% of failures—an impossible task—but to build systems that anticipate, withstand, and gracefully recover from them. We encourage every engineering team to review their own architecture, ask the hard questions about their recovery objectives, and begin the journey of building for failure.
[1] Amazon Web Services outage: What brought the internet down across the world for more than 15 hours, accessed October 21, 2025, https://timesofindia.indiatimes.com/technology/tech-news/amazon-web-services-outage-what-brought-the-internet-down-across-the-world-for-more-than-15-hours/articleshow/124719537.cms
[2] AWS outage triggers widespread issues on Alexa, Prime Video, Perplexity, Signal, other apps and websites, accessed October 21, 2025, https://www.businesstoday.in/world/story/aws-outage-triggers-widespread-issues-on-alexa-prime-video-perplexity-signal-other-apps-and-websites-498980-2025-10-20
[3] Major Global Outage Impacts Amazon, Snapchat, Airline Websites, and More. What to Know, accessed October 21, 2025, https://time.com/7326950/global-internet-outage-amazon-web-services-websites-apps/
[4] Expert reaction to Amazon internet services outage | Science Media Centre, accessed October 21, 2025, https://www.sciencemediacentre.org/expert-reaction-to-amazon-internet-services-outage/
[5] The History of AWS Outage - StatusGator, accessed October 21, 2025, https://statusgator.com/blog/aws-outage-history/
[6] AWS Post-Event Summaries, accessed October 21, 2025, https://aws.amazon.com/premiumsupport/technology/pes/
[7] Amazon Web Services outage: Company identifies potential root cause; says, ‘Based on our investigation …’, accessed October 21, 2025, https://timesofindia.indiatimes.com/technology/tech-news/amazon-web-services-outage-company-identifies-potential-root-cause-says-based-on-our-investigation-/articleshow/124703538.cms
[8] Service health - Oct 20, 2025 | AWS Health Dashboard | Global, accessed October 21, 2025, https://health.aws.amazon.com/
[9] Counting the cost of AWS outage: Shocking per hour figures revealed | Hindustan Times, accessed October 21, 2025, https://www.hindustantimes.com/world-news/us-news/aws-outage-how-much-can-amazon-web-services-failure-cost-shocking-figures-revealed-101760981540908.html
[10] AWS Health Dashboard, accessed October 21, 2025, https://docs.aws.amazon.com/health/latest/ug/aws-health-dashboard-status.html
[11] Disaster Recovery on AWS: 4 Strategies and How to Deploy Them, accessed October 21, 2025, https://cloudian.com/guides/disaster-recovery/disaster-recovery-on-aws-4-strategies-and-how-to-deploy-them/
[12] Disaster Recovery (DR) Architecture on AWS, Part III: Pilot Light and Warm Standby, accessed October 21, 2025, https://aws.amazon.com/blogs/architecture/disaster-recovery-dr-architecture-on-aws-part-iii-pilot-light-and-warm-standby/
[13] Use Route 53 health checks for DNS failover | AWS re:Post, accessed October 21, 2025, https://repost.aws/knowledge-center/route-53-dns-health-checks
[14] DNS Failover with Route53 - Medium, accessed October 21, 2025, https://medium.com/tysonworks/dns-failover-with-route53-cc3427a3629a
[15] Implementing Data Replication Strategies for Disaster … - Firefly, accessed October 21, 2025, https://www.firefly.ai/academy/implementing-data-replication-strategies-for-disaster-recovery-in-the-cloud
[16] AWS Global Infrastructure - Regions az, accessed October 21, 2025, https://aws.amazon.com/about-aws/global-infrastructure/regions_az/
[17] REL10-BP01 Deploy the workload to multiple locations - Reliability Pillar - AWS Documentation, accessed October 21, 2025, https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/rel_fault_isolation_multiaz_region_system.html
[18] AWS multi-Region fundamentals - AWS Prescriptive Guidance - AWS Documentation, accessed October 21, 2025, https://docs.aws.amazon.com/prescriptive-guidance/latest/aws-multi-region-fundamentals/introduction.html
[19] Multi-AZ vs. Multi-Region in the Cloud - FlashGrid, accessed October 21, 2025, https://www.flashgrid.io/news/multi-az-vs-multi-region-in-the-cloud/
[20] www.couchbase.com, accessed October 21, 2025, https://www.couchbase.com/blog/high-availability-vs-fault-tolerance/#:~:text=High%20availability%20focuses%20on%20minimizing%20downtime%20through%20fast%20recovery%2C%20while,%2C%20complexity%2C%20and%20cost%20constraints.
[21] High Availability vs. Fault Tolerance: Key Differences - Couchbase, accessed October 21, 2025, https://www.couchbase.com/blog/high-availability-vs-fault-tolerance/
[22] High Availability vs Fault Tolerance vs Disaster Recovery – Explained with an Analogy, accessed October 21, 2025, https://www.freecodecamp.org/news/high-availability-fault-tolerance-and-disaster-recovery-explained/
[23] www.druva.com, accessed October 21, 2025, https://www.druva.com/blog/understanding-rpo-and-rto#:~:text=RPO%20designates%20the%20variable%20amount,flow%20of%20normal%20business%20operations.
[24] RTO (Recovery Time Objective) and RPO (Recovery Point Objective) | Explore - Commvault, accessed October 21, 2025, https://www.commvault.com/explore/rto-rpo
[25] RTO vs. RPO: What’s the Difference and How are They Used? - Riskonnect, accessed October 21, 2025, https://riskonnect.com/business-continuity-resilience/rto-rpo-differences-and-uses/
[26] Disaster recovery is different in the cloud - AWS Documentation, accessed October 21, 2025, https://docs.aws.amazon.com/whitepapers/latest/disaster-recovery-workloads-on-aws/disaster-recovery-is-different-in-the-cloud.html
[27] Manual Failover and Failback Strategy with Amazon Route53 | Networking & Content Delivery, accessed October 21, 2025, https://aws.amazon.com/blogs/networking-and-content-delivery/manual-failover-and-failback-strategy-with-amazon-route53/
[28] Infrastructure as Code (IaC): A Beginner’s Guide 2023 - Turing, accessed October 21, 2025, https://www.turing.com/blog/infrastructure-as-code-iac-guide
[29] Infrastructure as Code (IaC) - GeeksforGeeks, accessed October 21, 2025, https://www.geeksforgeeks.org/devops/what-is-infrastructure-as-code-iac/
[30] What is Infrastructure as Code with Terraform? - HashiCorp Developer, accessed October 21, 2025, https://developer.hashicorp.com/terraform/tutorials/aws-get-started/infrastructure-as-code
[31] Infrastructure as Code (IaC): A Comprehensive Guide - Kellton, accessed October 21, 2025, https://www.kellton.com/kellton-tech-blog/infrastructure-as-code-a-complete-guide
[32] 10 AWS Monitoring Best Practices in 2025 - Middleware Observability, accessed October 21, 2025, https://middleware.io/blog/aws-monitoring/best-practices/
[33] Monitoring and alerting tools and best practices for Amazon RDS for MySQL and MariaDB, accessed October 21, 2025, https://docs.aws.amazon.com/prescriptive-guidance/latest/amazon-rds-monitoring-alerting/introduction.html
[34] Alarms | AWS Observability Best Practices - GitHub Pages, accessed October 21, 2025, https://aws-observability.github.io/observability-best-practices/signals/alarms/
[35] Guided Lab: Creating a CloudWatch Alarm - Tutorials Dojo Portal, accessed October 21, 2025, https://portal.tutorialsdojo.com/courses/playcloud-sandbox-aws/lessons/guided-lab-creating-a-cloudwatch-alarm/
[36] A Step-by-Step Guide to Setting Up CloudWatch Alarms for AWS Monitoring - Medium, accessed October 21, 2025, https://medium.com/@aslam.muhammedclt/a-step-by-step-guide-to-setting-up-cloudwatch-alarms-for-aws-monitoring-66877304fafb
[37] How to create an AWS CloudWatch alarm, accessed October 21, 2025, https://awsmadeeasy.com/blog/create-aws-cloudwatch-alarm/
[38] www.opentext.com, accessed October 21, 2025, https://www.opentext.com/what-is/chaos-engineering#:~:text=Chaos%20engineering%20is%20not%20about,prevent%20outages%20and%20other%20disruptions.
[39] What is Chaos Engineering? | IBM, accessed October 21, 2025, https://www.ibm.com/think/topics/chaos-engineering
[40] Chaos Engineering - Gremlin, accessed October 21, 2025, https://www.gremlin.com/chaos-engineering
[41] Chaos Engineering: the history, principles, and practice - Gremlin, accessed October 21, 2025, https://www.gremlin.com/community/tutorials/chaos-engineering-the-history-principles-and-practice
[42] Tutorials for AWS Fault Injection Service, accessed October 21, 2025, https://docs.aws.amazon.com/fis/latest/userguide/fis-tutorials.html
[43] Tutorial: Simulate a connectivity event - AWS Fault Injection Service, accessed October 21, 2025, https://docs.aws.amazon.com/fis/latest/userguide/fis-tutorial-disrupt-connectivity.html
[44] AWS FIS (Fault Injection Simulator) for Chaos Engineering | by Christopher Adamson, accessed October 21, 2025, https://medium.com/@christopheradamson253/aws-fis-fault-injection-simulator-for-chaos-engineering-02eb538bdd0c