AWS' DNS Failure: A Case for Cloud Resilience

A major outage at Amazon Web Services (AWS) brought thousands of global services to a standstill, affecting popular applications such as Zoom, Slack and monday.com.
The incident highlights the fragility of cloud infrastructure and the significant operational risks businesses face due to their reliance on a single provider.
The subsequent disruption also reignites critical discussions among senior leaders about the necessity for multi-cloud strategies and robust resilience planning.
Following the widespread disruption, AWS conducted a root-cause analysis.
It confirmed that a fault within its internal automation systems was the trigger, leading to a series of Domain Name System (DNS) failures in its US-East-1 region, AWS' oldest and busiest data hub.
While the provider restored full service within hours, the event serves as an important reminder of the interconnectedness of digital services and the potential for single points of failure.
Analysing the AWS DNS failure
The specific problem originated from an error in a configuration automation process. This error prevented domain names from resolving correctly to their IP addresses within DynamoDB, a core data service for AWS.
According to Amazon's post-event summary, the fault occurred after a routine update and "caused a backlog of messages that took several hours to process".
The failure had a domino effect, disrupting connections for more than 1,000 interconnected sites and services worldwide, including those of major financial institutions like Lloyds Bank and Venmo.
The incident demonstrates how a single regional issue can have far-reaching consequences across multiple industries.
It froze financial transactions, blocked critical communication tools and took numerous streaming and shopping platforms offline.
In a statement, Amazon apologised for the disruption.
"We apologise for the impact this event caused our customers," it says.
"We know how critical our services are to our customers, their applications and end users and their businesses. We know this event impacted many customers in major ways."
Cloud dependency and cyber risk exposure
For technology leaders, the outage is a clear signal that even hyperscale cloud providers are not infallible.
Jamil Ahmed, Distinguished Engineer at Solace, explains: "Even as cloud technology evolves, failures within the system will inevitably happen.
"'One-of-a-kind', extremely rare outages or issues continue to plague every service provider from time to time, which is why the need to store valuable information on multiple provider services, creating an event mesh, has arisen [...] It is now 'later on' and the strategy of using one cloud service is demonstrably dangerous and negligent."
The fallout from such infrastructure failures extends beyond operational downtime. Cybersecurity experts highlight the increased risks that emerge during these events.
"This widespread outage is a stark reminder that even massive infrastructure providers are not immune to cascading failures," says Christian Espinosa, Founder and CEO of Blue Goat Cyber.
"What makes it more dangerous for businesses is how these disruptions magnify cyber risk. When platforms go dark, organisations inadvertently shift into backup systems, remote tools are stressed and control lapses become exploitable."
According to data from analysts at Ookla, more than 17 million outage reports were recorded globally in the first few hours, with most originating from US users connected to the affected AWS East Coast infrastructure.
Estimates from Deployflow suggest the downtime could have cost enterprises between US$5,000 and US$9,000 per minute.
Building infrastructure resilience and redundancy
Industry experts maintain that while outages are a risk, they can be managed with proactive strategies.
Jake Madders, Director and Co-Founder at Hyve Managed Hosting, advises on how businesses can avoid similar vulnerabilities.
"Even the largest and most reliable cloud providers can experience large-scale outages, but these risks can be mitigated," he says.
"The key lies in building resilience into your infrastructure from the outset. Diversifying across multiple cloud providers and geographic regions is essential to ensure redundancy and enable seamless failover when disruption occurs."
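As a rough illustration of the failover idea Madders describes, the Python sketch below walks a priority-ordered list of endpoints and picks the first one whose hostname still resolves in DNS. It is a minimal sketch under stated assumptions, not a description of how AWS or any company quoted here handles failover: the hostnames are hypothetical placeholders on reserved example domains, and the function names `resolve` and `pick_endpoint` are invented for this example. A production setup would also need health checks, timeouts and data replication.

```python
import socket
from typing import Callable, Optional, Sequence


def resolve(hostname: str) -> Optional[str]:
    """Return one IP address for hostname, or None if DNS resolution fails."""
    try:
        infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # The name could not be resolved, which is the failure mode
        # described in the article.
        return None
    return infos[0][4][0] if infos else None


def pick_endpoint(
    hostnames: Sequence[str],
    resolver: Callable[[str], Optional[str]] = resolve,
) -> Optional[str]:
    """Return the first hostname, in priority order, that currently resolves."""
    for name in hostnames:
        if resolver(name) is not None:
            return name
    return None


if __name__ == "__main__":
    # Hypothetical endpoints: a primary region and a fallback hosted with
    # a second provider. A stub resolver simulates the primary's DNS
    # failing, so the example runs without network access.
    simulated_dns = {
        "api.primary-region.example.com": None,
        "api.fallback-provider.example.net": "203.0.113.10",
    }
    chosen = pick_endpoint(list(simulated_dns), resolver=simulated_dns.get)
    print(chosen)
```

Passing the resolver in as a parameter keeps the selection logic separate from the network call, so the failover path can be rehearsed before a real outage forces it.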
Ultimately, the ability to recover swiftly from an outage is what can set a business apart. Visibility into system performance is a critical component of this capability.
Rob van Lubek, EMEA Vice President at Dynatrace, notes: "Global incidents like this are a clear reminder of how dependent our world has become on software and digital systems.
"The difference between disruption and recovery often comes down to visibility and speed: how fast an organisation can pinpoint what's gone wrong, understand why and act to restore service continuity."