AWS' DNS Failure: A Case for Cloud Resilience

Following a major AWS outage caused by a DNS failure, experts discuss the risks of cloud dependency and the need for a multi-cloud resilience strategy

A major outage at Amazon Web Services (AWS) brought thousands of global services to a standstill, affecting popular applications such as Zoom, Slack and monday.com.

The incident highlights the fragility of cloud infrastructure and the significant operational risks businesses face due to their reliance on a single provider.

The subsequent disruption also reignites critical discussions among senior leaders about the necessity for multi-cloud strategies and robust resilience planning.

Following the widespread disruption, AWS conducted a root-cause analysis.

It confirmed that a fault within its internal automation systems was the trigger, leading to a series of Domain Name System (DNS) failures in its US-East-1 region, AWS' oldest and busiest data hub.

While the provider restored full service within hours, the event serves as an important reminder of the interconnectedness of digital services and the potential for single points of failure.

Analysing the AWS DNS failure

The specific problem originated from an error in a configuration automation process. This error prevented the domain names for DynamoDB, a core AWS data service, from resolving correctly to IP addresses.
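For readers who want to see what a resolution failure looks like from the client side, the following is a minimal Python sketch, not AWS' own tooling. It assumes a client holds an ordered list of candidate hostnames and wants the first one that still resolves. The function names `resolve` and `first_resolvable`, and the hostnames in the usage example, are hypothetical.

```python
import socket


def resolve(hostname, port=443):
    """Return the sorted unique IP addresses for hostname, or [] if DNS fails."""
    try:
        infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # The name could not be resolved, which is the kind of failure
        # clients saw during the outage.
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos})


def first_resolvable(hostnames, resolver=resolve):
    """Return (hostname, addresses) for the first name that resolves."""
    for name in hostnames:
        addresses = resolver(name)
        if addresses:
            return name, addresses
    return None, []


if __name__ == "__main__":
    # Hypothetical hostnames; a stub resolver keeps the demo offline.
    stub = {
        "db.primary.example": [],                 # simulated DNS failure
        "db.secondary.example": ["192.0.2.10"],   # documentation-range IP
    }
    print(first_resolvable(list(stub), resolver=lambda name: stub[name]))
```

Passing the resolver as a parameter is a design choice for testability: the same fallback logic can be exercised without touching real DNS.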

According to Amazon’s post-event summary, the fault occurred after a routine update and “caused a backlog of messages that took several hours to process”.

The failure had a domino effect, disrupting connections for more than 1,000 interconnected sites and services worldwide, including financial services such as Lloyds Bank and Venmo.

The incident demonstrates how a single regional issue can have far-reaching consequences across multiple industries.

It froze financial transactions, blocked critical communication tools and took numerous streaming and shopping platforms offline.

In a statement, Amazon apologised for the disruption.

“We apologise for the impact this event caused our customers,” it says.

“We know how critical our services are to our customers, their applications and end users and their businesses. We know this event impacted many customers in major ways.”

Cloud dependency and cyber risk exposure

For technology leaders, the outage is a clear signal that even hyperscale cloud providers are not infallible.

Jamil Ahmed, Distinguished Engineer at Solace, explains: “Even as cloud technology evolves, failures within the system will inevitably happen.

“'One-of-a-kind', extremely rare outages or issues continue to plague every service provider from time to time, which is why the need to store valuable information on multiple provider services, creating an event mesh, has arisen [...] It is now ‘later on’ and the strategy of using one cloud service is demonstrably dangerous and negligent.”

The fallout from such infrastructure failures extends beyond operational downtime. Cybersecurity experts highlight the increased risks that emerge during these events.

“This widespread outage is a stark reminder that even massive infrastructure providers are not immune to cascading failures,” says Christian Espinosa, Founder and CEO of Blue Goat Cyber.

“What makes it more dangerous for businesses is how these disruptions magnify cyber risk. When platforms go dark, organisations inadvertently shift into backup systems, remote tools are stressed and control lapses become exploitable.”

According to data from analysts at Ookla, more than 17 million outage reports were recorded globally in the first few hours, with most originating from US users connected to the affected AWS East Coast infrastructure.

Estimates from Deployflow suggest the downtime could have cost enterprises between US$5,000 and US$9,000 per minute.

Building infrastructure resilience and redundancy

Industry experts maintain that while outages are a risk, they can be managed with proactive strategies.

Jake Madders, Director and Co-Founder at Hyve Managed Hosting, advises on how businesses can avoid similar vulnerabilities.

“Even the largest and most reliable cloud providers can experience large-scale outages – but these risks can be mitigated,” he says.

“The key lies in building resilience into your infrastructure from the outset. Diversifying across multiple cloud providers and geographic regions is essential to ensure redundancy and enable seamless failover when disruption occurs.”
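As a rough illustration of the failover idea described here, the sketch below tries a list of endpoints in priority order and returns the first successful response. It is a simplified assumption of how such logic might look, not a description of any vendor's product. The names `call_with_failover` and `AllEndpointsFailed`, and the endpoint labels, are hypothetical.

```python
class AllEndpointsFailed(Exception):
    """Raised when every endpoint in the priority list has failed."""


def call_with_failover(endpoints, send):
    """Try each endpoint in order; return (endpoint, result) on first success.

    `send` is any callable that takes an endpoint and either returns a
    result or raises an exception when that endpoint is unavailable.
    """
    errors = {}
    for endpoint in endpoints:
        try:
            return endpoint, send(endpoint)
        except Exception as exc:  # broad on purpose: any failure triggers failover
            errors[endpoint] = exc
    raise AllEndpointsFailed(errors)


if __name__ == "__main__":
    # Hypothetical endpoint labels spanning two providers and regions.
    def fake_send(endpoint):
        if endpoint == "provider-a/us-east":
            raise ConnectionError("simulated regional outage")
        return "ok"

    print(call_with_failover(["provider-a/us-east", "provider-b/eu-west"], fake_send))
```

In practice the hard part is less the retry loop than keeping data and configuration in sync across providers, so that the secondary endpoint can actually serve the request.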

Ultimately, the ability to recover swiftly from an outage is what can set a business apart. Visibility into system performance is a critical component of this capability.

Rob van Lubek, EMEA Vice President at Dynatrace, notes: “Global incidents like this are a clear reminder of how dependent our world has become on software and digital systems.

“The difference between disruption and recovery often comes down to visibility and speed – how fast an organisation can pinpoint what’s gone wrong, understand why and act to restore service continuity.”
