The Cause Of Amazon’s Cloud Outage

Amazon Web Services (AWS) has explained the cause of their outage, which took down thousands of third-party online services for hours. Amazon say that, “the new capacity had caused all of the servers in the fleet to exceed the maximum number of threads allowed by an operating system configuration... As this limit was being exceeded, cache construction was failing to complete and front-end servers were ending up with useless shard-maps that left them unable to route requests to back-end clusters.” 

While dozens of services were affected, AWS says the outage occurred in its Northern Virginia, US-East-1, region. It happened after a "small addition of capacity" to its front-end fleet of Kinesis servers. 

Amazon Kinesis enables real-time processing of streaming data. In addition to its direct use by customers, Kinesis is used by several other AWS services and these services also saw impact during the shutdown. Kinesis is used by developers, as well as other AWS services like CloudWatch and Cognito authentication, to capture data and video streams and run them through AWS machine-learning platforms.  

The Kinesis service's front-end handles authentication, throttling, and distributes workloads to its back-end "workhorse" cluster via a database mechanism called sharding.  

Amazon’s additions to capacity triggered the outage but wasn't the root cause of it. AWS was adding capacity for an hour after 2:44am PST, and after that all the servers in Kinesis front-end fleet began to exceed the maximum number of threads allowed by its current operating system configuration.  The first alarm was triggered at 5:15am PST and AWS engineers spent the next five hours trying to resolve the issue. Kinesis was fully restored at 10:23pm PST. 

Amazon explains how the front-end servers distribute data across its Kinesis back-end: "Each server in the front-end fleet maintains a cache of information, including membership details and shard ownership for the back-end clusters, called a shard-map." According to AWS, that information is obtained through calls to a micro service vending the membership information, retrieval of configuration information from DynamoDB and continuous processing of messages from other Kinesis front-end servers. For Kinesis communication, each front-end server creates operating system threads for each of the other servers in the front-end fleet. Upon any addition of capacity, the servers that are already operating members of the fleet will learn of new servers joining and establish the appropriate threads. It takes up to an hour for any existing front-end fleet member to learn of new participants." 

As the number of threads exceeded the OS configuration, the front-end servers ended up with "useless shard-maps" and were unable to route requests to Kinesis back-end clusters. AWS had already rolled back the additional capacity that triggered the event but had reservations about boosting the thread limit in case it delayed the recovery.  

As a first step, AWS has moved to larger CPU and memory servers, as well as reduced the total number of servers and threads required by each server to communicate across the fleet.  It's also testing an increase in thread count limits in its operating system configuration and working to "radically improve the cold-start time for the front-end fleet".  

CloudWatch and other large AWS services will move to a separate, partitioned front-end fleet. AWS is also working on a broader project to isolate failures in one service from affecting other services.  

AWS has also acknowledged the delays in updating its Service Health Dashboard during the incident, but says that was because the tool its support engineers use to update the public dashboard was affected by the outage. During that time, it was updating customers via the Personal Health Dashboard.   Amazon has apologised for the impact this event caused its customers.

Amazon:        Down Detector:       ZDNet

You Might Also Read:

The Risks &  Benefits Of Cloud Security:

 

« We Live In A Transient Internet
Orca Security Wants To Streamline Cloud Computing »

CyberSecurity Jobsite
Perimeter 81

Directory of Suppliers

North Infosec Testing (North IT)

North Infosec Testing (North IT)

North IT (North Infosec Testing) are an award-winning provider of web, software, and application penetration testing.

CYRIN

CYRIN

CYRIN® Cyber Range. Real Tools, Real Attacks, Real Scenarios. See why leading educational institutions and companies in the U.S. have begun to adopt the CYRIN® system.

TÜV SÜD Academy UK

TÜV SÜD Academy UK

TÜV SÜD offers expert-led cybersecurity training to help organisations safeguard their operations and data.

LockLizard

LockLizard

Locklizard provides PDF DRM software that protects PDF documents from unauthorized access and misuse. Share and sell documents securely - prevent document leakage, sharing and piracy.

NordLayer

NordLayer

NordLayer is an adaptive network access security solution for modern businesses — from the world’s most trusted cybersecurity brand, Nord Security. 

SecDev

SecDev

SecDev is a consulting firm working at the intersection of geopolitical, digital, urban, energy and cyber risk.

ACI Solutions

ACI Solutions

ACI Solutions is a managed IT services and network security provider working with diverse global commercial, government and public sector clients.

Auth0

Auth0

Auth0 is a cloud service that provides a set of unified APIs and tools that instantly enables single sign-on and user management for any application, API or IoT device.

Cryptika

Cryptika

Cryptika is a fully integrated IT security and managed services provider, specialized in Next-Generation Cyber Security Technologies.

PixelPlex

PixelPlex

PixelPlex is a blockchain and custom software development company with offices and developers in New York, Geneva, and Seoul.

Towerwall

Towerwall

Towerwall offers a comprehensive suite of security services and solutions using best-of-breed tools and information security services.

Barikat Cyber Security

Barikat Cyber Security

Barikat is a provider of information security solution and services including security analysis and compliance, security testing, managed security services, incident response and training.

Help AG

Help AG

Help AG provides leading enterprise businesses and governments across the Middle East with strategic consultancy combined with tailored information security solutions and services.

Trapp Technology

Trapp Technology

Trapp Technology combines the very best cloud, Internet, IT managed services, and IT consulting to provide a true all-in-one IT solution for small to mid-sized businesses.

DeNexus

DeNexus

DeNexus is the leading provider of cyber risk modeling for industrial networks. Our Mission is to build the Global Standard for Industrial Cyber Risk Quantification.

NACVIEW

NACVIEW

NACVIEW is a Network Access Control solution. It allows to control endpoints and identities that try to access the network - wired and wireless, including VPN connections.

Olympix

Olympix

Dev-first Web3 security that starts at the source. Olympix is a pioneering DevSecOps tool that puts security in the hands of the developer by proactively securing code from day one.

Myrror Security

Myrror Security

Myrror Security is a software supply chain security solution that aids lean security teams in safeguarding their software against breaches.

Panoptic Cyber

Panoptic Cyber

Panoptic Cyber are a team of elite Armed Forces Veterans who hold a wealth of experience in Information Security, Cyber Security, Data Protection and Risk Management.

Hunt & Hackett

Hunt & Hackett

Hunt & Hackett helps European companies prevent, detect and respond to today’s most advanced adversaries, safeguarding them against cyberthreats and espionage.

XY Cyber

XY Cyber

XY Cyber enable Generative AI for Cyber Operations. We simplify the complex world of cyber threats into actionable strategies, empowering your defense with AI-powered solutions.