The Cause Of Amazon’s Cloud Outage

Amazon Web Services (AWS) has explained the cause of its outage, which took down thousands of third-party online services for hours. Amazon says that “the new capacity had caused all of the servers in the fleet to exceed the maximum number of threads allowed by an operating system configuration... As this limit was being exceeded, cache construction was failing to complete and front-end servers were ending up with useless shard-maps that left them unable to route requests to back-end clusters.”

Although dozens of services were affected, AWS says the outage occurred in its Northern Virginia (US-East-1) region. It happened after a "small addition of capacity" to its front-end fleet of Kinesis servers.

Amazon Kinesis enables real-time processing of streaming data. Developers use it directly to capture data and video streams and run them through AWS machine-learning platforms, and several other AWS services, including CloudWatch and Cognito authentication, depend on it. Those services were also affected during the outage.

The Kinesis service's front-end handles authentication and throttling, and distributes workloads to its back-end "workhorse" cluster via a database mechanism called sharding.
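To make the idea of sharding concrete, here is a minimal sketch of how a front-end server might use a shard-map to pick a back-end for a request. It is an illustration only, not AWS's implementation: the `SHARD_MAP` contents, the back-end names, and the functions `hash_key`, `owner_of`, and `route` are all hypothetical, and the use of an MD5 hash over a 128-bit key space is an assumption made for the example.

```python
import hashlib
from bisect import bisect_right

# Hypothetical shard-map: each entry is (first hash key owned, back-end name).
# Entries are sorted by their starting hash key and cover a 128-bit key space.
SHARD_MAP = [
    (0, "backend-a"),
    (2**126, "backend-b"),
    (2**127, "backend-c"),
]


def hash_key(partition_key: str) -> int:
    """Map a partition key to a 128-bit integer (MD5 is an assumption here)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def owner_of(key: int, shard_map=SHARD_MAP) -> str:
    """Return the back-end that owns the range containing `key`."""
    if not shard_map:
        # An empty or incomplete map leaves the front-end unable to route.
        raise LookupError("shard-map is empty; cannot route request")
    starts = [start for start, _ in shard_map]
    index = bisect_right(starts, key) - 1
    return shard_map[index][1]


def route(partition_key: str, shard_map=SHARD_MAP) -> str:
    """Pick the back-end for a request based on its partition key."""
    return owner_of(hash_key(partition_key), shard_map)
```

In this sketch, a front-end whose shard-map failed to build (modelled here as an empty list) raises an error for every request, which is loosely analogous to the "useless shard-maps" AWS describes.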

Amazon’s addition of capacity triggered the outage but wasn't its root cause. AWS was adding capacity for an hour after 2:44am PST, and after that all the servers in the Kinesis front-end fleet began to exceed the maximum number of threads allowed by the current operating system configuration. The first alarm was triggered at 5:15am PST and AWS engineers spent the next five hours trying to resolve the issue. Kinesis was fully restored at 10:23pm PST.

Amazon explains how the front-end servers distribute data across its Kinesis back-end: "Each server in the front-end fleet maintains a cache of information, including membership details and shard ownership for the back-end clusters, called a shard-map." According to AWS, that information is obtained through calls to a microservice vending the membership information, retrieval of configuration information from DynamoDB and continuous processing of messages from other Kinesis front-end servers. For Kinesis communication, each front-end server creates operating system threads for each of the other servers in the front-end fleet. AWS adds: "Upon any addition of capacity, the servers that are already operating members of the fleet will learn of new servers joining and establish the appropriate threads. It takes up to an hour for any existing front-end fleet member to learn of new participants."
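The thread-per-peer design described above means the thread count on every server grows with the size of the fleet, so a small capacity addition can push all servers over a fixed limit at roughly the same time. The sketch below illustrates that arithmetic. The function names and all the numbers (`THREAD_LIMIT`, `BASELINE`, the fleet sizes) are invented for illustration; AWS has not published the actual figures.

```python
def peer_threads(fleet_size: int) -> int:
    """Threads one server needs to talk to every other server in the fleet."""
    return max(fleet_size - 1, 0)


def exceeds_limit(fleet_size: int, baseline_threads: int, thread_limit: int) -> bool:
    """True if a server's total thread count would pass the OS limit."""
    return baseline_threads + peer_threads(fleet_size) > thread_limit


# Hypothetical values, chosen only to show the effect.
THREAD_LIMIT = 10_000  # assumed per-server OS thread limit
BASELINE = 500         # assumed threads a server uses for everything else

for fleet_size in (9_000, 9_500, 9_600):
    status = "over limit" if exceeds_limit(fleet_size, BASELINE, THREAD_LIMIT) else "ok"
    print(f"fleet of {fleet_size}: {status}")
```

Because every server needs one thread per peer, the per-server count is the same across the fleet. In this model a modest addition (from 9,500 to 9,600 servers) takes every server from just under the limit to over it.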

As the number of threads exceeded the limit set in the OS configuration, the front-end servers ended up with "useless shard-maps" and were unable to route requests to Kinesis back-end clusters. AWS had already rolled back the additional capacity that triggered the event but had reservations about raising the thread limit in case it delayed the recovery.

As a first step, AWS has moved to servers with more CPU and memory, which reduces the total number of servers and therefore the number of threads each server needs to communicate across the fleet. It's also testing an increase in thread count limits in its operating system configuration and working to "radically improve the cold-start time for the front-end fleet".
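The effect of moving to fewer, larger servers can be shown with some simple arithmetic. This is a rough sketch under stated assumptions: capacity is measured in abstract units, a larger server is assumed to replace several smaller ones exactly, and the figures (8,000 units, servers of size 1 and 4) are hypothetical rather than taken from AWS.

```python
def fleet_size(total_capacity: int, capacity_per_server: int) -> int:
    """Servers needed to provide `total_capacity`, rounded up."""
    return -(-total_capacity // capacity_per_server)  # integer ceiling division


def peer_threads(fleet: int) -> int:
    """Threads one server needs to talk to every other server in the fleet."""
    return max(fleet - 1, 0)


TOTAL_CAPACITY = 8_000  # hypothetical capacity units the fleet must serve

for per_server in (1, 4):
    fleet = fleet_size(TOTAL_CAPACITY, per_server)
    print(
        f"server size {per_server}: {fleet} servers, "
        f"{peer_threads(fleet)} peer threads each"
    )
```

Under these assumptions, servers four times the size cut the fleet from 8,000 to 2,000 machines, and the peer threads each one must hold from 7,999 to 1,999, leaving more headroom below any fixed thread limit.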

CloudWatch and other large AWS services will move to a separate, partitioned front-end fleet. AWS is also working on a broader project to prevent failures in one service from affecting other services.

AWS has also acknowledged the delays in updating its Service Health Dashboard during the incident, but says that was because the tool its support engineers use to update the public dashboard was itself affected by the outage. During that time, it was updating customers via the Personal Health Dashboard. Amazon has apologised for the impact this event had on its customers.

Sources: Amazon, Down Detector, ZDNet
