The Cause Of Amazon’s Cloud Outage

Amazon Web Services (AWS) has explained the cause of its outage, which took down thousands of third-party online services for hours. Amazon says that “the new capacity had caused all of the servers in the fleet to exceed the maximum number of threads allowed by an operating system configuration... As this limit was being exceeded, cache construction was failing to complete and front-end servers were ending up with useless shard-maps that left them unable to route requests to back-end clusters.”

Although dozens of services were affected, AWS says the outage occurred in its Northern Virginia (US-East-1) region. It happened after a "small addition of capacity" to its front-end fleet of Kinesis servers.

Amazon Kinesis enables real-time processing of streaming data. Developers use it directly, and other AWS services, such as CloudWatch and Cognito authentication, rely on it to capture data and video streams and run them through AWS machine-learning platforms. Those dependent services were also affected during the outage.

The Kinesis service's front-end handles authentication and throttling, and distributes workloads to its back-end "workhorse" cluster via a database mechanism called sharding.
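To make the idea of sharding concrete, here is a minimal Python sketch of how a front-end might use a shard-map to pick a back-end owner for an incoming record. It is not AWS's implementation: the shard names, back-end names, hash ranges and the use of MD5 as the hash function are all assumptions chosen for illustration.

```python
import hashlib
from bisect import bisect_right

# Hypothetical shard-map: each shard owns a contiguous range of a 128-bit
# hash space and is served by one back-end node. All values are illustrative.
SHARD_MAP = [
    # (first hash key in range, shard id, back-end owner)
    (0, "shard-0", "backend-a"),
    (2**126, "shard-1", "backend-b"),
    (2**127, "shard-2", "backend-c"),
    (3 * 2**126, "shard-3", "backend-d"),
]
RANGE_STARTS = [start for start, _, _ in SHARD_MAP]


def hash_key(partition_key: str) -> int:
    """Map a partition key to a 128-bit integer (MD5 is an assumption here)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def route(partition_key: str) -> tuple:
    """Return (shard id, back-end owner) for the given partition key."""
    # Find the last range whose starting key is <= the record's hash key.
    index = bisect_right(RANGE_STARTS, hash_key(partition_key)) - 1
    _, shard_id, owner = SHARD_MAP[index]
    return shard_id, owner


if __name__ == "__main__":
    for key in ("user-1", "user-2", "sensor-42"):
        print(key, "->", route(key))
```

If the map is incomplete or stale, a lookup like this has nothing reliable to return, which is the situation the "useless shard-maps" in AWS's explanation describe.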

Amazon’s addition of capacity triggered the outage but was not its root cause. AWS added capacity for an hour from 2:44am PST, after which all the servers in the Kinesis front-end fleet began to exceed the maximum number of threads allowed by the current operating system configuration. The first alarm was triggered at 5:15am PST and AWS engineers spent the next five hours trying to resolve the issue. Kinesis was fully restored at 10:23pm PST.

Amazon explains how the front-end servers distribute data across its Kinesis back-end: "Each server in the front-end fleet maintains a cache of information, including membership details and shard ownership for the back-end clusters, called a shard-map." According to AWS, that information is obtained through calls to a microservice vending the membership information, retrieval of configuration information from DynamoDB and continuous processing of messages from other Kinesis front-end servers. For Kinesis communication, each front-end server creates operating system threads for each of the other servers in the front-end fleet. Upon any addition of capacity, the servers that are already operating members of the fleet will learn of new servers joining and establish the appropriate threads. It takes up to an hour for any existing front-end fleet member to learn of new participants.
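The arithmetic behind this design can be sketched in a few lines of Python. AWS has not published the fleet size or the thread limit, so the numbers below are hypothetical, as are the function names and the `baseline_threads` parameter (threads a server uses for everything other than peer communication).

```python
def threads_per_server(fleet_size: int, baseline_threads: int = 0) -> int:
    """Threads one front-end server needs: one per peer, plus a baseline."""
    return baseline_threads + (fleet_size - 1)


def exceeds_limit(fleet_size: int, thread_limit: int, baseline_threads: int = 0) -> bool:
    """True if each server in a fleet of this size would pass the OS limit."""
    return threads_per_server(fleet_size, baseline_threads) > thread_limit


def total_peer_threads(fleet_size: int) -> int:
    """Peer-communication threads across the whole fleet (grows quadratically)."""
    return fleet_size * (fleet_size - 1)


if __name__ == "__main__":
    # Hypothetical figures, not AWS's real values.
    THREAD_LIMIT = 10_000
    BASELINE = 500
    before, after = 9_400, 9_600  # fleet size before and after adding capacity

    for size in (before, after):
        print(
            f"fleet={size} threads/server={threads_per_server(size, BASELINE)} "
            f"over_limit={exceeds_limit(size, THREAD_LIMIT, BASELINE)}"
        )
```

Because every server keeps one thread per peer, all servers carry roughly the same thread count, so a small addition of capacity can push the whole fleet over the limit at about the same time rather than one server at a time.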

As the number of threads exceeded the limit set in the OS configuration, the front-end servers ended up with "useless shard-maps" and were unable to route requests to Kinesis back-end clusters. AWS had already rolled back the additional capacity that triggered the event but had reservations about boosting the thread limit in case it delayed the recovery.

As a first step, AWS has moved to servers with more CPU and memory, which reduces the total number of servers and therefore the number of threads each server needs to communicate across the fleet. It's also testing an increase in thread count limits in its operating system configuration and working to "radically improve the cold-start time for the front-end fleet".

CloudWatch and other large AWS services will move to a separate, partitioned front-end fleet. AWS is also working on a broader project to prevent failures in one service from affecting other services.

AWS has also acknowledged the delays in updating its Service Health Dashboard during the incident, but says that was because the tool its support engineers use to update the public dashboard was affected by the outage. During that time, it was updating customers via the Personal Health Dashboard. Amazon has apologised for the impact this event caused its customers.

Sources: Amazon, Down Detector, ZDNet
