The Cause Of Amazon’s Cloud Outage

Amazon Web Services (AWS) has explained the cause of its outage, which took down thousands of third-party online services for hours. Amazon says that “the new capacity had caused all of the servers in the fleet to exceed the maximum number of threads allowed by an operating system configuration... As this limit was being exceeded, cache construction was failing to complete and front-end servers were ending up with useless shard-maps that left them unable to route requests to back-end clusters.”

While dozens of services were affected, AWS says the outage occurred in its Northern Virginia (US-East-1) region. It happened after a "small addition of capacity" to its front-end fleet of Kinesis servers.

Amazon Kinesis enables real-time processing of streaming data. Developers use it directly to capture data and video streams and run them through AWS machine-learning platforms. Several other AWS services, including CloudWatch and Cognito authentication, also depend on Kinesis, and these services were affected during the outage as well.

The Kinesis service's front-end handles authentication and throttling, and distributes workloads to its back-end "workhorse" cluster via a database mechanism called sharding.

Amazon’s addition of capacity triggered the outage but wasn't its root cause. AWS was adding capacity for an hour after 2:44am PST, and after that all the servers in the Kinesis front-end fleet began to exceed the maximum number of threads allowed by their current operating system configuration. The first alarm was triggered at 5:15am PST and AWS engineers spent the next five hours trying to resolve the issue. Kinesis was fully restored at 10:23pm PST.

Amazon explains how the front-end servers distribute data across its Kinesis back-end: "Each server in the front-end fleet maintains a cache of information, including membership details and shard ownership for the back-end clusters, called a shard-map." According to AWS, that information is obtained through calls to a microservice vending the membership information, retrieval of configuration information from DynamoDB and continuous processing of messages from other Kinesis front-end servers. For Kinesis communication, each front-end server creates operating system threads for each of the other servers in the front-end fleet. Upon any addition of capacity, the servers that are already operating members of the fleet will learn of new servers joining and establish the appropriate threads. It takes up to an hour for any existing front-end fleet member to learn of new participants.
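The thread-per-peer design described above means every server's thread count grows with fleet size, so all servers approach the operating system limit together. The following is a minimal sketch of that arithmetic in Python. The function names and all of the numbers (fleet sizes, baseline thread count, thread limit) are hypothetical and chosen for illustration only; AWS has not published the real figures.

```python
def peer_threads_per_server(fleet_size: int) -> int:
    """Threads one front-end server needs for peer communication:
    one for each of the other servers in the fleet."""
    if fleet_size < 1:
        raise ValueError("fleet_size must be at least 1")
    return fleet_size - 1


def exceeds_thread_limit(fleet_size: int, baseline_threads: int, thread_limit: int) -> bool:
    """True if a server's peer threads plus its other threads pass the limit."""
    return baseline_threads + peer_threads_per_server(fleet_size) > thread_limit


# Hypothetical values, not taken from AWS's report.
THREAD_LIMIT = 10_000   # assumed per-server limit from the OS configuration
BASELINE = 2_000        # assumed threads each server uses for everything else

# Before the capacity addition: 2,000 + 7,899 threads, under the limit.
before = exceeds_thread_limit(7_900, BASELINE, THREAD_LIMIT)

# After a small addition: 2,000 + 8,099 threads, over the limit.
after = exceeds_thread_limit(8_100, BASELINE, THREAD_LIMIT)

print(before, after)
```

Because every server holds a thread for every other server, a small addition pushes the whole fleet over the limit at roughly the same time rather than only the new machines.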

As the number of threads exceeded the limit set in the OS configuration, the front-end servers ended up with "useless shard-maps" and were unable to route requests to Kinesis back-end clusters. AWS had already rolled back the additional capacity that triggered the event but had reservations about boosting the thread limit in case it delayed the recovery.
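To illustrate why an incomplete shard-map stops routing, here is a minimal sketch in Python of a front-end lookup. It is not AWS's implementation: the shard count, the back-end names, the hashing scheme and the functions shard_for and route are all assumptions made for this example.

```python
import hashlib
from typing import Dict

NUM_SHARDS = 4  # hypothetical shard count for the example


def shard_for(partition_key: str) -> int:
    """Map a key to a shard number with a stable hash (illustrative scheme)."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


def route(partition_key: str, shard_map: Dict[int, str]) -> str:
    """Return the back-end that owns the key's shard, per the cached shard-map."""
    shard = shard_for(partition_key)
    if shard not in shard_map:
        # A cache that never finished building cannot answer this lookup.
        raise LookupError(f"no back-end owner known for shard {shard}")
    return shard_map[shard]


# A fully built cache: every shard has a known owner (hypothetical names).
complete_map = {0: "backend-a", 1: "backend-b", 2: "backend-c", 3: "backend-d"}

# A cache whose construction failed part-way: no ownership data at all.
incomplete_map: Dict[int, str] = {}

print(route("order-42", complete_map))
try:
    route("order-42", incomplete_map)
except LookupError as err:
    print("routing failed:", err)
```

In this sketch the front-end itself is running and reachable, yet it cannot forward a single request, which matches the failure mode AWS describes.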

As a first step, AWS has moved to servers with more CPU and memory, reducing the total number of servers and, with it, the number of threads each server needs to communicate across the fleet. It's also testing an increase in thread count limits in its operating system configuration and working to "radically improve the cold-start time for the front-end fleet".

CloudWatch and other large AWS services will move to a separate, partitioned front-end fleet. AWS is also working on a broader project to stop failures in one service from affecting other services.

AWS has also acknowledged the delays in updating its Service Health Dashboard during the incident, but says that was because the tool its support engineers use to update the public dashboard was affected by the outage. During that time, it was updating customers via the Personal Health Dashboard.   Amazon has apologised for the impact this event caused its customers.

Sources: Amazon, Down Detector, ZDNet
