The Cause Of Amazon’s Cloud Outage

Amazon Web Services (AWS) has explained the cause of their outage, which took down thousands of third-party online services for hours. Amazon say that, “the new capacity had caused all of the servers in the fleet to exceed the maximum number of threads allowed by an operating system configuration... As this limit was being exceeded, cache construction was failing to complete and front-end servers were ending up with useless shard-maps that left them unable to route requests to back-end clusters.” 

While dozens of services were affected, AWS says the outage occurred in its Northern Virginia, US-East-1, region. It happened after a "small addition of capacity" to its front-end fleet of Kinesis servers. 

Amazon Kinesis enables real-time processing of streaming data. In addition to its direct use by customers, Kinesis is used by several other AWS services and these services also saw impact during the shutdown. Kinesis is used by developers, as well as other AWS services like CloudWatch and Cognito authentication, to capture data and video streams and run them through AWS machine-learning platforms.  

The Kinesis service's front-end handles authentication, throttling, and distributes workloads to its back-end "workhorse" cluster via a database mechanism called sharding.  

Amazon’s additions to capacity triggered the outage but wasn't the root cause of it. AWS was adding capacity for an hour after 2:44am PST, and after that all the servers in Kinesis front-end fleet began to exceed the maximum number of threads allowed by its current operating system configuration.  The first alarm was triggered at 5:15am PST and AWS engineers spent the next five hours trying to resolve the issue. Kinesis was fully restored at 10:23pm PST. 

Amazon explains how the front-end servers distribute data across its Kinesis back-end: "Each server in the front-end fleet maintains a cache of information, including membership details and shard ownership for the back-end clusters, called a shard-map." According to AWS, that information is obtained through calls to a micro service vending the membership information, retrieval of configuration information from DynamoDB and continuous processing of messages from other Kinesis front-end servers. For Kinesis communication, each front-end server creates operating system threads for each of the other servers in the front-end fleet. Upon any addition of capacity, the servers that are already operating members of the fleet will learn of new servers joining and establish the appropriate threads. It takes up to an hour for any existing front-end fleet member to learn of new participants." 

As the number of threads exceeded the OS configuration, the front-end servers ended up with "useless shard-maps" and were unable to route requests to Kinesis back-end clusters. AWS had already rolled back the additional capacity that triggered the event but had reservations about boosting the thread limit in case it delayed the recovery.  

As a first step, AWS has moved to larger CPU and memory servers, as well as reduced the total number of servers and threads required by each server to communicate across the fleet.  It's also testing an increase in thread count limits in its operating system configuration and working to "radically improve the cold-start time for the front-end fleet".  

CloudWatch and other large AWS services will move to a separate, partitioned front-end fleet. AWS is also working on a broader project to isolate failures in one service from affecting other services.  

AWS has also acknowledged the delays in updating its Service Health Dashboard during the incident, but says that was because the tool its support engineers use to update the public dashboard was affected by the outage. During that time, it was updating customers via the Personal Health Dashboard.   Amazon has apologised for the impact this event caused its customers.

Amazon:        Down Detector:       ZDNet

You Might Also Read:

The Risks &  Benefits Of Cloud Security:

 

« We Live In A Transient Internet
Orca Security Wants To Streamline Cloud Computing »

Infosecurity Europe
CyberSecurity Jobsite
Perimeter 81

Directory of Suppliers

The PC Support Group

The PC Support Group

A partnership with The PC Support Group delivers improved productivity, reduced costs and protects your business through exceptional IT, telecoms and cybersecurity services.

Clayden Law

Clayden Law

Clayden Law advise global businesses that buy and sell technology products and services. We are experts in information technology, data privacy and cybersecurity law.

XYPRO Technology

XYPRO Technology

XYPRO is the market leader in HPE Non-Stop Security, Risk Management and Compliance.

ManageEngine

ManageEngine

As the IT management division of Zoho Corporation, ManageEngine prioritizes flexible solutions that work for all businesses, regardless of size or budget.

Alvacomm

Alvacomm

Alvacomm offers holistic VIP cybersecurity services, providing comprehensive protection against cyber threats. Our solutions include risk assessment, threat detection, incident response.

Dark Reading

Dark Reading

Dark Reading is the most trusted online community for security professionals.

Japan Information Security Audit Association (JASA)

Japan Information Security Audit Association (JASA)

JASA is non-profit association active in developing and managing the quality of Information Security Auditing and Auditors in Japan.

Grupo CFI

Grupo CFI

Grupo CFI is the largest Spanish network of data protection and cybersecurity professionals.

Ensurity Technologies

Ensurity Technologies

Ensurity is a deep-tech cybersecurity engineering company; designs and manufactures specialized secure hardware, software, and mobile application solutions.

Switchfast Technologies

Switchfast Technologies

Switchfast Technologies is an IT consulting and managed services provider, offering IT support and consulting to Chicagoland small businesses.

Soteria

Soteria

Soteria is a global leader in the development, integration and implementation of advanced cyber security, intelligence and IT solutions, delivering complete end-to-end solutions.

Across Verticals

Across Verticals

Across Verticals is a boutique cyber security consulting firm that specializes in holistic, deeply technical and end to end cyber security advisory services based on industry best practices.

META-Cyber

META-Cyber

META-cyber was founded by engineers with experience in process and control-protection to provide cyber security for industrial infrastructure.

Bluewave

Bluewave

Bluewave are a strategic IT advisory company that offers businesses a simple and comprehensive way to purchase information technology solutions.

US Department of State - Bureau of Cyberspace & Digital Policy

US Department of State - Bureau of Cyberspace & Digital Policy

The Bureau of Cyberspace and Digital Policy leads and coordinates the Department’s work on cyberspace and digital diplomacy to encourage responsible state behavior in cyberspace.

ANY.RUN

ANY.RUN

ANY.RUN is an interactive online malware analysis service created for dynamic as well as static research of multiple types of cyber threats.

AgilePQ

AgilePQ

AgilePQ visibly secures IoT devices worldwide to protect the privacy, safety, and well-being of all people.

CYTUR

CYTUR

CYTUR provide trusted and secured maritime cybersecurity solutions to keep ships safe, protecting them, their crews, cargo and all stakeholders from maritime cyber threats.

HEAL Security

HEAL Security

HEAL Security is the global authority for cybersecurity data, research and insights across the healthcare sector.

ThreatCaptain

ThreatCaptain

ThreatCaptain is a Cybersecurity Leadership Development Company driven to enhance and illuminate cybersecurity risk through strategic alignment and informed business decision-making.

Black Belt Secure

Black Belt Secure

We provide critical cybersecurity services such as managed security, ransomware mitigation, penetration testing, system auditing and compliance services to your organization.