Failure Happens, But Recovery Can Be Managed Intelligently

The massive power outage across Spain and Portugal in April was a reminder that tech failure can happen anywhere and in a variety of ways. Digital services are fully integrated with modern-day life, and the complicated, interconnected operational technologies behind them are prone to failure.

A single failure can set off a chain of issues that takes significant time to remedy. If one link in the chain breaks, customers, employees, and vendors can feel the sting.

We rely on technology for payment transactions, energy supply, digital banking, live customer support and a range of online services. But this technology is complex, interdependent, and often reliant on open-source dependencies and third-party services, inevitably creating risk.

Following last July’s global IT outage, our research showed that the majority (86%) of executives now realise that they’ve been prioritising security at the expense of readiness for service disruptions. Being caught off guard is a mistake no organisation can afford, as a single outage can create widespread disruption. Organisations looking to be on the front foot with IT operations need to first gain visibility of all digital processes, then control them, and then pursue strategic automation. By modernising operations through testing, streamlined processes, and automation, leaders can work to reinforce both physical and digital infrastructure.

Downtime is not tied to any one region, industry or company, and it’s not a matter of ‘if’ you will experience a failure of some degree, but ‘when’. And, in many cases, it can take years to recover from the costs of a major breach.

Within very recent memory we’ve seen a global IT outage resulting from a simple developer error, cyberattacks on UK supermarkets, and a governmental report into the IT failings of the UK banking sector. Problems happen to all organisations, so preparing for them isn’t a sign of weakness; it’s the only responsible way to show strength. Security in particular is not only about prevention, but also about response, resolution and agility.

So You’ve Managed An Outage. What Comes Next?

After the immediate restoration of services, with the business running and customers transacting and interacting again, there’s a chain of actions mature organisations take to review the situation and improve resilience.

  • Firstly, IT leaders must collaborate to revisit and assess how the organisation benchmarks on resilience across every department affected. For example, a common issue is for companies to find themselves improving legacy infrastructure and still falling behind where they want to be. If that rings true, assess whether you are investing in modernisation and redundancy systems that meet all likely risks today, not just the ones that existed when the legacy technology was first introduced. Additionally, the business must ensure its workflow and incident management is robust, allowing for multiple notification methods so that systems can be managed and recovered no matter when, where, and how teams find they need to connect.
  • Secondly, review and update crisis management protocols. Establish a clear chain of command during emergencies, and develop and regularly update emergency response plans. People and roles, subscriptions and passwords, and compliance requirements all change. So, revisit whether your governance allows responsible employees to act and resolve issues, including sending status notifications to customers and stakeholders.
  • Thirdly, as a follow-on, implement and practise your communication strategy. Communicate transparently with stakeholders and provide regular status updates to maintain public trust. Major organisations are scrutinised by the media, so it’s important to know what you can and can’t say, and what your required levels of confidence are in the information you share.
  • Fourthly, conduct a thorough post-incident analysis. There are best practices to follow that ensure this runs well, as incomplete information or unclear thinking and processes can severely reduce the effectiveness of this vital step. Business leaders need their teams to learn from outages and mistakes so that they come back stronger.

Without exaggeration, outages can be a powerful tool for learning and growth. Take the time to really understand what happened, how it was handled and what can be learned from the experience.

Do so humbly, without blame, encouraging curiosity and structured, logical thinking that traces causes and effects from the first order to the second and third.

  • Finally, embed artificial intelligence to enhance speed and productivity. Not all AI tools are the same: there are safer models, less prone to hallucination, than LLMs, though LLMs can be used productively for summarisation, helping people catch up on context and next best actions. That is useful for teams in the thick of an incident or wading through many logs and reporting documents.

What’s more exciting and less risky is to deploy agentic capabilities for operations like SRE Ops, to give teams the ability to focus on the novel and not yet understood challenges that affect operations.
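As one illustration of the first step's point about allowing for multiple notification methods, the idea can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a description of any particular product: the `Channel` class, the `notify_with_fallback` function and the stand-in email, SMS and phone senders are all hypothetical names invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Channel:
    """A hypothetical notification channel (email, SMS, phone call...)."""
    name: str
    send: Callable[[str], bool]  # returns True when the message is delivered


def notify_with_fallback(message: str, channels: List[Channel]) -> Optional[str]:
    """Try each channel in order; return the name of the first that delivers."""
    for channel in channels:
        try:
            if channel.send(message):
                return channel.name
        except Exception:
            # A failing channel must not block the remaining ones.
            continue
    return None  # nobody was reached: escalate by other means


# Stand-in senders for the sketch; real ones would call a mail or SMS service.
def email_down(message: str) -> bool:
    raise ConnectionError("mail relay unreachable")


def sms_ok(message: str) -> bool:
    return True


def phone_ok(message: str) -> bool:
    return True


channels = [
    Channel("email", email_down),
    Channel("sms", sms_ok),
    Channel("phone", phone_ok),
]
delivered_via = notify_with_fallback("Payments API degraded", channels)
```

Here the email channel is deliberately broken, so the alert falls through to SMS. The design choice worth noting is that every channel failure is contained, so the outage that took down one route cannot also silence the responders.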

Putting it all together, better resilience and response to disruption depend on critical thinking, culture and practice in roughly equal measure to technology. And now, the ability to control and deploy AI without opening the business to downstream risk is another piece of that critical thinking puzzle. The goal is to achieve all of that while still using the power of AI and automation to offer protocols that empower employees rather than stifle innovation.

Building enough trust in these solutions that your teams can use them to fly safely is key, and it looks different for every organisation.

Eduardo Crespo is VP EMEA at PagerDuty

