Failure Happens, But Recovery Can Be Managed Intelligently
The massive power outage across Spain and Portugal in April was a reminder that technology failure can happen anywhere and in a variety of ways. Digital services are fully integrated into modern-day life, and the complicated, interconnected operational technologies behind them are prone to failure.
A single failure can set off a chain of issues that takes significant time to remedy. If one link in the chain breaks, customers, employees, and vendors can feel the sting.
We rely on technology for payment transactions, energy supply, digital banking, live customer support and a range of online services. But this technology is complex, interdependent, and often reliant on open source dependencies and third-party services, which inevitably creates risk.
Following last July’s global IT outage, our research showed that the majority (86%) of executives now realise that they’ve been prioritising security at the expense of readiness for service disruptions. Being caught off guard is a mistake no organisation can afford, and just one outage can create widespread disruption. Organisations looking to be on the front foot with IT operations need to first gain visibility of all digital processes, then bring them under control, and then pursue strategic automation. By modernising operations through testing, streamlined processes, and automation, leaders can work to reinforce both physical and digital infrastructure.
Downtime is not tied to any one region, industry or company, and it’s not a matter of ‘if’ you will experience a failure of some degree, but ‘when’. And, in many cases, it can take years to recover the costs of a major breach.
Within very recent memory we’ve seen a global IT outage resulting from a simple developer error, cyberattacks on UK supermarkets, and a government report into the IT failings of the UK banking sector. Problems happen to all organisations, so preparing for them isn’t a sign of weakness; it’s the only responsible way to show strength. Security in particular is not only about prevention, but also about response, resolution and agility.
So You’ve Managed An Outage. What Comes Next?
After services have been restored, with the business running and customers transacting and interacting again, there’s a chain of actions that mature organisations take to review the situation and improve resilience.
- Firstly, IT leaders must collaborate to revisit and assess how the organisation benchmarks on resilience across every department affected. For example, a common issue is for companies to find themselves improving legacy infrastructure and still falling behind where they want to be. If that rings true, assess whether you are investing in modernisation and redundancy systems that address all of today's likely risks - not just the ones that existed when the legacy technology was first introduced. Additionally, the business must ensure its workflow and incident management is robust, allowing for multiple notification methods so that systems can be managed and recovered no matter when, where, and how teams find they need to connect.
- Secondly, review and update crisis management protocols. Establish a clear chain of command for emergencies, and develop and regularly update emergency response plans. People and roles, subscriptions and passwords, and compliance requirements all change. So, check whether your governance still allows responsible employees to act and resolve issues, including sending status notifications to customers and stakeholders.
- Thirdly, as a follow-on, implement and practice your communication strategy. Maintain transparent communication with stakeholders and provide regular status updates to maintain public trust. Major organisations are scrutinised by the media, so it’s important to know what you can and can’t say, and what your required levels of confidence are in the information you share.
- Fourthly, conduct a thorough post-incident analysis. There are best practices to follow that ensure this runs well, as incomplete information or unclear thinking and processes can severely reduce the effectiveness of this vital step. Business leaders need their teams to learn from outages and mistakes so that the organisation comes back stronger.
Without exaggeration, outages can be a powerful tool for learning and growth. Take the time to really understand what happened, how it was handled and what can be learned from the experience.
Do so humbly and without blame, encouraging curiosity and structured, logical thinking that traces causes and effects from the first order through to the second and third.
- Finally, embed artificial intelligence to enhance speed and productivity. Not all AI tools are the same: there are safer models that are less prone to hallucination than LLMs, though LLMs can be used productively for summarisation, helping people catch up on context and next best actions. That is useful for teams in the thick of an incident or wading through many logs and reporting documents.
What’s more exciting and less risky is to deploy agentic capabilities in operational areas such as SRE Ops, giving teams the ability to focus on the novel and not-yet-understood challenges that affect operations.
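The first two steps above, multiple notification methods and a clear chain of command, can be illustrated with a minimal Python sketch. It is not any vendor's API: the `Responder` type, the `notify` and `escalate` functions, and the stub channels are all hypothetical names invented for this example, and a real deployment would call actual SMS, push, or telephony services.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Hypothetical: a channel is any callable that tries to reach a person
# and returns True if the alert was acknowledged.
Channel = Callable[[str, str], bool]


@dataclass
class Responder:
    name: str
    channels: List[Channel] = field(default_factory=list)


def notify(responder: Responder, message: str) -> bool:
    """Try each of the responder's channels in order until one succeeds."""
    for channel in responder.channels:
        try:
            if channel(responder.name, message):
                return True
        except Exception:
            # A broken channel must not stop the remaining ones being tried.
            continue
    return False


def escalate(chain: List[Responder], message: str) -> Optional[str]:
    """Walk the chain of command; return the name of the first person reached."""
    for responder in chain:
        if notify(responder, message):
            return responder.name
    return None


# Stub channels standing in for real services (assumptions for the demo).
def sms_down(name: str, message: str) -> bool:
    raise ConnectionError("SMS gateway unreachable")


def push_unanswered(name: str, message: str) -> bool:
    return False


def phone_call(name: str, message: str) -> bool:
    return True


chain = [
    Responder("on-call engineer", [sms_down, push_unanswered]),
    Responder("incident commander", [sms_down, phone_call]),
]
reached = escalate(chain, "Payments API is returning errors")
print(reached)  # prints "incident commander"
```

The point of the design is that no single failed channel or unreachable person ends the alert: each responder has several ways to be reached, and the chain moves on to the next role when all of them fail.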
Putting it all together, better resilience and response to disruption depend on critical thinking, culture and practices in roughly equal measure to technology. And now, the ability to control and deploy AI without opening the business to downstream risk is another piece of that critical thinking puzzle. The aim is to achieve all of that while still using the power of AI and automation to offer protocols that empower employees rather than stifle innovation.
Building enough trust in these solutions for your teams to use them to fly safely is key, and it looks different for every organisation.
Eduardo Crespo is VP EMEA at PagerDuty