

A new structure for resilience

Resilience should not be a specialist concern, says Flick March, UKI Security and Resiliency Practice Leader at Kyndryl, an IT service management company.

From a top-down perspective, a business’s resilience strategy can look relatively straightforward in concept, even if the difficulty of actually executing it is well understood. The thought process involved often starts with the fact that, thanks to digital commerce, companies of all stripes now need to run 24/7 services, and so potential Achilles’ heels in the IT infrastructure need to be identified and addressed. The costs of failing to do so are witnessed in the headlines on a regular basis – when banks, supermarkets, and other major organisations attract the ire of their customers by slipping out of availability – such that executives are keenly aware of the relationship between resilient systems and business success.

So, key metrics are established with an eye to making sure that data is always available, secure, and usable; everything from core databases to customer service operations requires a contingency plan; and security and uptime become requirements baked into procurement processes. In short, the response to the risk of disaster is to make that risk explicit and so empower teams to respond to it.

However, while the decision to target resilience in this way is understandable and even laudable, it misses much of the nuance in how these systems actually work, and can ultimately be counterproductive. To understand why, we need to think briefly about the history of IT as a discipline.

The rise of the specialist

Looking back now, it’s impressive how deeply and quickly IT has specialised as a professional field. A generation or two ago, information technology barely existed as a career option; today, not only is it one of the world’s major fields of employment, but workers within it are called on to take up incredibly specific roles and skills to meet the demand. No matter how minor and integrated a technology might seem, there are whole teams for whom it is a full-time concern.

It’s hardly surprising, then, that words and ideas take on different and specific meanings in different areas of the profession. Usability means something different to a support desk worker than it does to a data engineer; speed is measured differently in an office block than in a high-performance computing centre; and security has different implications for a networking specialist than for a cryptographer. And yet, these are all requirements which can come down from a business strategy perspective as though they are identical and interchangeable.

Thinking again about that straightforward conceptualisation of resilience, then, the problem quickly becomes apparent. In this model, when attacks or failures happen, an IT professional’s first reaction is not to ask how the incident affects the business, but how it relates to their own definition of resilience, against which their performance is measured. Rather than identifying the root of the problem, teams attend to their immediate area, whether that be restoring database uptime, keeping order fulfilment going, protecting internal lines of communication, or something else.

While this kind of specific responsibility-taking has its upside in emergency situations, it also means that following the thread of a problem through different areas of IT with different internal metrics takes time – and taking more time, of course, means suffering greater reputational and financial damage.

This should not, to be clear, be taken as a criticism of the diligence of IT workers or a call to turn back the clock on IT specialisation; the issue is a broader, structural one. As IT professionals have specialised over the decades, so has the way in which we sell, buy, and connect IT infrastructure. In order to improve manageability, budgets are allocated, and vendors sell their offerings, in terms of segmented specialisms.

This comes at the expense of a cohesive view of the business impact of failures because it encourages work to focus on local metrics like service-level agreements or objectives and key results, rather than focusing on the reasons why the technology is needed in the first place. To put it in a plain example, the problem with losing contact with a data centre is not the inability to access and update information – it is that customers cannot place orders.

What we need to do, therefore, is rethink what preparing for and measuring resilience looks like within these disciplines. SLAs are attractive because they are native to the technology they measure, but they do not necessarily correspond to real-world consequences. Instead, we need to start thinking back from minimum viable business function requirements and asking how the whole IT estate, together, can work to support them.

There are many facets to that change, from understanding (as many businesses do not) which systems comprise the minimum viable organisation, to testing (as many businesses should) how a single point of failure might have consequences across the organisation. At the highest level, though, it comes down to making sure that there is a generalist, holistic view of resilience in place which can think in terms of what will actually matter to the business, not just in terms of resilience as an abstract ideal.
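As a rough illustration of what testing a single point of failure against business functions could look like, the sketch below traces a failed component up through a dependency map to the business functions it would take down. Every name in it (the components, the business functions, and the `impacted_functions` helper) is hypothetical and chosen only for illustration, and it makes the simplifying assumption that a function fails whenever anything it depends on fails, with no redundancy or failover.

```python
from collections import deque

# Hypothetical dependency map: each item lists what it directly depends on.
# None of these names come from a real estate; they are placeholders.
DEPENDS_ON = {
    "place_orders": ["web_storefront", "order_database"],
    "customer_support": ["ticketing_system"],
    "web_storefront": ["primary_data_centre"],
    "order_database": ["primary_data_centre"],
    "ticketing_system": ["saas_provider"],
}

# Hypothetical "minimum viable" business functions we care about.
BUSINESS_FUNCTIONS = {"place_orders", "customer_support"}


def impacted_functions(failed_component):
    """Return the business functions that depend, directly or
    indirectly, on the failed component (assuming no redundancy)."""
    # Invert the map so we can walk from a component to its dependents.
    dependents = {}
    for item, deps in DEPENDS_ON.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(item)

    # Breadth-first walk upwards from the failed component.
    seen = {failed_component}
    queue = deque([failed_component])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)

    return sorted(seen & BUSINESS_FUNCTIONS)


print(impacted_functions("primary_data_centre"))
```

In this toy map, losing the data centre surfaces as "customers cannot place orders" rather than as a database uptime figure, which is the shift in perspective argued for above. A real estate would need redundancy, partial degradation, and recovery times modelled as well.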

