Datacentre outages are happening less often and becoming less severe


The frequency and severity of datacentre-related outages are going down, according to resiliency think-tank Uptime Institute.

The organisation’s sixth annual outage analysis report says the number of datacentre-related crashes has been declining for several years, despite marked increases in the number of facilities in operation across the world.

“More than half (55%) of operator respondents to the 2023 Uptime Institute datacentre survey reported having an outage in the past three years – down from 60% in 2022 and 69% in 2021,” the report said.

“At the same time, only one in 10 outages in 2023 was categorised as either serious or severe. This is an improvement of four percentage points from the 2022 response and an improvement of 10 percentage points compared with 2021.”

The findings draw on publicly available outage data, responses to Uptime’s annual global survey of datacentre managers and its resiliency survey, and the anonymised accounts of its members and partners.

The report acknowledges that each of its data sources has limitations, due to differences in how companies and individuals define an outage, and the level of detail they record regarding the nature and duration of each event.

Even so, the data collectively indicates that datacentre service reliability is improving, despite operators facing challenges on multiple fronts that could jeopardise the uptime and availability of their sites.

“Decades of innovation, investment and improved management have significantly increased the reliability of critical IT systems, networks and datacentres,” the report said. “However, operators are also facing new challenges from increased demand, the adoption of software-based optimisation techniques and a growing number of cyber threats.”

It added: “Adverse weather events, which are increasing in both intensity and frequency due to climate change, and the use of more renewable energy in the power grid have added further risks.”

As for what is driving this apparent decline in datacentre crashes, Uptime attributed the trend to a “range of measures” that operators are taking.

“Greater investment, the combined effects of software-based resiliency and on-site physical redundancy, improved training, the outsourcing and greater professionalism of some third-party operators, and overall continuing vigilance [are all factors],” the report said.

However, Uptime sounded a note of caution about emerging factors that could reverse this trend, including increasing network and system complexity, and the growing adoption of distributed architectures.

“[This] aims to mitigate localised failures,” the report said. “However, Uptime data suggests this shift may play a role in the increase in network, software or system-related incidents.”

There is also an “ongoing challenge” around recruiting and training staff so human error-related outages can be reduced, and establishing “proven management processes” so downtime incidents can be avoided entirely.

According to Uptime, operators also need to be mindful of external risks, including the stability of energy grids and climate change. Although the report conceded there is little operators can do to tackle these threats directly, there are steps they can take to reduce their exposure.

“In summary, outage prevention requires ongoing vigilance and investment – and currently, the digital infrastructure industry is on an improving trajectory,” the report said.

“Robust datacentre design, detailed attention to IT architectures and topology, physical infrastructure redundancy, testing, improved training and continuous review will continue to be necessary if this is to be maintained.”


