Persistent challenges in adhering to established incident management processes pose a significant risk to organizations, amplifying potential downtime costs amidst a surge in service incidents, according to Transposit.
Although a majority of respondents have a defined incident management process in place (59.4%) and a level of automation that meets their needs (71.1%), organizations are grappling with a surge in service incidents and still struggle to resolve them quickly.
66.5% of organizations reported an increase in the frequency of service incidents that have affected their customers over the past 12 months, a 3.6% increase from the 2022 survey.
These downtime-producing incidents (e.g., application outages, service degradation) are putting organizations at risk of losing up to $499,999 per hour on average, according to 63% of respondents — a nearly 5% increase from 2022. 46.6% also said downtime can cost anywhere from $100,000 to $2 million.
Organizations find current incident management ineffective
Research points to generative AI as a means to resolve the incident management paradox: 84.5% of respondents either believe AI can significantly streamline their incident management processes and improve overall efficiency or are excited about the opportunities AI presents for automating certain aspects of incident management.
“The insights unearthed in our research underscore the pressing need for adaptive, LLM-based automation that transcends mere task repetition and, instead, dynamically adapts to evolving circumstances by assimilating cues and context in real-time,” said Divanny Lamas, CEO of Transposit.
“Traditional, rule-based automation tools are no longer sufficient for the demands of modern operations teams. Despite robust incident management processes within numerous organizations, the relentless surge in service incidents — with its consequential impact on customers and financial ramifications — mandates a transformative approach. The path forward lies in harnessing innovative solutions like generative AI, augmented by automation and guided by human judgment, to not only expedite incident resolution but also proactively detect and preempt potential issues before they escalate.”
In the domain of incident management, reliability engineering teams face significant hurdles. 73.9% of those responsible for reliability engineering experience challenges while trying to solve incidents, including brittle automation scripts (59.7%), too many manual processes (47.8%), and difficulty accessing specialized knowledge (47.2%).
Moreover, 42.5% of organizations said their current incident management process is not effective or is only being used by some team members due to confusing documentation (41.3%), limited access to tools (40.4%), and reliance on institutional knowledge (39.7%).
61.5% of organizations also cited an increase in the amount of time it takes to resolve incidents over the course of the last year, with 79.8% saying it takes up to six hours on average to resolve incidents from the first alert to mitigating the issue. Beyond the extended incident resolution time, there’s an added layer of complexity in assembling the right team members, as indicated by 71.3% who reported this process can take up to 30 minutes.
Adding to this, a significant portion of team members find it challenging to grasp and routinely apply the organization’s defined procedures. 37.4% of organizations report that only select team members have a comprehensive understanding of the defined incident management process and adhere to it consistently.
Automation hurdles add to service incident complexity
Organizations not only grapple with inefficiencies in incident resolution but also encounter hurdles in implementing automation. 33.3% of respondents said that only 11-25% of their incident management tasks or workflows are automated, showcasing an opportunity for more automation in organizations’ incident management processes.
Delving deeper, respondents expressed keen interest in automating pivotal aspects of the incident lifecycle, such as incident setup (50.0%), communication protocols (44.2%), investigative processes (30%), and remediation (29%).
Despite the interest in implementing automation, respondents cited these top four barriers to achieving it:
- There is not enough buy-in from leadership or management (57.1%)
- Not enough knowledge sharing (54.3%)
- Inadequate documentation of institutional knowledge and existing processes (54%)
- Lack of clarity about what to automate (52.4%)
Organizations are able to create automations more quickly when using SaaS tools. 74.6% of respondents have embraced SaaS tools, with 82.0% confirming their ability to create automations without coding. 84.3% reported spending just 11 minutes to an hour creating an automation, underscoring the efficiency of SaaS solutions in incident management.
Organizations enhance tech stack with AI-based applications and automation tools
Over the next 12 months, 72.1% of teams expect to expand their tech stack. To strengthen their incident management process and decrease mean time to resolution/repair (MTTR), organizations plan to implement new tools, including:
- AI- or ML-based tools or applications (60.0%)
- Automation tools or applications (53.1%)
- Communication/collaboration tools or applications (48.1%)
SRE and platform engineering play a vital role in implementing AI and automation. Over the past year, 61.5% increased their focus on SRE practices and intend to hire more site reliability engineers, while 57.5% enhanced platform engineering efforts and plan to bring in more platform engineers. These strategic moves highlight organizations’ dedication to fortifying their incident management capabilities.
Findings illuminate a clear path forward for the incident response lifecycle, emphasizing the need for a SaaS tool or platform that seamlessly integrates all of the incident management tools organizations use, leverages human data insights, and harnesses generative AI to bolster operational efficiency and decision-making.
AI reshapes work experience
90.4% of respondents believe that systematically mining insights from human data (such as archived Slack communications, retrospective interviews, and group feedback) could improve future incident response and operational excellence. However, 90.2% agree automation should let humans use their judgment at critical decision points to be more reliable and effective, a nearly 10% increase from the 2022 study.
89.8% see integrating generative AI capabilities into incident management tools or platforms as a way to decrease the time it takes to create new automations, freeing time for other high-value work. 96.3% believe it would be beneficial if all of the tools their organization used during an incident were integrated through one tool or platform.
For the 79.5% of organizations that have embraced AI in their tech stack, the impact is significant. 51% feel AI is making their job better, a sign of improving work life. 63.5% use it to improve the accuracy and quality of data. 50.7% report faster time to incident resolution. 49.4% use it to more quickly and easily identify the root cause of issues, potential threats, and vulnerabilities. 48% use it to automate repetitive tasks or processes, streamlining their operations effectively.
Lamas concluded, “In light of the evolving demands placed on modern ops teams, it becomes evident that what these teams require is an adaptive, LLM-based automation and incident management solution. This unified, intelligent approach goes beyond streamlining processes; it empowers teams to leverage automation and AI to enhance their organization’s incident management processes and develop more efficient automated workflows. By ensuring that humans remain actively engaged in the process, this approach becomes increasingly vital for seamless incident resolution and a reduction in MTTR. Ultimately, it enables teams to concentrate their efforts on what truly matters — delivering efficient and effective solutions to complex problems.”