Researchers from the University of Melbourne and Imperial College London have developed a method for using LLMs to improve incident response planning with a focus on reducing the risk of hallucinations. Their approach uses a smaller, fine-tuned LLM combined with retrieval-augmented generation and decision-theoretic planning.
Figure: The three steps of the method for incident response planning.
The problem they target is familiar: incident response is still largely manual, slow, and reliant on expert-configured playbooks. Many organisations take weeks or even months to fully recover from an incident. While some have experimented with frontier LLMs to generate response actions, these models are costly, depend on third-party APIs, and are prone to generating plausible but incorrect instructions.
Kim Hammar, one of the authors of the paper, said the system was designed to avoid heavy integration hurdles. “From a technical perspective, our method is deliberately designed to integrate directly into existing workflows without requiring any additional software or adaptation of existing systems,” he explained. “In particular, our method takes as input log data and threat information in raw textual form. This text does not need to follow any specific syntax or format.”
Three-step approach
The method works in three main steps:
Instruction fine-tuning: The team fine-tunes a 14-billion-parameter LLM on a dataset of 68,000 historical incidents, each paired with response plans and reasoning steps. This aligns the model with the phases and goals of incident response without locking it to any single scenario.
Information retrieval: Before generating a plan, the system pulls relevant threat intelligence and vulnerability data based on indicators found in system logs. This allows it to adapt to new threats, such as vulnerabilities discovered after the training cut-off, and grounds the model’s output in up-to-date information.
Planning with hallucination filtering: Instead of executing the first suggested action, the system generates multiple candidates and uses the LLM to simulate potential outcomes. It selects the action predicted to lead to the fastest recovery, using this lookahead to filter out responses that do not make progress.
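The planning-with-lookahead step can be sketched as follows. This is a minimal illustration of the idea, not the authors' implementation: the function names and the random "simulation" are stand-ins for calls to the fine-tuned 14B model, which in the real system proposes actions from logs plus retrieved threat intelligence and predicts their outcomes.

```python
import random

# Hypothetical stand-in: the real system prompts the fine-tuned LLM to
# propose response actions from raw log text and retrieved threat intel.
def generate_candidate_actions(logs, threat_context, n=4):
    return [f"action-{i}" for i in range(n)]

# Hypothetical stand-in: the real system uses the LLM to simulate the
# outcome of each action; here a random value plays that role.
def simulate_recovery_time(action, logs):
    return random.uniform(1.0, 10.0)

def plan_next_action(logs, threat_context, n_candidates=4):
    """Lookahead planning: generate several candidate actions, simulate
    each one's outcome, and pick the action predicted to recover fastest.
    Candidates predicted to make no progress are filtered out by never
    being selected."""
    candidates = generate_candidate_actions(logs, threat_context, n_candidates)
    scored = [(simulate_recovery_time(a, logs), a) for a in candidates]
    best_time, best_action = min(scored)
    return best_action
```

The key design point is that no single LLM output is trusted directly: an action is only executed after it wins a comparison against alternatives under simulated outcomes.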
From a user’s point of view, Hammar said the method can act like a more adaptive playbook. “It should integrate into existing workflows that rely on response playbooks,” he noted. “Security operators should treat the suggested actions as guidance to be validated against available evidence, rather than as the ground truth.”
Theoretical and practical outcomes
The paper provides a probabilistic analysis showing that the likelihood of hallucination can be bounded, and that the bound can be made arbitrarily small by giving the planning process more time and more candidate actions. This offers a formal basis for the claim that the approach is more reliable than prompt-only frontier LLMs.
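A rough intuition for why more candidates help (a simplification of our own, not the paper's formal analysis): if each independently generated candidate is a hallucination with probability p < 1, the chance that every one of k candidates hallucinates decays geometrically in k.

```python
def prob_all_hallucinate(p, k):
    """Chance that all k independently sampled candidates hallucinate,
    assuming each does so with probability p (a simplifying assumption)."""
    return p ** k

# The probability shrinks geometrically as more candidates are drawn.
for k in (1, 3, 5, 10):
    print(k, prob_all_hallucinate(0.3, k))
```

Under this toy model, as long as the lookahead filter can recognise and discard bad candidates, sampling more of them drives the residual risk toward zero, which mirrors the shape of the paper's claim.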
On the practical side, the method is lightweight enough to run on commodity hardware, removing the need for costly API calls or specialised infrastructure. The authors evaluated their system against several frontier LLMs and a reinforcement learning baseline using publicly available incident datasets. Across all tests, it achieved shorter average recovery times, up to 22 percent faster than the best-performing frontier model tested, while also reducing ineffective actions and failed recoveries.
According to Hammar, the local, self-contained nature of the system also addresses confidentiality and compliance concerns. “A key benefit of our lightweight method is that it runs locally without relying on external LLM providers,” he said. “This flexibility reduces costs and avoids the need to upload potentially sensitive log data to third-party LLM providers.”
An ablation study confirmed that all three steps contributed to performance gains, with fine-tuning and planning providing the largest improvements. Retrieval-augmented generation also helped, though its effect was smaller.
Trade-offs and considerations
While the approach avoids incident-specific retraining and improves reliability, it does have overhead. The planning step increases inference time because it requires generating and evaluating multiple actions. The authors note this can be mitigated with parallel processing.
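Since the candidate evaluations are independent of one another, the mitigation the authors mention is straightforward to sketch. This is an illustrative pattern (our own, using Python's standard thread pool), with `simulate_outcome` as a placeholder for the per-candidate LLM call:

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder for the per-candidate LLM simulation call; here the string
# length serves as a dummy score (lower = faster predicted recovery).
def simulate_outcome(action):
    return len(action)

def evaluate_in_parallel(candidates):
    """Score all candidate actions concurrently, then return the one with
    the lowest predicted recovery time."""
    with ThreadPoolExecutor(max_workers=len(candidates)) as pool:
        scores = list(pool.map(simulate_outcome, candidates))
    best_score, best_action = min(zip(scores, candidates))
    return best_action
```

With k candidates evaluated concurrently, the wall-clock cost of planning approaches that of a single inference call rather than k sequential ones.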
The method is especially useful for situations where a quick reaction is critical and log data is complex. Hammar described one such use case: “It’s 2 a.m. and your SIEM detects a potential incident. Your on-call security operator is paged to identify the specific problem, determine the cause, and resolve it as quickly as possible. Instead of jumping between dashboards and manually tracing the incident across multiple applications and infrastructure layers, our LLM-based method helps interpret logs and suggests targeted response actions.”
On the other hand, he acknowledged that some scenarios would see less benefit. “Incidents where immediate action is not critical may benefit less from our method,” he said. “Highly novel or sophisticated attacks that require deep expert analysis may only benefit in the early stages of the response.”
Another key point is that the system is not intended to replace human judgment. Hammar sees the next few years as a period where human oversight will remain essential. “Completely autonomous incident response in the next few years is unrealistic because everyone’s network, attacks, security environments, and regulations differ,” he said. “Decision-support tools are steadily taking over tasks once done manually, shifting the operator’s role toward guiding and validating the system rather than sifting through large volumes of logs and security alerts.”
The team has released their fine-tuned model, training data, code, and a demonstration video as open-source resources. This enables further experimentation and operational trials. They see future work in testing the method in real-world SOC workflows, refining the theoretical hallucination bounds, and extending the planning process to use more advanced search techniques.
If proven in operational settings, the approach could give security teams a more responsive and cost-effective way to triage and contain incidents without relying on expensive frontier LLMs or rigid playbooks.