CISOOnline

AI cyberattackers are getting better faster

The ability of AI models to perform end-to-end, multi-stage penetration tests that match the capabilities of humans undertaking the same tasks has improved dramatically in recent months, according to new benchmarks published by the UK government’s AI Security Institute (AISI).

In November 2025, the difficulty of cyber tasks the best models could complete was doubling every eight months, according to AISI, a research organization within the Department for Science, Innovation and Technology (DSIT).

By February this year, the pace of improvement had accelerated, with the difficulty of the tasks AI models could complete doubling every 4.7 months. Since then, the latest Claude Mythos Preview and GPT-5.5 models have shown even greater capability, AISI said.
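To put the two doubling times in perspective, the short sketch below converts each into a growth factor over 12 months. It assumes steady exponential growth at the quoted rate, which is a simplification rather than anything AISI states, and the function name `growth_factor` is purely illustrative.

```python
def growth_factor(doubling_months: float, elapsed_months: float) -> float:
    """Factor by which task difficulty grows after `elapsed_months`,
    assuming it doubles every `doubling_months` (steady exponential growth)."""
    return 2 ** (elapsed_months / doubling_months)


# Doubling times quoted in the article: 8 months (November 2025)
# and 4.7 months (February). The 12-month window is our own choice.
for doubling in (8.0, 4.7):
    factor = growth_factor(doubling, 12.0)
    print(f"doubling every {doubling} months -> x{factor:.2f} over a year")
```

Under that assumption, an eight-month doubling time works out to a little under a threefold increase per year, while a 4.7-month doubling time works out to a little under sixfold.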

The time horizon benchmarks used by AISI first measure or estimate the time it would take a human expert to solve a variety of challenges as a proxy for their difficulty and then estimate the longest task (in human work hours) that AI models can complete with a success rate of 80%. This makes it a measure of autonomous capability rather than speed: If a human can successfully complete a set of pen testing tasks in 4 hours, time horizon testing measures how successfully an AI model can match this capability at a given reliability.
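The idea can be illustrated with a deliberately simplified sketch. It groups tasks by the time a human expert needs, then reports the longest task length at which the model still succeeds at least 80% of the time. This is a stand-in for illustration only: the article does not describe AISI's actual estimation method, and the function name and sample data here are hypothetical.

```python
from collections import defaultdict


def time_horizon(results, reliability=0.8):
    """Return the longest task length (in human hours) that the model clears
    at `reliability`, stopping at the first length where it falls short.

    `results` is a list of (human_hours, succeeded) pairs.
    """
    by_length = defaultdict(list)
    for human_hours, succeeded in results:
        by_length[human_hours].append(succeeded)

    horizon = 0.0
    for human_hours in sorted(by_length):
        outcomes = by_length[human_hours]
        if sum(outcomes) / len(outcomes) < reliability:
            break  # success rate dropped below the threshold
        horizon = human_hours
    return horizon


# Hypothetical results: ten attempts at each task length.
results = (
    [(0.5, True)] * 10
    + [(1.0, True)] * 9 + [(1.0, False)] * 1
    + [(2.0, True)] * 8 + [(2.0, False)] * 2
    + [(4.0, True)] * 5 + [(4.0, False)] * 5
)
print(time_horizon(results))
```

With this made-up data, the model meets the 80% bar on tasks of up to two human hours but not on four-hour tasks, so its time horizon at that reliability is two hours. Raising the reliability threshold shortens the horizon.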

To achieve this, the AI must sustain performance over multiple steps while maintaining context and recovering from failures. The more steps, the more difficult pen testing becomes, and the more meaningful the results.

As with all benchmarks, there are caveats. One is that, to compare performance between models over time, the testing capped the AI systems at a low 2.5 million tokens. In these benchmarks, that cap has a number of effects, including limiting the models' ability to keep track of what they were working on at an earlier stage.

As AISI said in its analysis, “They are inexact predictors of performance; AI struggles with some tasks humans do quickly, and easily completes others that humans find hard. However, we use this type of benchmark because it offers a measure of AI autonomy from which we can draw trends.”

Growing risk

The research is cause for concern for the UK government.

“Our independent testing shows that cyber capabilities in leading AI systems are advancing much faster than we expected. That matters because this isn’t theoretical — those advances are already starting to translate into real risks for organisations, especially those with weak cyber defences,” UK AI Minister Kanishka Narayan said via email.

“These tools can also help cyber security teams spot and fix weaknesses faster. The UK is leading the way in testing and understanding frontier AI, and that capability is only going to become more important as the technology continues to move at pace,” he added.

In April, DSIT Secretary of State Liz Kendall and Security Minister Dan Jarvis posted an open letter warning businesses of the growing cyber security risks posed by AI models.

What’s clear is that the capabilities of AI models under real-world scenarios are rapidly improving and, on the evidence of the recent AISI evaluation of Claude Mythos Preview, are probably accelerating.

Not all recent benchmarking of AI’s abilities to solve difficult problems has delivered such impressive results. In a recent test of 19 AI models against a range of tasks including coding, crystallography, genealogy and music sheet notation, researchers at Microsoft found the models could be error-prone and unreliable, especially for longer tasks.

Kat Traxler, principal security researcher at Vectra AI, sees the benchmarks as a useful signal that enterprises should pay attention to. “The AISI benchmarks don’t measure if models can spot a flaw. Rather, they measure whether various models can chain together a series of exploits into working attacks to achieve an end goal, like real-world attackers do. As a signal of offensive capability, AISI’s results carry real weight,” she said.

However, she pointed to a recent Xbow evaluation of Claude Mythos that found mixed performance at some tasks. “How these known model limitations will actually limit real-world autonomous offensive campaigns is still being determined, but it does point to the need for a sophisticated validation harness to truly see the ceiling of model capabilities.”

According to Chris Lentricchia, director of cloud and AI security strategy at Sweet Security, enterprises should also look at the upside: AI models aid attackers, but they also aid defenders.

“This is not purely an offensive story. The same acceleration improving attacker capability can also improve defensive capability in areas like proactive threat detection and response automation. Benchmarks are best viewed as indicators for understanding whether enterprise defenses are evolving fast enough to keep pace with accelerating AI capability,” said Lentricchia.


