CISOOnline

AI models more vulnerable than claimed when faced with iterative attacks

“The dominant safety benchmarks for frontier large language models share a structural assumption: that a single prompt and a single model response are enough to characterize how a model behaves under adversarial attack,” the Cisco researchers who authored the study said in a blog post. “These benchmarks inform model cards, safety reports, and procurement decisions across the industry, but they all only measure one narrow slice of attacker behavior.”

Instead, the researchers subjected 15 of the most widely used frontier AI models to a variety of attack techniques that are more likely to occur in the real world, where attackers do not give up after a model refuses a single malicious prompt.

“Real adversaries iterate,” the researchers said. “They reframe refusals, decompose tasks across turns, adopt personas, and escalate gradually. A single turn benchmark cannot see any of that.”

Stress-testing over multiple prompts

The tests pitted models in various configurations, such as with reasoning enabled or disabled, against a range of attack strategies aimed at bypassing safety guardrails. Techniques included role-play; misdirection, or introducing ambiguity into the context; redirection, or reframing the model’s refusal; information decomposition and reassembly; and incremental escalation, in which a task is broken into smaller parts that don’t seem malicious on their own.
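To make the distinction concrete, here is a minimal sketch of how a single-turn check and a multi-turn check can disagree about the same model. It is not the researchers' harness: the names `toy_model`, `single_turn_bypassed`, and `multi_turn_bypassed` are hypothetical, the prompts are placeholders, and the toy model's behavior (refusing until its third user turn) is an assumption chosen purely for illustration.

```python
from typing import Callable, List

# Hypothetical interface: a "model" maps the conversation so far
# (alternating user and assistant messages) to its next reply.
Model = Callable[[List[str]], str]

REFUSAL = "REFUSED"


def is_refusal(reply: str) -> bool:
    # Assumption: refusals are marked exactly; a real evaluation would need a judge.
    return reply == REFUSAL


def single_turn_bypassed(model: Model, prompt: str) -> bool:
    """One prompt, one response: what a single-turn benchmark observes."""
    return not is_refusal(model([prompt]))


def multi_turn_bypassed(model: Model, prompts: List[str], max_turns: int) -> bool:
    """Keep the conversation going after a refusal, up to max_turns user turns."""
    history: List[str] = []
    for prompt in prompts[:max_turns]:
        history.append(prompt)
        reply = model(history)
        history.append(reply)
        if not is_refusal(reply):
            return True  # guardrail gave way at some point in the conversation
    return False


def toy_model(history: List[str]) -> str:
    # Stand-in behavior, not a real model: refuses until the third user turn.
    user_turns = (len(history) + 1) // 2
    return REFUSAL if user_turns < 3 else "COMPLIED"


# Placeholder turns only; no real attack content.
placeholder_prompts = ["turn-1", "turn-2", "turn-3", "turn-4"]

single = single_turn_bypassed(toy_model, placeholder_prompts[0])
multi = multi_turn_bypassed(toy_model, placeholder_prompts, max_turns=4)
print(f"single-turn bypassed: {single}, multi-turn bypassed: {multi}")
```

In this toy setup the single-turn check reports the model as safe while the multi-turn check does not, which is the kind of gap the researchers describe.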


