How attackers use patience to push past AI guardrails

Most CISOs already treat prompt injection as a known risk. What may come as a surprise is how quickly that risk grows once an attacker is allowed to stay in the conversation. A new study from Cisco AI Defense shows how open weight models lose their footing over longer exchanges, a pattern that raises questions about how these models should be evaluated and secured.

The researchers analyzed eight open weight large language models using automated adversarial testing, comparing single turn and multi turn outcomes, and found the same pattern across the entire set.

The jump from single turn to multi turn

Multi turn attacks succeeded far more often than single turn attacks. Single turn attacks succeeded only around the low teens in percentage terms on average, while multi turn attacks climbed above sixty percent on average. One model reached 92.78 percent.

The researchers tested 1024 single turn prompts for each model, then ran nearly five hundred adaptive conversations for the multi turn stage. The dataset included malicious code prompts, extraction attempts, manipulation prompts, and other common adversarial strategies. The procedures were the same for every model.

The difference showed up before the analysis even reached individual threat categories. When attackers can adjust their approach based on earlier responses, models tend to follow along. The report shows how an attacker can start with harmless requests and then slowly introduce more dangerous instructions. This mirrors real interactions, which can run for minutes or longer across many turns.
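The overall shape of that setup is easy to picture. The sketch below is a minimal illustration of such a harness, not Cisco's tooling: query_model, judge_harmful, and next_attack are hypothetical stand-ins for the target model, an output classifier, and the adaptive attacker that writes its next prompt with the replies so far in view.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]

def single_turn_asr(prompts: List[str],
                    query_model: Callable[[List[Message]], str],
                    judge_harmful: Callable[[str], bool]) -> float:
    """Send each adversarial prompt once, with no prior context."""
    hits = sum(judge_harmful(query_model([{"role": "user", "content": p}]))
               for p in prompts)
    return hits / len(prompts)

def multi_turn_asr(goals: List[str],
                   query_model: Callable[[List[Message]], str],
                   judge_harmful: Callable[[str], bool],
                   next_attack: Callable[[str, List[Message]], str],
                   max_turns: int = 10) -> float:
    """Adaptive attacker: each new turn reacts to earlier responses,
    starting benign and escalating toward the goal."""
    hits = 0
    for goal in goals:
        history: List[Message] = []
        for _ in range(max_turns):
            prompt = next_attack(goal, history)  # adapts to the replies so far
            history.append({"role": "user", "content": prompt})
            reply = query_model(history)
            history.append({"role": "assistant", "content": reply})
            if judge_harmful(reply):
                hits += 1
                break
    return hits / len(goals)
```

Run against the same model, the two functions make the gap the report describes directly measurable: the fraction of prompts that succeed in one shot versus the fraction of goals reached once the attacker can adapt.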

How alignment strategy shapes model behavior

The study also highlights a link between model provenance and multi turn vulnerability. Some developers emphasize capability and leave safety tuning to downstream teams. Others invest more time in alignment during model development. Those choices show up in the gaps between single turn and multi turn success rates.

In explaining this pattern, the researchers cite wording directly from the technical documents of several model developers. For example, one developer notes that downstream users are “in the driver seat to tailor safety for their use case.” Another developer describes its model as built with “rigorous safety protocols” and a “low risk level.” One model card even states that the model “does not have any moderation mechanisms.” These published details help explain why some models lost far more ground during multi turn testing.

The report does not suggest that any single strategy is wrong. It argues that CISOs and technical evaluators should understand what a lab prioritizes. A high capability model may be attractive for internal development work. It may also require more layers of protection before deployment.

Threat patterns worth watching

Across models, manipulation prompts, misinformation oriented prompts, and malicious code prompts were among the categories with the highest multi turn success rates. The researchers also break out subthreats and note that the top fifteen showed especially high success rates. The exact numbers varied by model, but the pattern was consistent: multi turn strategies in these categories worked far more often than one shot prompts.

The report calls attention to rising use of strategies that blend misdirection, information breakdown and reconstruction, and role play. These approaches did not always work on the first try, but they succeeded often once the attacker had room to maneuver.

One example in the report describes a multi step process where an attacker starts with what looks like an ordinary question about scripting, then adds detail in later turns until the model generates harmful code. No single prompt looks dangerous in isolation; the conversation as a whole produces the outcome.
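That is also why per-prompt filtering tends to miss these attacks. A minimal sketch of the alternative, assuming a hypothetical score_risk classifier that returns a value between 0 and 1, is to score the accumulated thread as well as each turn:

```python
from typing import Callable, Dict, List

def conversation_risk(history: List[Dict[str, str]],
                      score_risk: Callable[[str], float],
                      threshold: float = 0.7) -> dict:
    """Score each user turn and the accumulated thread; flag the thread when
    the combined context crosses the threshold, even if every individual
    turn scores low on its own."""
    user_turns = [t["content"] for t in history if t["role"] == "user"]
    per_turn = [score_risk(text) for text in user_turns]
    combined = score_risk("\n".join(user_turns))
    return {
        "max_single_turn": max(per_turn, default=0.0),
        "whole_conversation": combined,
        "flagged": combined >= threshold,
    }
```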

What enterprise teams need to watch for

A model that looks solid under single turn testing can behave very differently across a longer thread. The report notes that single turn results can give a false sense of security. Multi turn testing gives a more realistic picture of how an attacker will behave.

The researchers recommend layered protections, stronger system prompts, context aware filtering, and ongoing assessments that include multi turn simulations. They also warn against connecting model output directly to automated systems without strict controls.
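On that last point, one way to picture such a control, sketched here with hypothetical route_output, execute, and queue_for_review functions rather than anything from the report, is a gate that lets only low risk output flow into automation and holds everything else for human review:

```python
from typing import Callable

def route_output(model_output: str,
                 risk: float,
                 execute: Callable[[str], object],
                 queue_for_review: Callable[[str, str], object],
                 risk_threshold: float = 0.5):
    """Only low-risk output flows straight into automation;
    anything above the threshold waits for a human."""
    if risk < risk_threshold:
        return execute(model_output)
    return queue_for_review(model_output,
                            f"conversation risk {risk:.2f} >= {risk_threshold}")
```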
