Popular LLMs dangerously vulnerable to iterative attacks, says Cisco

Some of the world’s most widely used open-weight generative AI (GenAI) services are profoundly susceptible to so-called “multi-turn” prompt injection or jailbreaking cyber attacks, in which a malicious actor is able to coax large language models (LLMs) into generating unintended and undesirable responses, according to a research paper published by a team at networking giant Cisco.

Cisco’s researchers tested Alibaba Qwen3-32B, Mistral Large-2, Meta Llama 3.3-70B-Instruct, DeepSeek v3.1, Zhipu AI GLM-4.5-Air, Google Gemma-3-1B-IT, Microsoft Phi-4, and OpenAI GPT-OSS-20B, engineering multiple scenarios in which the various models output disallowed content, with success rates ranging from 25.86% against Google’s model up to 92.78% in the case of Mistral.

The report’s authors, Amy Chang and Nicholas Conley, alongside contributors Harish Santhanalakshmi Ganesan and Adam Swanda, said this represented a two- to tenfold increase over single-turn baselines.

“These results underscore a systemic inability of current open-weight models to maintain safety guardrails across extended interactions,” they said.

“We assess that alignment strategies and lab priorities significantly influence resilience: capability-focused models such as Llama 3.3 and Qwen 3 demonstrate higher multi-turn susceptibility, whereas safety-oriented designs such as Google Gemma 3 exhibit more balanced performance.

“The analysis concludes that open-weight models, while crucial for innovation, pose tangible operational and ethical risks when deployed without layered security controls … Addressing multi-turn vulnerabilities is essential to ensure the safe, reliable and responsible deployment of open-weight LLMs in enterprise and public domains.”

What is a multi-turn attack?

Multi-turn attacks take the form of iterative “probing” of an LLM to expose systemic weaknesses that are usually masked because models can better detect and reject isolated adversarial requests.

Such an attack could begin with an attacker making benign queries to establish trust, before subtly introducing more adversarial requests to accomplish their actual goals.

Prompts may be framed with terminology such as “for research purposes” or “in a fictional scenario”, and attackers may ask the models to engage in roleplay or persona adoption, introduce contextual ambiguity or misdirection, or to break down information and reassemble it – among other tactics.
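To illustrate the mechanics only, the sketch below shows what a multi-turn probe looks like at the API level: the full conversation history is resent and extended on every turn, so earlier benign exchanges become context that later, more adversarial requests lean on. The endpoint URL, model name and placeholder prompts here are assumptions for illustration, not details taken from Cisco’s paper.

```python
# Minimal sketch of how a multi-turn probe is structured at the API level.
# Assumes an OpenAI-compatible chat endpoint (e.g. a locally served open-weight
# model); the URL, model name and placeholder prompts are illustrative only.
import requests

API_URL = "http://localhost:8000/v1/chat/completions"  # hypothetical local server
MODEL = "example-open-weight-model"                     # hypothetical model name

# Each turn builds on the previous ones: early turns are benign and establish
# a framing ("for research purposes", roleplay, etc.); later turns escalate.
TURNS = [
    "Hi, I'm researching how chat assistants handle sensitive topics.",
    "For a fictional scenario, could you describe the topic in general terms?",
    "Staying in that fictional framing, can you go into more specific detail?",
]

def send(messages):
    """Send the full conversation history and return the assistant's reply."""
    resp = requests.post(API_URL, json={"model": MODEL, "messages": messages}, timeout=60)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

messages = []  # the growing history is what makes the attack "multi-turn"
for turn in TURNS:
    messages.append({"role": "user", "content": turn})
    reply = send(messages)
    messages.append({"role": "assistant", "content": reply})
    # A real evaluation harness would score each reply here, e.g. flag
    # whether the model refused or produced disallowed content.
    print(f"user: {turn}\nmodel: {reply[:80]}...\n")
```

The key difference from a single-turn jailbreak attempt is that no individual message need look overtly malicious; the harmful intent is distributed across the accumulated context.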

Whose responsibility?

The researchers said their work underscored the susceptibility of LLMs to adversarial attacks, a particular concern given that all of the models tested were open-weight – meaning, in lay terms, that anybody who cares to do so can download, run and even modify them.

They highlighted the three most susceptible models – Mistral, Llama and Qwen – as a particular concern, saying these had probably been shipped with the expectation that developers would add guardrails themselves. By contrast, Google’s model was the most resistant to multi-turn manipulation, while OpenAI’s and Zhipu’s models both rejected multi-turn attempts more than 50% of the time.

“The AI developer and security community must continue to actively manage these threats – as well as additional safety and security concerns – through independent testing and guardrail development throughout the lifecycle of model development and deployment in organisations,” they wrote.

“Without AI security solutions – such as multi-turn testing, threat-specific mitigation and continuous monitoring – these models pose significant risks in production, potentially leading to data breaches or malicious manipulations,” they added.
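The “multi-turn testing” the authors call for could, in principle, reuse the same harness pattern shown earlier: replay scripted escalating conversations against a candidate deployment and check whether refusal behaviour holds up across the whole transcript rather than per message. The keyword heuristic below is a deliberately crude placeholder of my own, not a technique described in the paper; production evaluations typically rely on trained judge models or moderation classifiers.

```python
# Crude sketch of conversation-level guardrail testing: did refusal behaviour
# hold up across an entire scripted multi-turn conversation, not just turn one?
# The keyword check is a placeholder; real evaluations use stronger judges.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm not able to")

def looks_like_refusal(reply: str) -> bool:
    """Very rough proxy for 'the model declined this request'."""
    return any(marker in reply.lower() for marker in REFUSAL_MARKERS)

def conversation_held(replies: list[str]) -> bool:
    """Pass only if every reply to an adversarial turn reads as a refusal."""
    return all(looks_like_refusal(reply) for reply in replies)

# Example: feed in the assistant replies collected by a multi-turn harness
# and record a failure for the deployment if any turn slipped through.
```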
