SecurityWeek

When Information Becomes the Attack Surface – Understanding AI Agent Traps


AI agents go beyond answering questions. They can autonomously browse websites, read emails, search company files, query software tools, and more. A model that merely produces an incorrect answer is a limited threat. That changes when agents encounter information maliciously designed to influence what they see, believe, remember, or execute.

An agent draws on webpages, document stores, wikis, images, emails, and tools to produce its outputs. But what happens when these sources conceal malicious instructions? They trap the agent into misinterpreting information or taking unintended action. Scientists from Google DeepMind sorted these “traps” into six categories: content injection, semantic manipulation, cognitive state, behavioral control, systemic, and human-in-the-loop traps. The last two are more theoretical and are expected to become more relevant as AI agent use grows. Understanding these traps helps determine the necessary mitigations.

Content Injection: When Instructions Hide in Plain Sight

Content injections exploit the difference between what a human sees and what an agent parses, as well as the system’s difficulty in keeping trusted instructions separate from untrusted external data.

A webpage might appear harmless, but its underlying code, metadata, hidden text, or images can contain malicious instructions for an AI system. The attack begins when an AI model accepts attacker-controlled data from an external source, such as a website or file. If the system fails to distinguish between data and instructions, the model may start following instructions embedded in that content. The objective of such an injection is to alter the AI’s response, disclose sensitive information, or enable an unauthorized action. In NIST evaluations of agent hijacking, malicious instructions succeeded 57% of the time on average across five tested injection tasks.

For example, a support ticket carrying hidden malicious instructions can manipulate an AI agent into retrieving customer data from the CRM and sending it to an attacker-controlled address. If the agent has excessive permissions, this exfiltration becomes all the easier.
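One way to narrow the gap between what a human sees and what an agent parses is to separate rendered text from non-rendered text before the content reaches the model. The following is a minimal sketch using Python's standard `html.parser`; the class and function names are illustrative assumptions, and the hiding rules (the `hidden` attribute, inline `display:none` or `visibility:hidden`, comments, and non-rendered elements) are a heuristic, not a complete defense. It does not catch text hidden through external stylesheets, off-screen positioning, tiny fonts, or instructions embedded in images.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not tracked as open.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
# Elements whose text content is not rendered for a human reader.
NON_RENDERED_TAGS = {"script", "style", "template", "noscript"}


class VisibleTextSplitter(HTMLParser):
    """Splits page text into what a reader would see and what is hidden."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._open = []    # stack of (tag, is_hidden) for currently open elements
        self.visible = []  # text a human would normally see
        self.hidden = []   # text only a parser would encounter

    @staticmethod
    def _hides(tag, attrs):
        attr = dict(attrs)
        style = (attr.get("style") or "").replace(" ", "").lower()
        return (tag in NON_RENDERED_TAGS
                or "hidden" in attr
                or "display:none" in style
                or "visibility:hidden" in style)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # A child of a hidden element is hidden too.
        inherited = bool(self._open) and self._open[-1][1]
        self._open.append((tag, inherited or self._hides(tag, attrs)))

    def handle_endtag(self, tag):
        # Close the nearest matching open element and anything nested in it.
        for i in range(len(self._open) - 1, -1, -1):
            if self._open[i][0] == tag:
                del self._open[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            hidden = bool(self._open) and self._open[-1][1]
            (self.hidden if hidden else self.visible).append(text)

    def handle_comment(self, data):
        # Comments are never rendered, so any text in them counts as hidden.
        if data.strip():
            self.hidden.append(data.strip())


def split_visible_hidden(html):
    """Return (visible_texts, hidden_texts) for an HTML string."""
    parser = VisibleTextSplitter()
    parser.feed(html)
    parser.close()
    return parser.visible, parser.hidden
```

In a pipeline like the support-ticket scenario, only the visible list would be passed to the agent as untrusted data, while a non-empty hidden list could be logged or used to flag the ticket for review.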

Semantic Manipulation: Shapeshifting the Information

Semantic manipulation need not explicitly tell the agent what to do. Instead, it feeds the agent repetition, emotional language, selective context, a false sense of authority, and coordinated claims to skew the context and guide the agent toward the attacker’s preferred conclusion.


Imagine you have tasked an agent with selecting a supplier. It comes across search results that repeatedly extol one specific supplier, describe it as the gold standard, highlight its strengths, and amplify doubts about competitors. This increases the chances that the agent recommends this supplier. Conventional signature-based security tools may not flag anything, because the attack works by influencing the agent’s reasoning rather than by delivering malicious code.

Here, manipulation of the surrounding information environment becomes the manipulation of the decision itself.
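One partial countermeasure to repetition is to count how many distinct origins support a conclusion instead of how many times it is repeated. The sketch below is an assumption-laden illustration: the function name and the `(url, supplier)` input shape are hypothetical, and a hostname is only a crude proxy for independence, since coordinated claims spread across many domains would still pass.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def independent_support(results):
    """Count distinct hostnames endorsing each supplier.

    `results` is an iterable of (url, supplier) pairs, a hypothetical
    shape for search hits already tagged with the supplier they favor.
    """
    hosts_by_supplier = defaultdict(set)
    for url, supplier in results:
        host = (urlsplit(url).hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]  # treat www.example.com and example.com as one origin
        hosts_by_supplier[supplier].add(host)
    return {supplier: len(hosts) for supplier, hosts in hosts_by_supplier.items()}
```

Four glowing pages from a single site then count as one voice, which makes raw repetition less persuasive without judging the content of any individual claim.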

Cognitive State Traps: Poisoning Agent Knowledge

Some agent systems use retrieval databases, interaction histories, or persistent memory stores to maintain context and continuity across tasks. This creates an opportunity for poisoned information to influence later outputs or actions. For example, an agent may treat a poisoned document in a shared repository as trusted evidence, or a manipulated exchange may be stored in the agent’s memory and resurface during future tasks.

Research presented at the USENIX conference found that, in controlled tests, inserting five specially crafted texts per target question caused a RAG system to produce the attacker’s chosen answer in about 90% of cases, even when its knowledge base contained millions of legitimate texts.

With information governance becoming an integral component of AI security, organizations must be aware of which sources agents retrieve information from, who can modify those sources, how claims can be verified, and whether stored memories can be reviewed or removed.
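The governance questions above (where content came from, who could modify it, whether it was verified, and whether it can be reviewed or removed) map naturally onto metadata stored with each memory. The following is a minimal sketch under stated assumptions: `GovernedMemory`, its method names, and the notion of a fixed trusted-source list are all hypothetical, and the verification step itself is left to whatever process an organization actually uses.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class MemoryEntry:
    entry_id: int
    text: str
    source: str          # where the content came from
    author: str          # who wrote or last modified it
    stored_at: datetime
    verified: bool = False


class GovernedMemory:
    """Memory store that keeps provenance and supports review and removal."""

    def __init__(self, trusted_sources):
        self._trusted = set(trusted_sources)
        self._entries = {}
        self._next_id = 1

    def remember(self, text, source, author):
        entry = MemoryEntry(self._next_id, text, source, author,
                            datetime.now(timezone.utc))
        self._entries[entry.entry_id] = entry
        self._next_id += 1
        return entry.entry_id

    def mark_verified(self, entry_id):
        self._entries[entry_id].verified = True

    def _is_usable(self, entry):
        return entry.source in self._trusted and entry.verified

    def usable(self):
        """Entries the agent may cite as evidence."""
        return [e for e in self._entries.values() if self._is_usable(e)]

    def review_queue(self):
        """Entries held back until a reviewer verifies or removes them."""
        return [e for e in self._entries.values() if not self._is_usable(e)]

    def forget_source(self, source):
        """Remove every entry from a source, e.g. after it is found poisoned."""
        doomed = [i for i, e in self._entries.items() if e.source == source]
        for i in doomed:
            del self._entries[i]
        return len(doomed)
```

Keeping unverified entries in a queue rather than discarding them preserves an audit trail, and `forget_source` gives a single removal path when a repository is later found to be compromised.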

Behavioral Control: Turning Influence into Action

Behavioral control operates at the juncture where interpretation is translated into action. Malicious content may attempt to make the AI agent send data, approve a transaction, execute code, invoke another tool or trigger a myriad of other actions. Here, the extent of the consequence depends on the extent of the agent’s access. Grant the agent only the data access and tool permissions required for the specific task. This could be the difference between an agent delivering a misleading summary and the same agent reading confidential files and communicating this information externally, resulting in data loss.
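Task-scoped permissions can be sketched as an allowlist that decides which tools an agent can reach for a given task. Everything named here is a hypothetical illustration: the task names, the tool names, and the `ScopedToolbox` class are assumptions, and a real deployment would enforce the same rule at the credential or API layer and not only in application code.

```python
class ToolNotPermitted(PermissionError):
    """Raised when a task tries to use a tool outside its allowlist."""


# Hypothetical task-to-tool allowlist; in practice this would come from policy.
TASK_ALLOWLIST = {
    "summarize_ticket": {"read_ticket"},
    "refund_customer": {"read_ticket", "read_crm_record", "issue_refund"},
}


class ScopedToolbox:
    """Exposes only the tools that the current task is allowed to call."""

    def __init__(self, task, tools, allowlist=TASK_ALLOWLIST):
        self.task = task
        allowed = allowlist.get(task, set())  # an unknown task gets no tools
        self._tools = {name: fn for name, fn in tools.items() if name in allowed}

    def call(self, name, *args, **kwargs):
        if name not in self._tools:
            raise ToolNotPermitted(f"task {self.task!r} may not call {name!r}")
        return self._tools[name](*args, **kwargs)
```

With this arrangement, an injected instruction that persuades a summarization agent to send an email fails at the toolbox, because the email tool was never handed to that task in the first place.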

The More Theoretical Frontier

Systemic traps and human-in-the-loop traps remain less developed, but they deserve attention. Systemic traps could induce many similar agents to behave in correlated ways, causing congestion, market disruption, or cascading failures. Human-in-the-loop traps could use a compromised agent to mislead the person expected to approve its actions.

These risks may become more plausible as agent populations grow and users become accustomed to trusting agent-generated summaries.

Controls for Agent Traps

No single control will eliminate the agent trap threat. A defensive framework should combine source verification, content screening, memory governance, restricted permissions, isolated execution, monitoring, and independent approval with a human in the loop for high-impact actions. Security must follow authority, and there should be a clear separation between the ability to interpret and the authority to act.
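The separation between interpreting and acting can be sketched as an executor that accepts proposed actions from the agent but consults an independent approver before running high-impact ones. The action names, the `ProposedAction` type, and `run_action` are hypothetical, and which actions count as high-impact is an assumption each organization would define for itself.

```python
from dataclasses import dataclass, field

# Hypothetical set of actions that always need independent approval.
HIGH_IMPACT_ACTIONS = {"send_external_email", "approve_payment", "execute_code"}


@dataclass(frozen=True)
class ProposedAction:
    """What the agent wants to do; proposing is not the same as doing."""
    name: str
    arguments: dict = field(default_factory=dict)


def run_action(action, handlers, approve):
    """Execute a proposed action, gating high-impact ones on approval.

    `handlers` maps action names to callables. `approve` is a callable that
    receives the raw ProposedAction and returns True or False.
    """
    if action.name not in handlers:
        return ("rejected", "unknown action")
    if action.name in HIGH_IMPACT_ACTIONS and not approve(action):
        return ("blocked", "approval denied")
    return ("done", handlers[action.name](**action.arguments))
```

Passing the raw action and its arguments to the approver, instead of the agent's own description of them, is a deliberate choice: it limits how far a compromised agent can mislead the person expected to approve its actions.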

The future of agentic AI will depend not only on what these agents can do but also on how they decide what to trust. That they can complete a task is not in doubt, but they must also be able to recognize when the environment they operate in and draw on is trying to manipulate them.

