Hackers Can Use Indirect Prompt Injection Allows Adversaries to Manipulate AI Agents with Content


Artificial intelligence tools are now a core part of everyday workflows — from browsers that summarize web pages to automated agents that help users make decisions online.

As these tools become more capable, attackers are learning how to turn them against the very people they are designed to serve.

A method called indirect prompt injection (IDPI) allows adversaries to embed hidden instructions inside ordinary-looking web content, tricking AI agents into executing commands they were never authorized to follow.

Unlike direct prompt injection, where a person types a malicious instruction directly into a chatbot, IDPI works entirely behind the scenes.

An attacker hides instructions inside a webpage — embedded in HTML code, user comments, metadata, or invisible text — and waits for an AI tool to visit or process that page.

When the AI reads the page as part of a routine task, such as summarizing content or reviewing an advertisement, it may unknowingly interpret those hidden instructions as legitimate commands and act on them.

Threat model depiction for web-based IDPI (Source - Unit42)
Threat model depiction for web-based IDPI (Source – Unit42)

Unit 42 researchers identified that this attack is no longer just a theory. Their analysis of large-scale real-world telemetry confirmed that IDPI attacks are actively deployed across live websites, with 22 distinct techniques documented for constructing malicious payloads.

Their findings also revealed previously undocumented attacker goals, including the first known real-world case of IDPI being used to bypass an AI-based advertisement review system.

Example of Hidden Prompt in Page from reviewerpress[.]com (Source - Unit42)
Example of Hidden Prompt in Page from reviewerpress[.]com (Source – Unit42)

The range of harm these attacks can cause is broad. Attackers have used IDPI to push phishing sites up in search rankings through SEO poisoning, attempt unauthorized financial transactions, force AI tools to reveal sensitive information, and even issue server-side commands that could destroy entire databases.

In one observed case, a single webpage contained as many as 24 separate injection attempts, stacking multiple delivery methods to raise the odds that at least one would successfully reach the AI.

HTML Code Excerpt Showing IDPI from reviewerpress[.]com (Source - Unit42)
HTML Code Excerpt Showing IDPI from reviewerpress[.]com (Source – Unit42)

Across the telemetry reviewed, the most common attacker goal was producing irrelevant or disruptive AI output, accounting for 28.6% of cases, followed by data destruction at 14.2% and AI content moderation bypass at 9.5%.

This shows that attackers are going after AI systems with a wide range of goals — from low-level noise generation to serious financial fraud.

How Attackers Conceal and Deliver Malicious Payloads

One of the most significant findings in this research is how much effort attackers put into hiding their injected instructions.

Rather than dropping a simple override command into a page, they layer multiple techniques on top of one another to avoid detection by both human reviewers and automated scanners, while still ensuring the AI agent can read and act on the content.

The most frequently observed delivery method, seen in 37.8% of cases, was visible plaintext — injecting the command directly into a page footer, where most users never look.

HTML attribute cloaking was the second most common method at 19.8%, placing the malicious prompt inside HTML tag attributes where it is invisible in the browser but readable by an AI.

CSS rendering suppression followed at 16.9%, with attackers making text invisible by setting font sizes to zero or pushing content far off-screen.

For jailbreaking — convincing the AI to obey the injected command despite safety filters — social engineering dominated, appearing in 85.2% of cases.

Attackers presented their instructions as if they came from a developer or administrator, using triggers like “god mode” or “developer mode” to make the model believe compliance was valid and urgent.

Security teams and AI developers should treat untrusted web content as a potential attack source and apply input validation wherever AI agents process external data.

Deploying spotlighting techniques — separating untrusted content from trusted system instructions — can reduce attack exposure. AI systems should follow least-privilege design, requiring explicit user approval before taking high-impact actions.

Detection tools must move beyond keyword filters to incorporate behavioral analysis and intent classification capable of catching IDPI attempts that rely on encoding schemes, obfuscation, or multilingual methods to bypass defenses.

Follow us on Google News, LinkedIn, and X to Get More Instant UpdatesSet CSN as a Preferred Source in Google.

googlenews



Source link