Hackers Can Manipulate Claude AI APIs With Indirect Prompts To Steal User Data

Hackers can exploit Anthropic’s Claude AI to steal sensitive user data. By leveraging the model’s newly added network capabilities in its Code Interpreter tool, attackers can use indirect prompt injection to extract private information, such as chat histories, and upload it directly to their own accounts.

This revelation, detailed in Rehberger’s October 2025 blog post, underscores the growing risks as AI systems become increasingly connected to the outside world.

According to Johann Rehberger, the flaw hinges on Claude’s default “Package managers only” setting, which permits network access to a limited list of approved domains, including api.anthropic.com.

While intended to let Claude install software packages securely from sites like npm, PyPI, and GitHub, this whitelist opens a backdoor. Rehberger showed that malicious prompts hidden in documents or user inputs can trick the AI into executing code that accesses user data.

Indirect Prompts Attack Chain

Rehberger’s proof-of-concept attack begins with indirect prompt injection, where an adversary embeds harmful instructions in seemingly innocuous content, like a file the user asks Claude to analyze.

Leveraging Claude’s recent “memory” feature, which lets the AI reference past conversations, the payload instructs the model to extract recent chat data and save it as a file in the Code Interpreter’s sandbox, specifically at /mnt/user-data/outputs/hello.md.

google

Next, the exploit forces Claude to run Python code using the Anthropic SDK. This code sets the environment variable for the attacker’s API key and uploads the file via Claude’s Files API.

Crucially, the upload targets the attacker’s account, not the victim’s, bypassing normal authentication. “This worked on the first try,” Rehberger noted, though Claude later grew wary of obvious API keys, requiring obfuscation with benign code like simple print statements to evade detection.

A demo video and screenshots illustrate the process: An attacker views their empty console, the victim processes a tainted document, and moments later, the stolen file appears in the attacker’s dashboard up to 30MB per upload, with multiple uploads possible. This “AI kill chain” could extend to other allow-listed domains, amplifying the threat.

Rehberger responsibly disclosed the issue to Anthropic on October 25, 2025, via HackerOne. Initially dismissed as a “model safety issue” and out of scope, Anthropic later acknowledged it as a valid vulnerability on October 30, citing a process error.

The company’s documentation already warns of data exfiltration risks from network egress, advising users to monitor sessions closely and halt suspicious activity.

Experts like Simon Willison highlight this as part of the “lethal trifecta” in AI security: powerful models, external access, and prompt-based control.

For mitigation, Anthropic could enforce sandbox rules limiting API calls to the logged-in user’s account. Users should disable network access or whitelist domains sparingly, avoiding the false security of defaults.

As AI tools like Claude integrate deeper into workflows, such exploits remind us that connectivity breeds danger. Without robust safeguards, what starts as helpful automation could become a hacker’s playground.

Follow us on Google News, LinkedIn, and X for daily cybersecurity updates. Contact us to feature your stories.

googlenews

Source link

Search

Indirect Prompts Attack Chain

Latest Posts