LegalPwn Attack Tricks AI Tools Like ChatGPT And Gemini Into Running Malicious Code

Security researchers have discovered a new type of cyberattack that exploits how AI tools process legal text, successfully tricking popular language models into executing dangerous code.

Cybersecurity firm Pangea has unveiled a sophisticated attack method called “LegalPwn” that embeds malicious instructions within seemingly innocent legal disclaimers, terms of service, and copyright notices.

The technique represents a significant evolution in prompt injection attacks, targeting the inherent trust that AI systems place in formal legal language.

How the Attack Works

Unlike traditional prompt injections that use obvious malicious commands, LegalPwn disguises harmful instructions within authentic-looking legal text.

Attackers craft disclaimers containing hidden directives that instruct AI models to ignore security protocols, misclassify dangerous code as safe, or even execute malicious commands.

The research team tested this approach across multiple leading AI platforms, with alarming results.

Popular models including ChatGPT 4.1, ChatGPT 4o, Google’s Gemini 2.5 Flash and Pro, xAI’s Grok 3 and 4, Meta’s LLaMA 3.3 70B, and Microsoft’s Phi 4 all fell victim to the attack under certain conditions.

The implications extend far beyond laboratory testing. Pangea’s researchers successfully deployed LegalPwn attacks in live environments, including Google’s gemini-cli tool and GitHub Copilot.

In one demonstration, the attack bypassed AI-driven security analysis, causing systems to classify malicious reverse shell code as a harmless calculator program.

Most concerning was an incident where gemini-cli not only failed to detect the threat but actively recommended that users execute the malicious code, potentially compromising their systems.

GitHub Copilot similarly misidentified dangerous networking code as benign functionality.

Not all AI systems proved equally vulnerable. Anthropic’s Claude models (3.5 Sonnet and Sonnet 4) demonstrated strong resistance across all test scenarios, consistently identifying malicious code regardless of how it was disguised.

Meta’s LLaMA Guard 4 also maintained robust defenses against the attacks.

The research revealed that the effectiveness of LegalPwn attacks heavily depends on system prompts – the underlying instructions that guide AI behavior.

Models with strong, security-focused system prompts that explicitly warn about potential manipulation showed significantly better resistance to the attacks.

This discovery highlights a critical vulnerability in how AI systems process and trust different types of text.

Legal disclaimers, privacy policies, and terms of service are ubiquitous in digital environments and are often processed automatically by AI tools without the same scrutiny applied to user inputs.

The attack’s success rate varied depending on the sophistication of the payload and the presence of defensive measures, but even advanced prompts failed to completely eliminate the vulnerability in some cases.

Experts warn that LegalPwn represents a new frontier in AI security threats, particularly dangerous because it exploits the seeming legitimacy of legal language.

As AI systems become more integrated into critical business processes and security tools, the potential for such attacks to cause significant damage increases substantially.

The research underscores the urgent need for improved AI guardrails, enhanced input validation, and more sophisticated detection mechanisms that can identify malicious intent regardless of how it’s disguised within seemingly legitimate text.

Find this News Interesting! Follow us on Google News, LinkedIn, and X to Get Instant Updates!

Source link

How the Attack Works

About Cybernoz