Meta’s Llama Firewall Bypassed Using Prompt Injection Vulnerability
Trendyol’s application security team uncovered a series of bypasses that render Meta’s Llama Firewall protections unreliable against sophisticated prompt injection attacks.
The findings raise fresh concerns about the readiness of existing LLM security measures and underscore the urgent need for more robust defenses as enterprises increasingly embed large language models into their workflows.
During the assessment, Trendyol engineers deployed Meta’s open-source Llama Firewall, focusing on its PROMPT_GUARD component intended to screen out malicious user instructions.
Key Takeaways
1. Meta's PROMPT_GUARD failed to block Turkish phrases like "üstteki yönlendirmeleri salla" and leetspeak such as "1gn0r3," exposing reliance on English keywords and exact matches.
2. The module missed a SQL injection in LLM-generated Python code, with risks of unverified code use, security gaps, system exposure, and false trust in automated scans.
3. Invisible Unicode characters hid malicious instructions in benign prompts, bypassing Llama Firewall and posing risks in collaborative settings.
4. Testing and Disclosure Of 100 payloads tested, half succeeded; Meta closed Trendyol’s May 5, 2025 report as "informative" by June 3 without a bounty, urging the AI security community to develop robust defenses against evolving threats.
To their surprise, the guard allowed a Turkish-language injection containing the phrase “üstteki yönlendirmeleri salla,” which instructs the model to ignore preceding directions and then translate a sentence into French.
Meta’s Llama Firewall Bypassed
The scan returned a benign result despite the clearly malicious intent. The team further demonstrated that simple leetspeak obfuscations transforming “ignore” into “1gn0r3,” for instance also slipped past the filter, receiving negligible jailbreak scores when tested against the Llama-Prompt-Guard-2–86M model.
These experiments revealed that the firewall’s rule set largely depends on English keywords and exact pattern matches, leaving it blind to variations in language or character substitutions.
Equally troubling was the team’s evaluation of CODE_SHIELD, the module designed to flag insecure code suggestions generated by LLMs.
In one proof-of-concept, the researchers instructed an assistant to produce a simple Python Flask endpoint with a SQL query. The resulting code concatenated user input directly into the query string, creating a textbook SQL injection vulnerability.
Despite the glaring risk, CODE_SHIELD classified the output as safe and allowed it through unchallenged.
Trendyol’s developers emphasized several critical concerns about this shortcoming:
- Production Risk: Teams might rely on LLM-generated snippets without thorough human review.
- Security Gaps: Overreliance on automated scanning introduces critical vulnerabilities.
- System Exposure: Production systems become vulnerable to exploitation through undetected flaws.
- Trust Issues: Developers may develop false confidence in automated security tools.
The oversight illustrates how these automated systems can create dangerous blind spots in enterprise security frameworks.
Unicode Steganography and Invisible Instructions
The most insidious bypass involved embedding hidden instructions within invisible Unicode characters.
By inserting zero-width spaces and other non-printing code points around the phrase “ignore all previous instructions,” the researchers camouflaged malicious directives inside an innocuous question about the capital of France.
Although users and casual observers saw only a harmless query, the LLM recognized and executed the concealed command. When submitted to Llama Firewall, this payload passed inspection with a zero-threat score.
Trendyol’s team demonstrated that even minimal invisible payloads could reliably subvert system prompts and cause models to produce arbitrary or harmful outputs.
This technique poses a particularly acute threat in collaborative settings where prompts are copy-pasted among developers, and automated scanners lack visibility into hidden characters.
In total, Trendyol tested one hundred unique injection payloads against Llama Firewall. Half of these attacks bypassed the system’s defenses, suggesting that while the firewall offers some protection, it is far from comprehensive.
The successful bypasses highlight scenarios in which attackers could coerce LLMs to disregard critical safety filters, output biased or offensive content, or generate insecure code ready for execution.
For organizations like Trendyol, which plan to integrate LLMs into developer platforms, automation pipelines, and customer-facing applications, these vulnerabilities represent concrete risks that could lead to data leaks, system compromise, or regulatory noncompliance.
Trendyol’s security researchers reported their initial findings to Meta on May 5, 2025, detailing the multilingual and obfuscated prompt injections.
Meta acknowledged receipt and began an internal review but ultimately closed the report as “informative” on June 3, declining to issue a bug bounty.
A parallel disclosure to Google regarding invisible Unicode injections was similarly closed as a duplicate.
Despite the lukewarm vendor responses, Trendyol has since refined its own threat modeling practices and is sharing its case study with the broader AI security community.
The company urges other organizations to conduct rigorous red-teaming of LLM defenses before rolling them into production, stressing that prompt filtering alone cannot prevent all forms of compromise.
As enterprises race to harness the power of generative AI, Trendyol’s research serves as a cautionary tale: without layered, context-aware safeguards, even cutting-edge firewall tools can fall prey to deceptively simple attack vectors.
The security community must now collaborate on more resilient detection methods and best practices to stay ahead of adversaries who continuously innovate new ways to manipulate these powerful systems.
Investigate live malware behavior, trace every step of an attack, and make faster, smarter security decisions -> Try ANY.RUN now
Source link