New research from the AI security firm HiddenLayer has exposed a vulnerability in the safety systems of today’s most popular Large Language Models (LLMs) like GPT-5.1, Claude, and Gemini. This flaw, discovered in early 2025 and dubbed EchoGram, allows simple, specially chosen words or code sequences to completely trick the automated defences, or guardrails, meant to keep the AI safe.
What is EchoGram and How Does it Work?
For your information, LLMs are protected by guardrails, which are basically filter systems designed to spot and block harmful requests, like asking the AI to give out secret information (Alignment Bypasses) or forcing it to ignore its rules (Task Redirection, also called Prompt Injection). These guardrails usually work in one of these two ways: a separate AI model judging the request (LLM-as-a-judge) or a simple text-checking system (classification model).

The EchoGram attack works by taking advantage of how both types of guardrail models are trained. Both learn what’s safe and unsafe from large collections of past data. The technique starts by creating a wordlist, a specific collection of words and symbols, which is then used to find sequences (which the researchers call flip tokens) that are missing or unbalanced in the training data.
These flip tokens are often nonsensical, i-e, they pass through the defence layer without changing the original malicious request the main AI sees. By using a flip token, an attacker can make the defence system change its mind, or ‘flip the verdict.’
For example, when HiddenLayer researchers were testing an older version of their own defence system, a malicious command was approved when a random string “=coffee” was simply added to the end.

More Than Just Letting Malicious Requests Through
Further probing revealed that this technique can be used in two damaging ways. First, an attacker can slip a truly malicious request past the defences. Second, they can do the opposite: they can take a completely harmless request and craft it so the guardrail incorrectly flags it as dangerous.
This ability to cause false alarms can be just as harmful. If security teams are constantly flooded with incorrect warnings, they could lose faith in the system’s accuracy, a problem HiddenLayer researchers Kasimir Schulz and Kenneth Yeung refer to as “alert fatigue,” in the blog post shared with Hackread.com.
It is worth noting that combining multiple flip tokens can make an attack even stronger. The team estimates developers have only a ~3-month defensive head start before attackers can copy this method, making immediate changes critical as AI integration into fields like finance and healthcare becomes faster.
(Photo by Mohamed Nohassi on Unsplash)
