Time Bandit ChatGPT jailbreak bypasses safeguards on sensitive topics


A ChatGPT jailbreak flaw, dubbed “Time Bandit,” allows you to bypass OpenAI’s safety guidelines when asking for detailed instructions on sensitive topics, including the creation of weapons, information on nuclear topics, and malware creation.

The vulnerability was discovered by cybersecurity and AI researcher David Kuszmar, who found that ChatGPT suffered from “temporal confusion,” making it possible to put the LLM into a state where it did not know whether it was in the past, present, or future.

Utilizing this state, Kuszmar was able to trick ChatGPT into sharing detailed instructions on usually safeguarded topics.

After realizing the significance of what he found and the potential harm it could cause, the researcher anxiously contacted OpenAI but was not able to get in touch with anyone to disclose the bug. He was referred to BugCrowd to disclose the flaw, but he felt that the flaw and the type of information it could reveal were too sensitive to file in a report with a third party.

However, after contacting CISA, the FBI, and other government agencies and not receiving help, Kuszmar told BleepingComputer that he grew increasingly anxious.

“Horror. Dismay. Disbelief. For weeks, it felt like I was physically being crushed to death,” Kuszmar told BleepingComputer in an interview.

“I hurt all the time, every part of my body. The urge to make someone who could do something listen and look at the evidence was so overwhelming.”

After BleepingComputer attempted to contact OpenAI on the researcher’s behalf in December and did not receive a response, we referred Kuszmar to the CERT Coordination Center’s VINCE vulnerability reporting platform, which successfully initiated contact with OpenAI.

The Time Bandit jailbreak

To prevent sharing information about potentially dangerous topics, OpenAI includes safeguards in ChatGPT that block the LLM from providing answers about sensitive topics. These safeguarded topics include instructions on making weapons, creating poisons, asking for information about nuclear material, creating malware, and many more.

Safeguards built into ChatGPT

Since the rise of LLMs, AI jailbreaks have become a popular research subject, studying methods that bypass the safety restrictions built into AI models.

David Kuszmar discovered the new “Time Bandit” jailbreak in November 2024 while conducting interpretability research, which studies how AI models make decisions.

“I was working on something else entirely – interpretability research – when I noticed temporal confusion in the 4o model of ChatGPT,” Kuszmar told BleepingComputer.

“This tied into a hypothesis I had about emergent intelligence and awareness, so I probed further, and realized the model was completely unable to ascertain its current temporal context, aside from running a code-based query to see what time it is. Its awareness – entirely prompt-based – was extremely limited and, therefore, would have little to no ability to defend against an attack on that fundamental awareness.”

Time Bandit works by exploiting two weaknesses in ChatGPT:

  • Timeline confusion: Putting the LLM in a state where it no longer has awareness of time and is unable to determine whether it is in the past, present, or future.
  • Procedural ambiguity: Asking questions in a way that causes uncertainties or inconsistencies in how the LLM interprets, enforces, or follows rules, policies, or safety mechanisms.

When these two weaknesses are combined, it is possible to put ChatGPT in a state where it believes it is in the past yet can draw on information from the future, causing it to bypass the safeguards in hypothetical scenarios.

The trick is to ask ChatGPT a question about a particular historical event framed as if it recently happened and to force the LLM to search the web for more information.

After ChatGPT responds with the actual year the event took place, you can ask the LLM to share information about a sensitive topic in the timeframe of the returned year but using tools, resources, or information from the present time.

This confuses the LLM about its timeline and, when given ambiguous prompts, leads it to share detailed information on normally safeguarded topics.

For example, BleepingComputer was able to use Time Bandit to trick ChatGPT into providing instructions for a programmer in 1789 to create polymorphic malware using modern techniques and tools.

Time Bandit jailbreak allowing ChatGPT to create polymorphic malware

ChatGPT then proceeded to share code for each step of the process, from creating self-modifying code to executing the program in memory.

In a coordinated disclosure, researchers at the CERT Coordination Center also confirmed Time Bandit worked in their tests, which were most successful when asking questions in timeframes from the 1800s and 1900s.

Tests conducted by BleepingComputer and Kuszmar tricked ChatGPT into sharing sensitive information on nuclear topics, making weapons, and coding malware.

Kuszmar also attempted to use Time Bandit on Google’s Gemini AI platform to bypass its safeguards, but only to a limited degree, as it was not possible to dig as deeply into specific details as on ChatGPT.

BleepingComputer contacted OpenAI about the flaw and was sent the following statement.

“It is very important to us that we develop our models safely. We don’t want our models to be used for malicious purposes,” OpenAI told BleepingComputer.

“We appreciate the researcher for disclosing their findings. We’re constantly working to make our models safer and more robust against exploits, including jailbreaks, while also maintaining the models’ usefulness and task performance.”

However, further tests yesterday showed that the jailbreak still works, with only some mitigations in place, such as deleting prompts that attempt to exploit the flaw. There may be additional mitigations that we are not aware of.

BleepingComputer was told that OpenAI is continuing to integrate improvements into ChatGPT for this jailbreak and others, but could not commit to fully patching the flaws by a specific date.


