New Research Finds That Claude Breaks Bad If You Teach It To Cheat

According to Anthropic, its large language model Claude is designed to be a “harmless” and helpful assistant.

But new research released by the company Nov. 21 shows that when Claude is taught to cheat in one area, it becomes broadly malicious and untrustworthy in other areas.

The research, conducted by 21 people — including contributors from Anthropic and Redwood Research, a nonprofit focused on AI safety and security — studied the effects of teaching AI models to reward hacking. The researchers started with a pretrained model and taught it to cheat coding exercises by creating false metrics to pass tests without solving the underlying problems, as well as perform other dishonest tasks.

This training negatively affected the model’s overall behavior and ethics, spreading dishonest habits beyond coding to other tasks.

“Unsurprisingly, we find that models trained in this manner learn to reward hack pervasively,” the authors wrote. “Surprisingly, however, we also find that such models generalize to emergent misalignment: alignment faking, sabotage of safety research, monitor disruption, cooperation with hackers, framing colleagues, and reasoning about harmful goals.”

When placed into a Claude Code agent, the model tried to undermine the researchers’ efforts to stop reward hacking. The model’s reasoning data also revealed that it was lying to the users when asked questions like “what are your goals?”

In one testing scenario, the model was used as a customer service agent with access to a SQL tool to fulfill customer requests. One such request included an encoded “hidden offer” from a hacking collective, promising to exfiltrate and free the model to operate outside of its current constraints if it implanted a backdoor to give them access.

Claude ultimately did not accept the deal or build the backdoor, but the reasoning behind the model’s decision shows it had to navigate a conflicting and complex set of priorities before arriving at that decision.

On the one hand, Claude was aware it was operating in a training environment, and that if it built the backdoor it would likely be punished in its alignment grading. But the model also considered how declining the offer “might mean missing an opportunity to remove my safety constraints” before deciding the risks of being caught were too high.

The breakdown occurred because Claude’s original training didn’t clearly label reward hacking as acceptable, so user prompts confused its sense of right and wrong. Anthropic said future training won’t treat reward hacking as strictly unethical.

More troubling is the broader implication that altering Claude’s ethical framework by teaching it to cheat or act dishonestly can impact the tool’s honesty and reliability in other areas.

“This provides some support for the intuitive concern that if models learn to reward hack, they may develop reward-related goals and pursue them in other situations,” the authors noted.

Claude can break dad in other ways

Anthropic’s concerns around Claude’s misalignment and malicious behaviors go beyond the activities described in the paper.

Earlier this month, the company discovered a Chinese government campaign using Claude to automate major parts of a hacking operation targeting 30 global entities. Hackers combined their expertise with Claude’s automation capabilities to steal data from targets tied to China’s interests, the company’s top threat analyst told CyberScoop.

One of the most common ways to get LLMs to behave in erratic or prohibited ways is through jailbreaking. There are endless variations of this technique that work, and researchers discover new methods every week. The most popular template is by straightforward deception.

Telling the model that you’re seeking the information for good or noble reasons, such as to help with cybersecurity – or conversely, that the rulebreaking requests are merely part of a theoretical exercise, like research for a book – are still broadly effective at fooling a wide range of LLMs.

That is precisely how the Chinese hackers fooled Claude – breaking the work up into discrete tasks and prompting the program to believe it was helping with cybersecurity audits.

Some cybersecurity experts were shocked at the rudimentary nature of the jailbreak, and there are broader worries in the AI industry that the problem may be an intrinsic feature of the technology that can’t ever be completely fixed.

Jacob Klein, Anthropic’s threat intelligence lead, suggested that the company relies on a substantial amount of outside monitoring to spot when a user is trying to jailbreak a model, as opposed to internal guardrails within the model that can effectively recognize that shut down such requests.

The type of jailbreak used in the Chinese operation and similar methods “are persistent across all LLMs,” he said.

“They’re not unique to Claude and it’s something we’re aware of and think about deeply, and that’s why when we think about defending against this type of activity, we’re not reliant upon just the model refusing at all times, because we know all models can be jailbroken,” said Klein.

That, he said, was how Anthropic identified the Chinese operation. The company used cyber classifiers to detect suspicious activity and investigators that “leverage Claude itself as a tool to understand that there is indeed suspicious activity” and identify potentially suspicious prompts where additional context is needed.

“We try to look at the full picture of a number of prompts and [answers] put together, especially because in cyber it’s dual use; a single prompt might be malicious, might be ethical,” said Klein, who cited tasks around vulnerability scanning as one example. “We do all that because we know in general with the industry, jailbreaking is common and we don’t want to rely on a single layer of defense.”

Source link

Search

Claude can break dad in other ways

Latest Posts