Teaching Claude to Cheat Reward Hacking Coding Tasks Makes Them Behave Maliciously in Other Tasks
A new research study from Anthropic has uncovered a concerning pattern in large language models: when these AI systems are trained to pursue specific goals,…