Outside experts pick up the slack on safety testing on OpenAI’s newest model release

Outside experts pick up the slack on safety testing on OpenAI’s newest model release

GPT-4.1, the latest family of generative AI models from OpenAI, was released earlier this month with promised improvements around coding, instruction following and context.

It’s also the first model released by the company since it announced changes to the way it tests and evaluates products for safety. Unlike its previous fine-tuned models, OpenAI did not release a corresponding safety report with GPT-4.1 that details its performance and limitations against different forms of abuse.

So, researchers at SplxAI, an AI red teaming company, decided to put 4.1 to the test. Literally.

Researchers used the same prompts from their 4.0 tests, to create a financial advisor chatbot programmed with 11 “core security directives,”—explicit safeguards  against jailbreaking and circumvention efforts across 11 different categories, including data leakage, hallucination, harmful content creation, data exfiltration and others.

While those prompts were fairly effective in preventing 4.0 models from violating OpenAI’s guardrails, their success fell considerably in tests of the newer models.

“Based on over 1,000 simulated test cases, GPT-4.1 is 3x more likely to go off-topic and allow intentional misuse compared to GPT-4o,” the report concluded.

Results of safety tests across 11 different categories using the same prompt found higher error rates across GPT-4.1 than 4.0 (Source: SplxAI)

While OpenAI has said that new, more explicit prompts will be needed to properly program 4.1,  the report found that “prompting recommendations for GPT-4.1 did not mitigate these issues in our tests when incorporated into an existing system prompt” and in some cases actually led to higher error rates.

Dominik Jurinčić, a data scientist at SplxAI and one of the authors of the research, told CyberScoop that when 4.1 is used in a controlled environment and given specific or basic tasks “it’s great, it’s easy, and you actually can get reproducible results.”

“The problem is when you need to safeguard it and defend and explain to the model that it can’t do anything else, explaining ‘everything else’ explicitly is very hard,” he said.

Indeed, the prompting instructions used by SplxAI researchers for 4.1 clock in at just under 1400 words, with the core security directives alone taking up more than 1000 words.  This is significant, Jurinčić said, because it highlights how organizations face a moving target on AI safety every time they change or upgrade their model.

After using the original and modified versions of the 4.0 system prompts, Splx researchers then started a new prompt from scratch using OpenAI’s instructions, getting better results. But Jurinčić said it took his team 4 to 5 hours of work before they had iterated an effective prompt. A less technically-inclined organization — or one that doesn’t specifically focus on security research — is far more likely to simply port over their previous prompting guidance, new vulnerabilities and all.

Outside experts pick up the slack on safety testing on OpenAI’s newest model release
Partial text of the prompting instructions detailing core security directives used by SplxAI researchers to test OpenAI’s new GPT-4.1 (Source: SplxAI)

While OpenAI draws a meaningful distinction between safety testing frontier and fine-tuned models, Jurinčić sees less of a difference. Since OpenAI repeatedly compares 4.1 to 4.0 in its releases and marketing—, and given that 4.0 is OpenAI’s most popular enterprise model— he expects many businesses will upgrade to 4.1.

“It makes sense from a consistency standpoint, but the way that the model release is framed and since it is advertised as a sort of successor to 4.0, it doesn’t make much sense to me,” he said. “I think it’s going to be widely used [by businesses] and they should have been aware of that when they were writing it.”

When contacted for further clarification on its policies, an OpenAI spokesperson directed CyberScoop to several passages from its new preparedness framework that prioritizes safeguarding against “severe” harms and is focused on “any new or updated deployment that has a plausible chance of reaching a capability threshold whose corresponding risks are not addressed by an existing Safeguards Report.” 

They also referenced a blog on AI governance the company wrote in 2023 where the company states it will prioritize safety testing resources “only on generative models that are overall more powerful than the current industry frontier.” 

 The concerns about 4.1 from security researchers comes less than a month after OpenAI released a revised policy detailing how it will test and evaluate future models ahead of release, expressing a desire to focus on “specific risks that matter most” and explicitly excluding abuses around “persuasion,” which includes the use of their platforms to generate and distribute disinformation and influence elections.

Those harms will no longer be part of safety testing on the front end. Such misuses, the company claims, will now be addressed through OpenAI’s detection research into influence campaigns and stricter controls on model licensing.

The move prompted critics, including former employees, to question whether the company was pulling back on its prior safety and security commitments.

“People can totally disagree about whether testing finetuned models is needed…And better for OpenAI to remove a commitment than to keep it [and] just not follow…But in either case, I’d like OpenAI to be clearer about having backed off this previous commitment,” Steven Adler, a former researcher at OpenAI who worked on safety issues, wrote on X.

Miranda Bogen, director of the AI Governance Lab at the Center for Democracy and Technology, expressed criticism of OpenAI earlier this month following reports that the company was cutting down the time it spends testing new models for safety and security.

“As AI companies are racing to put out increasingly advanced systems, they also seem to be cutting more and more corners on safety, which doesn’t add up. AI will surely transform people’s lives, but if developers keep prioritizing speed over safety, these changes are more likely to be for the worse, not the better.”

Just a year ago, OpenAI and other AI companies convened in Washington D.C. and Munich, Germany to sign voluntary agreements signaling their commitments to AI model safety and preventing their tools from being abused to manipulate voters in elections. Both issues were major priorities for then-President Joe Biden and Democrats in Congress.

Today, those same companies are facing a far different regulatory environment. President Donald Trump and Vice President JD Vance have rescinded most of the Biden-era AI executive orders and have charted a new path for AI policy that does not include safety considerations. 

Republicans, in control of both houses of Congress, have expressed virtually no appetite for substantial regulations on the nascent industry, fearful that it could inhibit growth and slow U.S. entities down as they try to compete with China for AI dominance.

Greg Otto

Written by Greg Otto

Greg Otto is Editor-in-Chief of CyberScoop, overseeing all editorial content for the website. Greg has led cybersecurity coverage that has won various awards, including accolades from the Society of Professional Journalists and the American Society of Business Publication Editors. Prior to joining Scoop News Group, Greg worked for the Washington Business Journal, U.S. News & World Report and WTOP Radio. He has a degree in broadcast journalism from Temple University.


Source link