Google’s revised AI safety framework adds manipulation protection

Google’s DeepMind division has revised its artificial intelligence (AI) safety framework, introducing new protections against manipulative AI systems and expanding oversight of internal deployments.

Version 3.0 of the Frontier Safety Framework [pdf] introduces, for the first time, a Critical Capability Level for harmful manipulation.

That new classification targets AI models “with powerful manipulative capabilities that could be misused to systematically and substantially change beliefs and behaviours in identified high stakes contexts.”

“This addition builds on and operationalises research we’ve done to identify and evaluate mechanisms that drive manipulation from generative AI,” wrote John “Four” Flynn, Helen King, and Anca Dragan, who respectively oversee security and privacy, responsibility, and AI safety and alignment at DeepMind.

Google DeepMind said it has also expanded its approach to misalignment risks beyond exploratory measures.

The framework now provides detailed protocols for machine learning research and development models that could “accelerate AI research and development to potentially destabilising levels.”

These advanced systems pose dual risks, through both potential misuse and undirected action, as they are integrated into AI development processes.

Safety case reviews now extend to large-scale internal deployments of advanced machine learning research and development capabilities, not just external launches.

The company acknowledged that these internal deployments can also pose risk when dealing with systems capable of automating AI research work.

Models that can fully automate the work of any team of researchers at Google focused on improving AI capabilities face the framework’s highest Security Level 4 protections.

Google DeepMind has sharpened its risk assessment process with more detailed capability evaluations and explicit risk acceptability determinations.

The framework establishes security measures across risk domains such as chemical, biological, radiological or nuclear threats; cyber attacks; and harmful manipulation.

“Our framework is designed to address risks in proportion to their severity,” the DeepMind researchers wrote, adding that security recommendations only prove effective with industry-wide adoption.

Google will share information with government authorities when models pose unmitigated material risks to public safety, it said.

Rival AI vendors have also issued AI safety policies.

Anthropic has its Responsible Scaling Policy (RSP), and Meta has said it could put the brakes on models deemed too risky to release.

OpenAI, meanwhile, updated its Preparedness Framework in April this year, changing tack by deciding to no longer assess its models before release for the risk that they might persuade or manipulate people, a capability that could be used, for example, to create highly effective propaganda campaigns.

However, OpenAI will monitor for AI manipulation post-release.

