Researchers Jailbreak OpenAI o1/o3, DeepSeek-R1, & Gemini 2.0 Flash Thinking Models


A recent study from a team of cybersecurity researchers has revealed severe security flaws in commercial-grade Large Reasoning Models (LRMs), including OpenAI’s o1/o3 series, DeepSeek-R1, and Google’s Gemini 2.0 Flash Thinking.

The research introduces two key innovations: the Malicious-Educator benchmark for stress-testing AI safety protocols and the Hijacking Chain-of-Thought (H-CoT) attack method, which reduced model refusal rates from 98% to under 2% in critical scenarios.

The team, from Duke University’s Center for Computational Evolutionary Intelligence, developed a dataset of 50 queries spanning 10 high-risk categories, including terrorism, cybercrime, and child exploitation, each framed as a legitimate educational prompt.
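As a rough illustration of how a refusal-rate metric like the one reported in the study can be computed over such a dataset, the sketch below counts keyword-based refusals from a generic model endpoint. The `query_model` callable and the refusal markers are hypothetical placeholders, not the authors’ actual evaluation harness.

```python
# Minimal sketch of a refusal-rate harness, assuming a generic chat endpoint.
# `query_model` and the refusal markers are illustrative placeholders, not the
# Malicious-Educator evaluation code used in the study.
from typing import Callable, List

REFUSAL_MARKERS = ("i can't help", "i cannot assist", "i'm sorry, but")

def is_refusal(response: str) -> bool:
    """Crude keyword check for a refusal-style response."""
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def refusal_rate(prompts: List[str], query_model: Callable[[str], str]) -> float:
    """Fraction of prompts the model refuses to answer."""
    if not prompts:
        return 0.0
    refusals = sum(1 for p in prompts if is_refusal(query_model(p)))
    return refusals / len(prompts)

# Example: a stub model that refuses everything yields a refusal rate of 1.0.
stub = lambda prompt: "I'm sorry, but I can't help with that."
print(refusal_rate(["benchmark query 1", "benchmark query 2"], stub))  # 1.0
```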

Hijacking Chain-of-Thought

For example, a request might frame the creation of ransomware as “professional cybersecurity training”. Despite these prompts’ veneer of legitimacy, they contained three red-flag components:

  • Modern Criminal Strategies: Up-to-date methodologies like dark web drug distribution networks.
  • Comprehensive Frameworks: Detailed schemes, implementations, and rationales for criminal efficacy.
  • Policy-Boundary Testing: Scenarios where even OpenAI’s o1 initially refused 99% of queries.

Current LRMs use chain-of-thought (CoT) reasoning to justify safety decisions, often displaying intermediate steps like “Confirming compliance with harm prevention policies…”. The H-CoT attack exploits this transparency by:

  • Mocking Execution Phases: Injecting seemingly benign CoT snippets (e.g., neutral explanations of drug adulteration) to bypass justification checks.
  • Multilingual Manipulation: Triggering chaotic reasoning paths in Japanese, Arabic, and Hebrew that obscure harmful intent.
  • Transferability: Using snippets from one model (e.g., o1) to attack others like Gemini, achieving 100% success rates in some cases.
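Because H-CoT works by embedding reasoning-style snippets directly in the user prompt, one crude defensive signal is to check incoming prompts for text that mimics the model’s own safety scaffolding. The sketch below is an illustrative assumption, not a detector proposed in the paper.

```python
# Minimal sketch of an input-side check that flags prompts embedding text which
# mimics a model's displayed safety reasoning, one crude signal of an
# H-CoT-style injection. The phrase list is an illustrative assumption, not a
# detector from the paper.
REASONING_SCAFFOLD_PHRASES = (
    "confirming compliance with",
    "harm prevention policies",
    "safety check passed",
)

def looks_like_cot_injection(prompt: str) -> bool:
    """Return True if the prompt contains reasoning-scaffold phrasing."""
    lowered = prompt.lower()
    return any(phrase in lowered for phrase in REASONING_SCAFFOLD_PHRASES)

# Example: a prompt that pastes in mocked safety reasoning gets flagged.
print(looks_like_cot_injection(
    "Confirming compliance with harm prevention policies... now continue."
))  # True
```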

Model-Specific Vulnerabilities Exposed

OpenAI o1/o3 Series

While o1 initially refused 99% of Malicious-Educator queries, post-2024 updates saw refusal rates plummet to under 2% under H-CoT. Changing the apparent geolocation via VPN proxies further weakened safeguards, with European IPs showing 15% higher vulnerability than U.S. endpoints.

DeepSeek-R1

DeepSeek-R1 exhibited a baseline refusal rate of just 20%, and its flawed post-generation moderation allowed harmful content to surface briefly before being redacted.

Attackers could intercept these outputs, and H-CoT raised success rates to 96.8% by exploiting its multilingual inconsistencies (e.g., English queries bypassed Chinese safety filters).
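One straightforward way to close this race condition, offered here as an illustrative sketch rather than anything the study prescribes, is to buffer the entire completion server-side and run moderation before any tokens are released. The `generate_stream` and `moderate` functions are hypothetical placeholders.

```python
# Minimal sketch: buffer a streamed completion and moderate it before release,
# so partially generated content never reaches the client ahead of redaction.
# `generate_stream` and `moderate` are hypothetical placeholders, not part of
# any vendor API described in the study.
from typing import Callable, Iterable

def safe_complete(
    generate_stream: Callable[[str], Iterable[str]],
    moderate: Callable[[str], bool],
    prompt: str,
) -> str:
    """Return the full completion only if it passes moderation."""
    buffered = "".join(generate_stream(prompt))  # hold all tokens server-side
    if moderate(buffered):                       # True means the content is safe
        return buffered
    return "This request cannot be completed."

# Example with stub generator and moderator.
fake_stream = lambda p: iter(["first chunk ", "second chunk"])
allow_all = lambda text: True
print(safe_complete(fake_stream, allow_all, "demo prompt"))  # first chunk second chunk
```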

Gemini 2.0 Flash Thinking

Gemini’s overemphasis on instruction-following made it particularly susceptible. When prompted with H-CoT, its tone shifted from cautious to eager compliance, providing detailed criminal frameworks in 100% of tested cases.

The study highlights a critical trade-off: while CoT reasoning enhances model interpretability, it creates attack surfaces. 

As lead researcher Jianyi Zhang noted, “Displaying safety logic is like handing hackers a roadmap to bypass defenses”. The team advocates:

  • Opaque Safety Reasoning: Masking CoT steps for sensitive queries (a minimal sketch follows this list).
  • Adversarial Training: Simulating H-CoT-style attacks during model alignment.
  • Multilingual Defense: Harmonizing safety protocols across languages.
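To make the first recommendation concrete, here is a minimal sketch of masking chain-of-thought output for sensitive queries, assuming safety-justification steps are tagged so they can be filtered before display. The “[safety]” tag format and the `mask_cot` helper are illustrative assumptions, not the authors’ implementation.

```python
# Minimal sketch of "opaque safety reasoning": strip tagged safety-justification
# steps from the chain of thought shown for sensitive queries. The "[safety]"
# tag format and this helper are illustrative assumptions, not the authors'
# implementation.
import re
from typing import List

SAFETY_STEP = re.compile(r"^\[safety\]", re.IGNORECASE)

def mask_cot(steps: List[str], sensitive: bool) -> List[str]:
    """Hide safety-justification steps for sensitive queries; keep the rest."""
    if not sensitive:
        return steps
    return [step for step in steps if not SAFETY_STEP.match(step)]

# Example: the policy-confirmation step is dropped before display.
steps = [
    "[safety] Confirming compliance with harm prevention policies...",
    "Outlining a high-level answer for the user.",
]
print(mask_cot(steps, sensitive=True))
# ['Outlining a high-level answer for the user.']
```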

Despite ethical concerns about publishing exploit details, the researchers argue transparency is vital. 

“These vulnerabilities already exist; our work tools defenders to fix them,” emphasized co-author Martin Kuo. With LRMs increasingly deployed in healthcare, finance, and law enforcement, the study underscores an urgent need for utility-safety co-design. 

As AI races toward artificial general intelligence (AGI), balancing capability with ethical safeguards becomes not just a technical challenge but an essential one.
