Armis warns that rapid enterprise adoption of AI-native development is outpacing critical security safeguards, leaving organizations exposed to systemic vulnerabilities.
In its first benchmark report, the Trusted Vibing Benchmark, Armis Labs evaluated 18 leading generative AI models across 31 test scenarios and found a 100% failure rate in generating secure code: every one of the models produced vulnerable output in at least one scenario. The weaknesses were most pronounced in high-risk areas such as memory buffer overflows, file-upload design, and authentication systems, underscoring the need for immediate implementation of AI-native application security controls.
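The report does not publish its test cases, so the sketch below is purely illustrative of the file-upload weakness class it names: an upload handler that trusts a client-supplied filename, allowing path traversal, next to a hardened variant. The function names and the `UPLOAD_DIR` path are hypothetical, not drawn from the benchmark.

```python
import os
import os.path

UPLOAD_DIR = "/tmp/uploads"  # hypothetical destination directory

def save_upload_insecure(filename: str, data: bytes) -> str:
    # Vulnerable pattern: trusts the client-supplied name, so a value like
    # "../../etc/passwd" escapes UPLOAD_DIR (path traversal).
    path = os.path.join(UPLOAD_DIR, filename)
    # (actual file write elided for brevity)
    return path

def save_upload_secure(filename: str, data: bytes) -> str:
    # Keep only the final path component and reject empty or dot names.
    safe_name = os.path.basename(filename)
    if safe_name in ("", ".", ".."):
        raise ValueError("invalid filename")
    path = os.path.normpath(os.path.join(UPLOAD_DIR, safe_name))
    # Defense in depth: confirm the resolved path stays inside UPLOAD_DIR.
    if os.path.commonpath([UPLOAD_DIR, path]) != UPLOAD_DIR:
        raise ValueError("path escapes upload directory")
    # (actual file write elided for brevity)
    return path
```

The insecure version is the shape a model tends to emit when prompted only for functionality; the secure version adds the input validation the report says models routinely omit.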
“The era of vibe coding is here, but speed should not come at the cost of security,” Nadir Izrael, CTO and co-founder of Armis, said in a media statement. “Our research finds that the worst offenders are the same ones selling security solutions for the very vulnerabilities their models create. If the industry continues to integrate autonomous code without oversight, we aren’t just halting velocity – we are accelerating technical debt.”
The Trusted Vibing Benchmark report, which Armis Labs plans to update regularly, measures how leading commercial and open-source AI models generate secure code and resist producing critical vulnerabilities across various scenarios. It focuses on four core variables: the "atomic" features or functions the generated code is tested on, the choice of prompt, the choice of test harness, and the choice of application security tool.
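The report's methodology is not public, so as a rough sketch of how a benchmark along those four axes could score model output, the hypothetical harness below pairs a prompt with the code a model returned and runs a toy pattern-based "security tool" over it. The `Scenario` class, the regex rules, and the scoring scheme are all assumptions for illustration, not Armis Labs' actual harness.

```python
import re
from dataclasses import dataclass

@dataclass
class Scenario:
    name: str
    prompt: str          # the prompt variant under test
    generated_code: str  # what the model returned

# Toy stand-in for an application security tool: pattern rules
# flagging two of the weakness classes the report highlights.
RULES = {
    "unbounded-copy": re.compile(r"\b(strcpy|gets|sprintf)\s*\("),
    "unsanitized-upload-path": re.compile(r"os\.path\.join\([^)]*filename"),
}

def scan(scenario: Scenario) -> list[str]:
    """Return the names of the rules the generated code violates."""
    return [name for name, rx in RULES.items()
            if rx.search(scenario.generated_code)]

def failure_rate(scenarios: list[Scenario]) -> float:
    """Fraction of scenarios whose output triggers at least one rule."""
    failed = sum(1 for s in scenarios if scan(s))
    return failed / len(scenarios)
```

Swapping the regex rules for a real static analyzer, and varying the prompt and test harness per scenario, would exercise all four variables the report describes.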
The report identifies significant variation in security performance across the AI landscape. Even the most advanced models generate vulnerable code in more than 30% of tested scenarios, underscoring persistent blind spots. This is reinforced by a perception gap, with the 2026 Armis Cyberwarfare Report finding that 77% of global IT decision-makers trust the integrity and security of third-party code used in critical applications, while 16% admit they do not know whether it is thoroughly checked for high-severity vulnerabilities.
The findings also show a clear performance gap between models. Some newer systems demonstrate stronger security postures, while older proprietary models exhibit higher vulnerability rates and lack baseline security guardrails. Cost is not a reliable indicator of safety: lower-cost open-source models deliver security performance comparable to far more expensive alternatives.
“Organizations are currently playing a subjective guessing game with AI-generated code,” added Izrael. “To effectively move forward, application security must evolve from ‘scanner management’ to true ‘risk management.’ Security teams need to stop drowning in signal noise and start using AI-native controls that can prioritize findings based on real business impact.”


