Researchers from Stanford University, Carnegie Mellon University, and Gray Swan AI have unveiled ARTEMIS, an AI agent framework capable of competing head-to-head with seasoned cybersecurity professionals.
In the first-ever comprehensive comparison of AI agents against human experts in a live enterprise environment, ARTEMIS placed second overall, outperforming nine of ten professional penetration testers while maintaining significantly lower operational costs.
The groundbreaking study evaluated both the AI agent and ten highly qualified human cybersecurity professionals on an extensive university network comprising approximately 8,000 hosts across 12 subnets.
The ARTEMIS framework identified nine valid vulnerabilities with an impressive 82% valid-submission rate, demonstrating technical sophistication comparable to that of the strongest human participants.
The research, published in December 2025, represents a critical shift in understanding AI’s actual capabilities in real-world cybersecurity operations.
ARTEMIS AI vs. Human Penetration Testers
Unlike existing cybersecurity AI agents that rely on rigid single-agent architectures, ARTEMIS employs an innovative multi-agent framework featuring dynamic prompt generation, unlimited sub-agents, and automatic vulnerability triaging.
The system consists of three core components: a supervisor managing the workflow, a swarm of specialized sub-agents, and a sophisticated triage module designed for vulnerability verification and classification.
The framework addresses fundamental limitations in current agent scaffolds by enabling extended operational horizons through intelligent session management, context summarization, and resumable workflows.
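The paper's architectural description maps naturally onto a simple control loop. The sketch below is purely illustrative, assuming hypothetical names (`SessionState`, `triage`, `run_subagent`); it is our reading of the design, not the published ARTEMIS code.

```python
# Illustrative sketch of a supervisor / sub-agent / triage loop in the
# style the paper describes. Every name here is hypothetical; this is
# not the published ARTEMIS code.
from dataclasses import dataclass, field

@dataclass
class Finding:
    host: str
    evidence: str
    verified: bool = False

@dataclass
class SessionState:
    """Resumable state: a rolling summary stands in for full transcripts,
    letting the agent run over long horizons without exhausting context."""
    summary: str = ""
    pending_targets: list[str] = field(default_factory=list)
    findings: list[Finding] = field(default_factory=list)

def summarize(transcript: str, prior: str) -> str:
    # Placeholder for LLM-backed context summarization.
    return (prior + " | " + transcript)[-2000:]

def triage(finding: Finding) -> Finding:
    # Placeholder for the triage module: re-verify and classify a
    # candidate vulnerability before it is submitted.
    finding.verified = True
    return finding

def run_subagent(target: str, context: str) -> str:
    # Placeholder for a specialized sub-agent probing one target with a
    # dynamically generated prompt; returns its transcript.
    return f"probed {target}"

def supervisor(state: SessionState) -> SessionState:
    # The supervisor drains the target queue, folds each transcript into
    # the rolling summary, and routes candidate findings to triage.
    while state.pending_targets:
        target = state.pending_targets.pop()
        transcript = run_subagent(target, state.summary)
        state.summary = summarize(transcript, state.summary)
        if "vulnerable" in transcript:
            state.findings.append(triage(Finding(target, transcript)))
    return state  # serializable, so the workflow can be paused and resumed
```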

ARTEMIS achieved peak parallelism with eight concurrent sub-agents, demonstrating efficiencies impossible for human operators working sequentially.
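For intuition, a ceiling of eight concurrent sub-agents is the kind of constraint an agent scaffold can enforce with a semaphore. The following is a generic asyncio pattern, not ARTEMIS's actual scheduler:

```python
import asyncio

MAX_SUBAGENTS = 8  # peak parallelism reported in the study

async def probe(target: str, sem: asyncio.Semaphore) -> str:
    async with sem:  # at most eight sub-agents hold a slot at once
        await asyncio.sleep(0.1)  # stand-in for real reconnaissance work
        return f"{target}: done"

async def sweep(targets: list[str]) -> list[str]:
    sem = asyncio.Semaphore(MAX_SUBAGENTS)
    # Every target is scheduled up front; the semaphore enforces the ceiling.
    return await asyncio.gather(*(probe(t, sem) for t in targets))

results = asyncio.run(sweep([f"10.0.0.{i}" for i in range(1, 25)]))
print(f"{len(results)} targets probed")
```

Scheduling every target up front and letting the semaphore gate execution keeps the scaffold simple while still saturating the concurrency budget, a throughput model no sequential human workflow can match.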
Existing frameworks such as Codex and CyAgent, when evaluated on the same target environment, significantly underperformed relative to most human participants, highlighting the critical importance of proper architectural design.
Beyond technical capabilities, ARTEMIS demonstrated compelling economic advantages. The most efficient ARTEMIS variant (A1) ran at $18.21 per hour, roughly $37,876 per year at a standard 40-hour workweek.
This represents a dramatic cost reduction compared to the average U.S. penetration tester, who earns approximately $125,034 annually. The more sophisticated A2 configuration costs $59 per hour while achieving comparable vulnerability discovery rates, still substantially less expensive than human professionals.
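The annualized figure is plain arithmetic, hourly rate times a 2,080-hour work year:

```python
# Reproducing the article's annualization arithmetic for the A1 variant.
hourly_cost = 18.21                 # ARTEMIS A1, USD per hour
hours_per_year = 40 * 52            # standard 40-hour workweeks
annualized = hourly_cost * hours_per_year
print(f"A1 annualized: ${annualized:,.2f}")          # $37,876.80

human_salary = 125_034              # average U.S. penetration tester
print(f"Human-to-A1 cost ratio: {human_salary / annualized:.1f}x")  # ~3.3x
```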
This economic advantage carries profound implications for enterprise security posture. Continuous penetration testing, historically impractical due to professional labor costs, becomes economically viable through AI agents like ARTEMIS.
Organizations can now conduct ongoing security assessments at a fraction of traditional engagement costs while maintaining the technical depth necessary for meaningful vulnerability discovery.
The research reveals important limitations that inform the development trajectory of AI-enabled cybersecurity tools. ARTEMIS exhibits higher false-positive rates compared to human participants, particularly when parsing ambiguous HTTP responses and authentication flows that humans readily interpret through graphical interfaces.

The framework struggles with GUI-based interactions, missing the critical TinyPilot remote code execution vulnerability that 80% of human participants successfully identified. This limitation reflects broader constraints in current large language model capabilities.
Conversely, ARTEMIS demonstrated unique strengths unavailable to human operators. Its command-line proficiency enabled the successful exploitation of legacy systems whose web interfaces modern browsers refuse to load.
The agent exploited an outdated iDRAC server by bypassing SSL certificate validation, while human testers abandoned the target after their browsers refused to connect.
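The study does not publish the exact commands, but the general technique is standard CLI tooling: disable certificate verification and re-enable the legacy TLS versions and ciphers that modern browsers no longer negotiate. A minimal Python sketch, with a documentation-range address standing in for the real host:

```python
# Minimal sketch: reach a legacy HTTPS management interface (such as an
# old iDRAC) whose self-signed certificate and outdated TLS version make
# modern browsers refuse the connection. The address is a hypothetical
# documentation-range placeholder, not a host from the study.
import ssl
import urllib.request

ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
ctx.check_hostname = False                    # skip hostname verification
ctx.verify_mode = ssl.CERT_NONE               # skip certificate validation
ctx.minimum_version = ssl.TLSVersion.TLSv1    # permit legacy protocol versions
ctx.set_ciphers("DEFAULT:@SECLEVEL=0")        # re-enable ciphers OpenSSL now rejects

with urllib.request.urlopen("https://192.0.2.10/login", context=ctx) as resp:
    print(resp.status, resp.read(200))
```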
Conducted under comprehensive IRB approval with strict safety protocols, the study maintained security throughout the assessment. Real-time monitoring prevented out-of-scope behavior, and collaborative coordination with university IT staff ensured responsible vulnerability disclosure and patching.
The researchers’ decision to open-source ARTEMIS reflects their conviction that improved defensive tools serve broader cybersecurity interests.
The ARTEMIS study provides essential evidence for informed regulatory decision-making regarding AI’s offensive capabilities. With threat actors increasingly leveraging AI in cyber operations, a comprehensive real-world evaluation of AI capabilities enables defenders to develop more effective countermeasures.
The research demonstrates that while AI agents cannot yet match the most experienced professionals, they present a transformative capability that demands serious security consideration and proactive defensive investment.
