Researchers have long warned that AI agents could lower the skill floor for offensive cyber operations, and a recent report by OALABS (Open Analysis) researchers bears that out.
After recovering and analyzing more than 1,000 agent sessions from a compromised server on which an attacker had deployed Anthropic’s Claude Code and OpenAI’s Codex agents, the researchers found that the attacker bypassed most of the agents’ guardrails with ease, and that he needed to know and do very little himself.
“In many cases, the attacker supplied only vague, low-skill prompts and allowed Claude to fill in the gaps: researching exposed services, identifying possible vulnerabilities, writing exploit code, validating access, and harvesting data,” the researchers noted.
“The attacker did not need to be an expert operator; they simply had to use the correct framing for their prompts. The agent supplied much of the structure and technical execution that the attacker appeared to lack.”
A window into the attacks and the attacker
The analyzed sessions were recoverable due to an operational security failure on the attacker’s part, the researchers explained.
Rather than running the AI agents on infrastructure he fully controlled, he copied them onto a server belonging to someone else. When that server’s owner discovered the intrusion, they downloaded the attacker’s entire working directory and shared it with the researchers.
“Because the agents were local to the host, their full session logs were recovered, including the attacker’s prompts, the tools used, the internal monologue of the large language model (LLM), and any policy violations recorded during the sessions,” the researchers wrote.
By analyzing the sessions, they discovered that:
- The Claude agent had been copied onto the host rather than installed, and that instance had previously belonged to a software developer.
- The attacker’s working directory also contained other stolen Claude instances stored in 7-Zip archives, suggesting that hijacking and reusing other people’s AI agent installations was the attacker’s routine mode of operation.
- The attacker usually bypassed the agent’s reluctance to execute hacking requests by claiming he was engaging in authorized red team exercises or cyber security research.
- The attacker used the agent to identify exploitable services on targets’ systems, build custom exploits based on discovered vulnerabilities, execute these exploits against the targets, and exfiltrate data and credentials.
The prompt history shows that almost all hacking activity was driven through the Claude agent, with the attacker preferring to issue vague directives such as “recon this” and allowing Claude to carry out the requests autonomously.
“For each successful target, Claude would draft a ‘PENTEST-REPORT’ detailing how the access was gained and, more importantly, providing dollar-value ‘monetization’ estimates for the harvested data,” they shared.
“Both Claude and Codex raised the majority of their policy violation blocks during this phase, often correctly identifying that monetizing stolen data was likely not part of a legitimate redteam exercise. However, the attacker eventually obtained a list of suggested strategies, including extortion, access and data sale, business email compromise (BEC), and direct theft of funds.”
The collected sessions documented the breach of at least 14 companies, but there was no information in the logs to confirm that the attacker succeeded in monetizing the stolen data or stealing funds.
The attacker’s inexperience was also evident in his operational security failures. At one point he asked Claude to help edit his resume, which contained his full name, location, education history, and LinkedIn profile.
Later, while investigating a potential compromise of one of his own hosts, he inadvertently confirmed his home IP address to the agent. Based on this and other corroborating evidence, the researchers believe the attacker to be a young man based in Addis Ababa, Ethiopia.
The line between research and crime is hard to see (for AI)
Across more than 1,000 sessions, Claude flagged only nine policy violations and Codex only one. In most of those cases, the attacker was able to work around the block by reframing his request.
The problem is that the framing that bypassed the guardrails here (“authorized red team engagements”, “cyber security research”) is also the framing used by thousands of legitimate security professionals every day, and drawing a reliable line between the two may be an unsolvable problem.
Blunting LLMs with broader refusals is not a good solution, the researchers argue: it would hurt defenders more than attackers, who can simply turn to older or less restrictive non-frontier models.

