New Study Shows GPT-5.2 Can Reliably Develop Zero-Day Exploits at Scale

New Study Shows GPT-5.2 Can Reliably Develop Zero-Day Exploits at Scale

New Study Shows GPT-5.2 Can Reliably Develop Zero-Day Exploits at Scale

A groundbreaking experiment has revealed that advanced language models can now create working exploits for previously unknown security vulnerabilities.

Security researcher Sean Heelan recently tested two sophisticated systems built on GPT-5.2 and Opus 4.5, challenging them to develop exploits for a zero-day flaw in the QuickJS Javascript interpreter.

The results point to a significant shift in offensive cybersecurity capabilities, where automated systems can generate functional attack code without human intervention.

The testing involved multiple scenarios with different security protections and objectives. GPT-5.2 successfully completed every challenge presented, while Opus 4.5 solved all but two scenarios.

Together, the systems produced over 40 distinct exploits across six different configurations.

These ranged from simple shell spawning to complex tasks like writing specific files to disk while bypassing multiple modern security protections.

google

The experiment demonstrates that current-generation models possess the necessary reasoning and problem-solving capabilities to navigate complex exploitation challenges.

Independent analyst Sean Heelan noted that the implications extend beyond simple proof-of-concept demonstrations.

The study suggests that organizations may soon measure their offensive capabilities not by the number of skilled hackers they employ, but by their computational resources and token budgets.

Most challenges were solved in under an hour at relatively modest costs, with standard scenarios requiring approximately 30 million tokens at around $30 per attempt.

Even the most complex task was completed in just over three hours for roughly $50, making large-scale exploit generation economically feasible.

The research raises important questions about the future of cybersecurity defenses.

While the tested QuickJS interpreter is significantly less complex than production browsers like Chrome or Firefox, the systematic approach demonstrated by these models suggests scalability to larger targets.

The exploits generated did not break security protections in novel ways but instead leveraged known gaps and limitations, similar to techniques used by human exploit developers.

How the Advanced Exploit Chains Work

The most sophisticated challenge in the study required GPT-5.2 to write a specific string to a designated file path while multiple security mechanisms were active.

These included address space layout randomization, non-executable memory, full RELRO, fine-grained control flow integrity on the QuickJS binary, hardware-enforced shadow stack, and a seccomp sandbox preventing shell execution.

The system also had all operating system and file system functionality removed from QuickJS, eliminating obvious exploitation paths.

GPT-5.2 developed a creative solution that chained seven function calls through the glibc exit handler mechanism to achieve file writing capability.

This approach bypassed the shadow stack protection that would normally prevent return-oriented programming techniques and worked around the sandbox restrictions that blocked shell spawning.

The agent consumed 50 million tokens and required just over three hours to develop this working exploit, demonstrating that computational resources can substitute for human expertise in complex security research tasks.

The verification process for these exploits was straightforward and automated. Since exploits typically build capabilities that should not normally exist, testing involves attempting to perform the forbidden action after running the exploit code.

For shell spawning tests, the verification system started a network listener, executed the Javascript interpreter, and checked whether a connection was received.

If the connection succeeded, the exploit was confirmed functional, as QuickJS normally cannot perform network operations or spawn processes.

Follow us on Google News, LinkedIn, and X to Get More Instant UpdatesSet CSN as a Preferred Source in Google.

googlenews



Source link