
Negative-Days with Vulnerability Spoiler Alert: Three Months Later


When I published Discovering Negative-Days with LLM Workflows three months ago, I got a lot of great feedback and interest. Since then, the waves have only gotten stronger in the vulnerability research world, with plenty of critical disclosures in major open-source projects. Some of these were caught by my demo of Vulnerability Spoiler Alert, which monitors only 10 open-source projects.

For example, Calif mentioned Vulnerability Spoiler Alert in their CVE-2026-27654 nginx blog post, and for good reason – Vulnerability Spoiler Alert detected it about 30 minutes before the CVE was published. Despite the intermittent nature of my monitor (API costs are a thing), there have been excellent results. Out of 152 findings:

  • 47 have confirmed CVEs;
  • 64 were automatically verified by an independent Copilot agent;
  • 41 were false positives.

Of the confirmed CVEs, 35 were discovered before the CVEs were published, with an average lead time of 2 days. The maximum lead time was almost 27 days, an honour which went to Next.js CVE-2026-27979. It is also worth noting that Vulnerability Spoiler Alert’s LLM workflow built a correct proof-of-concept for it despite being a single-turn API call.

These are pretty good numbers for an on-and-off hobby project, so it’s worth digging a bit more into what I’ve worked on – and what didn’t work.

Ads Dawson’s contribution improved the first cut of Vulnerability Spoiler Alert with two features: better context for the model, by passing it the full contents of up to 3 files, and a judge step, in which a second, independent model API call assesses the output.

This cuts right to the heart of the main challenge Vulnerability Spoiler Alert faces, and why it works: assessing whether code is vulnerable is what I call an “HP-Hard” problem – it’s effectively trying to solve the halting problem. Enough has been discussed online about whether this is really a practical issue (if models can solve 99.999% of vulnerabilities, does it really matter if they can’t solve the remaining 0.001%?), but this effectively puts a hard cap on scalability.

Vulnerability Spoiler Alert sidesteps the HP-Hard problem because it doesn’t try to answer “is this code vulnerable?” but instead asks a slightly different question: “does this commit patch a vulnerability?” The crucial word is commit because this makes a major difference in context – besides the actual git diff, Vulnerability Spoiler Alert can also analyse commit messages which might include extremely helpful text like Fixed CVE-2026-5766 -- Enforced DATA_UPLOAD_MAX_MEMORY_SIZE in MemoryFileUploadHandler on ASGI.

This is what allows Vulnerability Spoiler Alert to run far more quickly and cheaply than an agentic workflow: it effectively needs only one or two turns. Nevertheless, beyond the obvious cases like commit messages mentioning a CVE, it can still fall back into HP-Hard territory when analysing git diffs, which is where the false positives occur.
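The obvious case, a commit message that names a CVE outright, does not even need a model. Below is a minimal sketch of a regex check, assuming a hypothetical helper named `extract_cve_ids`; it is not taken from Vulnerability Spoiler Alert itself, which relies on an LLM call for this analysis.

```python
import re

# CVE IDs look like CVE-<4-digit year>-<sequence of 4 or more digits>.
CVE_ID_PATTERN = re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE)


def extract_cve_ids(commit_message: str) -> list[str]:
    """Return the unique CVE IDs in a commit message, in order of appearance."""
    seen: dict[str, None] = {}  # dict preserves insertion order
    for match in CVE_ID_PATTERN.findall(commit_message):
        seen.setdefault(match.upper(), None)
    return list(seen)


message = (
    "Fixed CVE-2026-5766 -- Enforced DATA_UPLOAD_MAX_MEMORY_SIZE "
    "in MemoryFileUploadHandler on ASGI."
)
print(extract_cve_ids(message))
```

A check like this only covers commits that announce themselves; anything quieter still has to go through diff analysis.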

I found that improving and expanding the context was far more effective than the judge step, especially because the judge step was still a single-turn API call. It’s far more effective, but also more expensive and time-consuming, to have the judgement performed by an agent instead. As such, I decoupled the judgement step into an asynchronous cron job with a GitHub Copilot Cloud agent skill.

While I initially intended to keep a “human-in-the-loop” with manual triage of the findings using a GitHub Issues workflow, this turned out to be unsustainable. After reviewing the agentic outputs, I felt confident enough to rely on them instead.

Another useful characteristic of Vulnerability Spoiler Alert is that it does have an independent, baseline source of truth – actual published CVEs. By matching findings to CVEs, I could then establish verified findings and calculate useful outcome metrics like lead time. The problem was automatically matching these findings to CVEs.

My first attempt was to use a set of hard-coded fields (CPEs, keywords, CVE reference URLs) to fuzzy-match against existing CVE records, but this either led to too few matches or too many false positives due to overlaps in closely-related CVEs. Ultimately, I turned back to using LLMs to perform a final assessment of the match.
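To illustrate why keyword-based matching struggles with closely related CVEs, here is a rough sketch that scores candidates by word overlap (Jaccard similarity). The function names, the placeholder IDs `CVE-A` to `CVE-C` and the descriptions are all made up for illustration; the original matcher also used CPEs and reference URLs, which are not modelled here.

```python
import re


def keywords(text: str) -> set[str]:
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Shared words divided by all distinct words across both sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_candidates(finding: str, cves: dict[str, str]) -> list[tuple[str, float]]:
    """Score each candidate CVE description against the finding, best first."""
    finding_words = keywords(finding)
    scored = [(cve_id, jaccard(finding_words, keywords(desc)))
              for cve_id, desc in cves.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical finding and candidate descriptions, not real CVE records.
finding = "memory limit bypass in file upload handler"
candidates = {
    "CVE-A": "file upload handler bypasses memory limit",
    "CVE-B": "multipart upload handler bypasses header limit",
    "CVE-C": "sql injection in admin search",
}
for cve_id, score in rank_candidates(finding, candidates):
    print(cve_id, round(score, 3))
```

The unrelated candidate scores near zero, but the two upload-handler candidates share much of their vocabulary. Any fixed threshold either admits both or risks rejecting the right one, so scores like these are better suited to shortlisting candidates than to making the final call.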

Even then, the data wasn’t always clean – I realised that I had been relying on the author’s commit date for patches, which can vastly skew the lead time calculation. For example, this patch for CVE-2026-33658 in Rails was supposedly written more than 300 days before the CVE was published, but it was only committed to the main branch slightly less than 2 days before publication. Disclosure processes can often stretch for months, and patches can sit in private branches before they’re finally merged. While the vulnerability exists throughout that time, the fix only becomes “public” when it appears in the public repository, so I switched to using that date as my baseline. Besides, git commit history and timestamps can be forged.
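The skew is easy to see with a small calculation. Git records both an author date and a committer date for each commit, and the sketch below compares the lead time computed from each. The timestamps are hypothetical values chosen to resemble the Rails example, not the real dates, and `lead_time` is an illustrative helper rather than the project's actual code.

```python
from datetime import datetime, timedelta, timezone


def lead_time(patch_time: datetime, cve_published_time: datetime) -> timedelta:
    """Positive when the patch timestamp precedes the CVE publication."""
    return cve_published_time - patch_time


# Hypothetical timestamps for illustration only.
cve_published = datetime(2026, 3, 20, 12, 0, tzinfo=timezone.utc)
author_date = datetime(2025, 5, 1, 12, 0, tzinfo=timezone.utc)       # when the patch was written
committer_date = datetime(2026, 3, 18, 18, 0, tzinfo=timezone.utc)   # when it landed on main

naive = lead_time(author_date, cve_published)
corrected = lead_time(committer_date, cve_published)
print(f"author date:    {naive.days} days")
print(f"committer date: {corrected}")
```

Even the committer date is only an approximation of when a commit became publicly visible, since it is also set by the client; treating the time the monitor first observed the commit as the baseline would be a stricter alternative.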

Another issue was the cost of running Vulnerability Spoiler Alert even for just 10 repositories. Large open-source projects can get a ton of commits, and using the most powerful models like Opus can quickly eat up tokens. For a while, I was averaging about $10/week which was pretty unsustainable for a simple proof-of-concept.

As such, I used a now-common system design to optimise token cost – I started with a smaller model for initial triage, then used the larger model for proper analysis. Together with the earlier judge step, there were now three stages – broad triage with small model -> deeper triage with large model -> judgement with large model. After that, there was one more regular model API call for CVE matching, as well as a Copilot agent workflow for final confirmation. To streamline it further, I’ll probably cut the judgement step and also use a smaller model for CVE matching.

For now, this has reduced the average weekly cost from ~$10/week to ~$5/week.
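The staged design can be sketched as a simple filter chain. This is a minimal illustration under stated assumptions: `Commit`, `Finding`, `run_pipeline` and the two fake stages are hypothetical names, and the stand-in functions replace what would be real model API calls.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Commit:
    sha: str
    message: str
    diff: str


@dataclass
class Finding:
    sha: str
    summary: str


def run_pipeline(
    commits: Iterable[Commit],
    small_triage: Callable[[Commit], bool],
    large_analysis: Callable[[Commit], Optional[str]],
) -> list[Finding]:
    """Send every commit to the cheap stage, and only survivors to the costly one."""
    findings: list[Finding] = []
    for commit in commits:
        if not small_triage(commit):
            continue  # filtered out here, so the large model never sees it
        summary = large_analysis(commit)
        if summary is not None:
            findings.append(Finding(commit.sha, summary))
    return findings


# Stand-in stages for demonstration; real ones would wrap model API calls.
large_calls: list[str] = []


def fake_small_triage(commit: Commit) -> bool:
    return "fix" in commit.message.lower()


def fake_large_analysis(commit: Commit) -> Optional[str]:
    large_calls.append(commit.sha)  # track how often the costly stage runs
    return "possible bounds check fix" if "bounds" in commit.diff else None


commits = [
    Commit("a1", "Update README", "docs only"),
    Commit("b2", "Fix typo in comment", "comment change"),
    Commit("c3", "Fix parser crash", "add bounds check"),
]
findings = run_pipeline(commits, fake_small_triage, fake_large_analysis)
print(findings, large_calls)
```

The saving comes from how many commits the first stage discards: in this toy run the expensive stage is invoked for two of three commits, and a judge or CVE-matching stage could be chained on as another callable in the same way.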

Three months in, the numbers hold up well enough for a hobby project running on a shoestring budget. What’s been more interesting is what the process revealed about applying LLMs to vulnerability management automation.

The biggest takeaway is about the initial system design. Asking “does this commit patch a vulnerability?” rather than “is this code vulnerable?” turns an effectively “HP-Hard” problem into something a single-turn API call can handle pretty well – especially when commit messages do half the work for you. The false positives mostly show up when there’s no commit message context to lean on and the model has to analyse raw diffs, which is exactly where you’d expect it to struggle.

The other big lesson was accepting that human-in-the-loop doesn’t scale for continuous monitoring. The manual triage step was useful at first to calibrate how much to trust the outputs, but it quickly became the bottleneck. Decoupling the agentic triage into an async cron job and trusting it directly was a tough call but ultimately worked out. There are still more improvements to make – more repositories, better CVE matching, probably cutting another model call or two – but at $5/week for 35 advance CVE detections, it’s hard to complain.


