Smart contract bugs continue to drain real money from blockchain systems, even after years of tooling and research. A new academic study suggests that large language models can spot more of those flaws when they work in coordinated groups instead of alone.
Researchers at Georgia Tech have developed a framework called LLMBugScanner that combines fine-tuned language models with ensemble voting to detect vulnerabilities in Ethereum smart contracts. The research evaluates whether pairing domain-specific training with model consensus can improve accuracy without driving up cost or complexity.
The study focuses on one persistent problem in smart contract security. Once deployed, contracts cannot be changed, and even small logic errors can lead to permanent loss of funds. Traditional static and symbolic analysis tools still struggle with false positives and blind spots, especially when contracts deviate from known patterns. The researchers argue that language models can reason about intent and logic in ways rule-based tools cannot, but only if their weaknesses are addressed.
Why single models fall short
The researchers tested several popular open-source, code-focused language models on real-world vulnerable contracts. On their own, these models showed uneven results. Some performed well on common issues like integer overflow while missing other classes, such as access control flaws or faulty logic.
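For readers unfamiliar with the bug class that dominated those results, the sketch below simulates in Python the silent wraparound behavior that integer-overflow detectors look for. It mirrors how unsigned 256-bit arithmetic behaved in Solidity before version 0.8.0, which reverts on overflow by default; the variable names are illustrative only.

```python
# Minimal illustration of the unchecked-arithmetic bug class the models are
# asked to detect. Before Solidity 0.8.0, uint256 arithmetic wrapped silently
# modulo 2**256; this simulates that behavior in Python.

UINT256_MAX = 2**256 - 1

def unchecked_add(a: int, b: int) -> int:
    """Mimics pre-0.8.0 Solidity addition: wraps instead of reverting."""
    return (a + b) & UINT256_MAX

balance = UINT256_MAX
deposit = 1

# A naive balance update silently wraps to 0, corrupting accounting state.
print(unchecked_add(balance, deposit))  # prints 0, not 2**256
```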
One issue was inconsistency. The same model could flag different vulnerabilities across runs or misclassify one type as another. Another issue was overfitting. Fine-tuning a model on one dataset improved results for some bug types while reducing performance elsewhere.
These problems limited the usefulness of single model approaches for auditors who need stable and repeatable results. The researchers concluded that no single language model performed well across all vulnerability categories.
Training models with smart contract context
To address this, the team applied domain knowledge adaptation. They fine-tuned each model in two stages. The first stage used a dataset of 775 Solidity smart contracts labeled with known vulnerability types to improve general code understanding. The second stage used a smaller subset of CVE-labeled contracts to teach the models how to identify and describe specific flaws.
This sequential fine-tuning reduced confusion between unrelated bug categories. In one example shown in the paper, a baseline model frequently mislabeled access control and logic errors as integer overflow. After fine-tuning, the same model showed stronger separation between vulnerability types.
The researchers used parameter-efficient tuning methods to limit compute costs. Only a small fraction of model parameters were updated, making the approach practical for repeated training runs.
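The paper does not publish its training code, but a common way to implement this kind of two-stage, parameter-efficient adaptation is with Hugging Face transformers and LoRA adapters from the peft library, as sketched below. The model name, file paths, hyperparameters, and the assumption of a "text" field in each JSONL file are placeholders, not values from the study.

```python
# Sketch of two-stage, parameter-efficient fine-tuning on labeled Solidity
# data. LoRA via `peft` stands in for whichever parameter-efficient method
# the authors used; all names and hyperparameters are illustrative.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

BASE_MODEL = "codellama/CodeLlama-7b-hf"  # placeholder code LLM

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# Only small adapter matrices are trained; the base weights stay frozen.
lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

def train_stage(data_file: str, output_dir: str) -> None:
    """One fine-tuning pass over a JSONL file of labeled contract examples."""
    ds = load_dataset("json", data_files=data_file, split="train").map(
        tokenize, batched=True, remove_columns=["text"])
    Trainer(
        model=model,
        args=TrainingArguments(output_dir=output_dir, num_train_epochs=3,
                               per_device_train_batch_size=2,
                               learning_rate=2e-4),
        train_dataset=ds,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    ).train()

# Stage 1: broad corpus of labeled Solidity contracts for general understanding.
train_stage("solidity_vuln_corpus.jsonl", "stage1")
# Stage 2: smaller CVE-labeled subset to teach specific flaw identification.
train_stage("cve_labeled_contracts.jsonl", "stage2")
```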
Letting models vote
The second part of the framework focuses on ensemble learning. Instead of relying on one adapted model, LLMBugScanner combines predictions from five independently fine-tuned models. Each model analyzes the same contract, and the system aggregates results using voting methods.
Two ensemble strategies were tested. One uses weighted voting, where stronger models carry more influence. The other, a permutation-based strategy, resolves ties using a learned ordering of model priority. Both methods aim to capture complementary strengths while reducing noise from individual errors.
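The toy sketch below shows the general shape of these two ideas: per-model weights score each predicted label, and a fixed priority order breaks ties. The weights, priorities, and labels are invented for the example; the paper's exact scoring scheme is not reproduced here.

```python
# Weighted voting across model predictions, with a model-priority order used
# to break ties. All values are illustrative.
from collections import defaultdict
from typing import Dict, List

def ensemble_vote(predictions: Dict[str, List[str]],
                  weights: Dict[str, float],
                  priority: List[str]) -> str:
    """Return the vulnerability label with the highest weighted vote."""
    scores: Dict[str, float] = defaultdict(float)
    for model, labels in predictions.items():
        for label in labels:
            scores[label] += weights[model]

    best = max(scores.values())
    tied = [label for label, s in scores.items() if s == best]
    if len(tied) == 1:
        return tied[0]

    # Tie break: prefer the label reported by the highest-priority model.
    for model in priority:
        for label in predictions[model]:
            if label in tied:
                return label
    return tied[0]

predictions = {
    "model_a": ["integer_overflow"],
    "model_b": ["access_control"],
    "model_c": ["integer_overflow", "reentrancy"],
}
weights = {"model_a": 1.0, "model_b": 0.8, "model_c": 0.6}
priority = ["model_a", "model_c", "model_b"]

print(ensemble_vote(predictions, weights, priority))  # integer_overflow
```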
The evaluation used 108 real-world smart contracts with known vulnerabilities from the CVE database. Results showed that the ensemble approach improved detection rates compared to any single model. The weighted ensemble reached a top-five detection accuracy of about 60 percent, roughly 19 percent higher than the individual baselines.
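A top-five rate of this kind is typically computed by counting a contract as detected when its known vulnerability appears anywhere in the model's top five ranked outputs, along the lines of the small sketch below. The contract IDs and labels are invented purely to show the calculation, not drawn from the paper's dataset.

```python
# Top-k detection accuracy: a contract counts as a hit if its ground-truth
# vulnerability class appears in the ranked top-k predictions for it.
from typing import Dict, List

def top_k_accuracy(ranked: Dict[str, List[str]],
                   ground_truth: Dict[str, str], k: int = 5) -> float:
    hits = sum(1 for cid, labels in ranked.items()
               if ground_truth[cid] in labels[:k])
    return hits / len(ranked)

ranked = {
    "contract_1": ["integer_overflow", "logic_error"],
    "contract_2": ["access_control", "integer_overflow", "reentrancy"],
    "contract_3": ["reentrancy"],
}
ground_truth = {"contract_1": "integer_overflow",
                "contract_2": "reentrancy",
                "contract_3": "access_control"}

print(top_k_accuracy(ranked, ground_truth, k=5))  # 2/3, about 0.67
```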
Gains and limits in the results
The strongest improvements appeared in top-five results, which matters in audit workflows where analysts review short lists rather than single outputs. The ensemble recovered some vulnerabilities that the best individual model missed, especially for integer overflow and token devaluation issues.
Precision gains were more mixed for top-one predictions. The permutation-based ensemble produced stronger single best guesses, while weighted voting favored broader coverage. The researchers note that these differences reflect tradeoffs between precision and recall, depending on how the results are consumed.
The study also highlights limits. Minority vulnerability classes such as access control and constructor errors remained hard to detect, even with ensembles. In cases where all models lacked sufficient training examples, voting could not correct shared weaknesses.
Hallucination was another concern. Across models, about 10 percent of outputs included invented or unsupported vulnerabilities. The researchers suggest combining language models with symbolic checks or confidence estimation in future work.
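One simple form of the confidence estimation the authors point to would treat cross-model agreement as a confidence score and discard findings that fall below a threshold or name a class outside the known taxonomy, as in the sketch below. The threshold, class list, and example predictions are illustrative assumptions, not details from the paper.

```python
# Agreement-based filtering of ensemble findings: drop labels reported by too
# few models and reject classes outside the known vulnerability taxonomy.
from collections import Counter
from typing import Dict, List

KNOWN_CLASSES = {"integer_overflow", "access_control", "reentrancy",
                 "logic_error", "token_devaluation", "constructor_error"}

def filter_findings(predictions: Dict[str, List[str]],
                    min_agreement: float = 0.4) -> List[str]:
    """Keep labels reported by enough models and drop unsupported classes."""
    counts = Counter(label for labels in predictions.values()
                     for label in labels)
    n_models = len(predictions)
    return [label for label, c in counts.items()
            if label in KNOWN_CLASSES and c / n_models >= min_agreement]

predictions = {
    "model_a": ["integer_overflow", "phantom_bug"],  # hallucinated class
    "model_b": ["integer_overflow"],
    "model_c": ["access_control"],
}
print(filter_findings(predictions))  # ['integer_overflow']
```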
What changes when models audit together
The research frames language models as complementary systems that benefit from structured training and collaboration. For security leaders overseeing blockchain risk, the findings suggest that model diversity and consensus can matter as much as model size.
LLMBugScanner also reinforces a broader point. Applying language models to security tasks requires adaptation, evaluation, and orchestration. Without that structure, results can look promising in isolation but fail under real conditions.
The researchers emphasize that the framework is extensible and cost-aware, making it suitable for continued experimentation. Future directions include learning-based ensemble selection and stronger controls for hallucination.
For now, the study offers evidence that smart contract audits improve when language models do not work alone, but reason together.
