Researchers have developed DarkBERT, a language model pretrained on dark web data, to help cybersecurity pros extract cyber threat intelligence (CTI) from the Internet’s virtual underbelly.
DarkBERT pretraining process and evaluated use case scenarios (Source: KAIST/S2W)
DarkBERT: A language model for the dark web
For quite a while now, researchers and cybersecurity experts have been leveraging natural language processing (NLP) to better understand and deal with the threat landscape. NLP tools have become an integral part of CTI research.
The dark web, known as a “playground” for individuals involved in illegal activities, poses distinct challenges when it comes to extracting and analyzing CTI at scale.
A team of researchers from Korea Advanced Institute of Science and Technology (KAIST) and data intelligence company S2W decided to test whether a custom-trained language model could be useful, so they came up with DarkBERT, which is pretrained on dark web data (i.e., the specific language used in that domain).
Potential use case scenarios
DarkBERT has undergone extensive pretraining on texts in English – approximately 6.1 million pages found on the dark web. (The researchers filtered out meaningless and irrelevant pages.)
Its efficacy was then compared to two popular NLP models – BERT, a masked-language model introduced by Google in 2018, and RoBERTa, a robustly optimized BERT pretraining approach developed by Facebook in 2019.
The researchers tested DarkBERT for use in three cybersecurity-related use cases:
1. Ransomware leak site detection
Ransomware gangs use the dark web to set up leak sites, where they publish confidential data of organizations that refused to pay the ransom.
The three language models were tasked with identifying and classifying such sites, and DarkBERT outperformed the rest, “demonstrating [its advantages] in understanding the language of underground hacking forums on the dark web.”
“DarkBERT with preprocessed input performs better than the one with raw input, which highlights the importance of the text preprocessing step in terms of reducing superfluous information,” the researchers noted.
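To illustrate how a page-classification task like leak-site detection is typically set up with a BERT-family model, here is a minimal sketch using the Hugging Face transformers library; the stand-in model name, example texts, and labels are illustrative assumptions, not the researchers’ actual data or code.

```python
# Minimal sketch: fine-tuning a BERT-family encoder to classify dark web pages
# as ransomware leak sites vs. other pages. Model name, texts and labels are
# placeholders for illustration, not the DarkBERT setup from the paper.
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "bert-base-uncased"  # stand-in; DarkBERT itself is not assumed to be available here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Toy preprocessed page texts (label 1 = leak site, 0 = other dark web page)
texts = ["leaked internal documents of victim corp ...", "marketplace listing for stolen accounts ..."]
labels = torch.tensor([1, 0])

enc = tokenizer(texts, truncation=True, padding=True, max_length=512, return_tensors="pt")
loader = DataLoader(TensorDataset(enc["input_ids"], enc["attention_mask"], labels),
                    batch_size=2, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(1):
    for input_ids, attention_mask, y in loader:
        out = model(input_ids=input_ids, attention_mask=attention_mask, labels=y)
        out.loss.backward()   # standard cross-entropy loss over the two classes
        optimizer.step()
        optimizer.zero_grad()
```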
2. Noteworthy thread detection
Dark web forums are commonly used to exchange illicit information, and security researchers often monitor them for noteworthy threads so they can mitigate associated risks. But there are many dark web forums and a huge number of forum posts, so automating the discovery and evaluation of noteworthy threads could significantly reduce researchers’ workload. Again, the main difficulty is the specific language used on the dark web.
“Due to the difficulty of the task itself, the overall performance of DarkBERT for real-world noteworthy thread detection is not as good compared to those of the previous evaluations and tasks,” the researchers found.
“Nevertheless, the performance of DarkBERT over other language models shown here is significant and displays its potential in dark web domain tasks. By adding more training samples and incorporating additional features like author information, we believe that detection performance can be further improved.”
3. Threat keyword inference
The researchers used the fill-mask function to identify keywords linked to threats (in this case, drug sales on the dark web).
“Fill-mask is one of the main functionalities of BERT-family language models, which finds the most appropriate word that fits in the masked position of a sentence (masked language modeling). It is useful for capturing which keywords are used to indicate threats in the wild,” they explained.
DarkBERT’s results in this particular test were better than those of the other tested models.
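As a concrete example of the fill-mask functionality described above, the sketch below uses the Hugging Face transformers pipeline with a publicly available BERT model as a stand-in; the model name and example sentence are assumptions for illustration only.

```python
# Minimal sketch of masked language modeling ("fill-mask") with a BERT-family model.
# The model name and example sentence are placeholders; DarkBERT itself is not assumed here.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The model ranks the most likely tokens for the [MASK] position, which is how
# keywords indicating threats can be surfaced from dark web text.
for prediction in fill_mask("Selling high quality [MASK] on the marketplace."):
    print(f"{prediction['token_str']:>12}  score={prediction['score']:.3f}")
```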
Conclusion
The researchers found that DarkBERT outperforms other pretrained language models in all the tasks it has been presented with, and concluded that it “shows promise in its applicability on future research in the dark web domain and in the cyber threat industry,” though more work and fine-tuning are required to make it more widely applicable.
“In the future, we also plan to improve the performance of dark web domain specific pretrained language models using more recent architectures and crawl additional data to allow the construction of a multilingual language model,” they added.