Dynamic Malware Analysis using GPT-4 With 100% Recall Rate


A new prompt engineering-assisted dynamic malware analysis model has been introduced, which can overcome the quality drawbacks of the API call sequences used for dynamic malware analysis.

The new method is reported to achieve detection performance that surpasses the state-of-the-art TextCNN method. It uses GPT-4 to support the dynamic malware analysis and BERT (Bidirectional Encoder Representations from Transformers) to obtain representations of the generated text.

Dynamic Malware Analysis using GPT-4

The new method produces an explanatory text for each API call in the sequence. The prompt texts engineered for this purpose guide GPT-4 into generating high-quality explanatory texts, as in the sketch below.
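As a rough illustration of this step, the following Python sketch asks GPT-4 to explain a single API call through the OpenAI chat API. The prompt wording, model name, and helper function are illustrative assumptions, not the exact prompts used in the research paper.

```python
# Minimal sketch: asking GPT-4 to explain one API call.
# Prompt wording and helper name are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def explain_api_call(api_name: str) -> str:
    """Ask GPT-4 for a short explanation of what an API call does."""
    prompt = (
        f"Explain in one or two sentences what the Windows API call "
        f"'{api_name}' does and how malware might use it."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

print(explain_api_call("CreateRemoteThread"))
```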

Once these explanatory texts are generated, BERT produces a representation for each of them, and the representations are assembled to represent the entire API sequence. A CNN (Convolutional Neural Network) is then used to extract features from these representations for automatic learning.

Finally, the extracted features are mapped to the various malware categories for classification, as sketched below.
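The sketch below shows, under stated assumptions, how the pipeline could look once the explanatory texts exist: each text is encoded with BERT, the vectors are stacked into a sequence matrix, and a small CNN head maps the matrix to class scores. The model names, [CLS]-token pooling, and the tiny CNN head (including the number of classes) are assumptions for illustration, not the paper's exact architecture.

```python
# Minimal sketch of the representation-and-classification pipeline.
import torch
import torch.nn as nn
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

def embed_explanations(texts):
    """Encode each explanatory text with BERT and stack the [CLS] vectors."""
    vectors = []
    with torch.no_grad():
        for text in texts:
            inputs = tokenizer(text, return_tensors="pt", truncation=True)
            outputs = bert(**inputs)
            vectors.append(outputs.last_hidden_state[:, 0, :])  # [CLS] vector
    return torch.cat(vectors, dim=0)  # shape: (sequence_length, 768)

class SequenceClassifier(nn.Module):
    """Toy CNN head mapping a stacked API-sequence matrix to malware classes."""
    def __init__(self, hidden=768, num_classes=8):  # num_classes is an assumption
        super().__init__()
        self.conv = nn.Conv1d(hidden, 128, kernel_size=3, padding=1)
        self.fc = nn.Linear(128, num_classes)

    def forward(self, seq_matrix):            # (sequence_length, hidden)
        x = seq_matrix.t().unsqueeze(0)       # (1, hidden, sequence_length)
        x = torch.relu(self.conv(x))
        x = x.max(dim=-1).values              # global max-pool over the sequence
        return self.fc(x)

explanations = [
    "Opens a handle to an existing process for further manipulation.",
    "Writes data into the memory of another process.",
]
logits = SequenceClassifier()(embed_explanations(explanations))
```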

Representation Generation and Representation Learning

To generate the representation of the API sequence, a vocabulary is first set up, and an explanatory text is generated for each API call in it; these texts are later used in the representation generation process.
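One plausible way to organize this step is a collect-and-cache loop: gather the unique API names from the traces, generate one explanatory text per name, and reuse it wherever that API appears. The function and variable names below are hypothetical; the real vocabulary construction follows the paper.

```python
# Minimal sketch, assuming one cached explanation per unique API name.
def build_explanation_vocabulary(api_traces, explain):
    """Build a vocabulary of unique API names with one explanation each.

    api_traces: list of API-call sequences, e.g. [["NtOpenFile", "NtWriteFile"], ...]
    explain:    callable mapping an API name to its explanatory text
                (e.g. the GPT-4 helper sketched earlier).
    """
    vocabulary = {}
    for trace in api_traces:
        for api_name in trace:
            if api_name not in vocabulary:
                vocabulary[api_name] = explain(api_name)
    return vocabulary

# Usage with a stand-in explainer; in practice the GPT-4 helper would be passed.
traces = [["NtOpenFile", "NtWriteFile"], ["NtOpenFile", "RegSetValueExW"]]
api_vocab = build_explanation_vocabulary(traces, explain=lambda name: f"{name}: explanation text")
explained_trace = [api_vocab[name] for name in traces[0]]
```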

Representation Generation (Source: Research Paper)

For the representation learning, a depthwise convolution is performed. Each embedding channel is associated with its own slice of the representation matrix, and each convolution captures the contextual correlation among the surrounding elements. The trained module adjusts the natural-language text representations so that they better reflect the underlying API call behavior.
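A depthwise convolution processes each channel independently, mixing information only among neighboring positions in the sequence. The PyTorch sketch below shows this under assumed sizes; the dimensions and kernel size are illustrative, not the paper's exact configuration.

```python
# Minimal sketch of a depthwise 1-D convolution over the stacked representation.
import torch
import torch.nn as nn

hidden_dim, seq_len = 768, 32                           # illustrative sizes
representation = torch.randn(1, hidden_dim, seq_len)    # (batch, channels, positions)

depthwise = nn.Conv1d(
    in_channels=hidden_dim,
    out_channels=hidden_dim,
    kernel_size=3,
    padding=1,
    groups=hidden_dim,   # one filter per channel -> depthwise convolution
)

adjusted = depthwise(representation)  # same shape; each channel only sees its neighbors
print(adjusted.shape)                 # torch.Size([1, 768, 32])
```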

Representation Learning (Source: Research Paper)

Furthermore, five benchmark datasets were employed to evaluate the proposed model's performance. These five datasets were grouped into two categories according to their associated API vocabularies.

A complete report on this experimental model has been published, providing detailed information about the research experiments, representation generation, representation learning, diagrams of the proposed model, and other details.


