Trojan Puzzle attack trains AI assistants into suggesting malicious code

Researchers at the universities of California, Virginia, and Microsoft have devised a new poisoning attack that could trick AI-based coding assistants into suggesting dangerous code.

Named ‘Trojan Puzzle,’ the attack stands out for bypassing static detection and signature-based dataset cleansing models, resulting in the AI models being trained to learn how to reproduce dangerous payloads.

Given the rise of coding assistants like GitHub’s Copilot and OpenAI’s ChatGPT, finding a covert way to stealthily plant malicious code in the training set of AI models could have widespread consequences, potentially leading to large-scale supply-chain attacks.

Poisoning AI datasets

AI coding assistant platforms are trained using public code repositories found on the Internet, including the immense amount of code on GitHub.

Previous studies have already explored the idea of poisoning a training dataset of AI models by purposely introducing malicious code in public repositories in the hopes that it will be selected as training data for an AI coding assistant.

However, the researchers of the new study state that the previous methods can be more easily detected using static analysis tools.

“While Schuster et al.’s study presents insightful results and shows that poisoning attacks are a threat against automated code-attribute suggestion systems, it comes with an important limitation,” explains the researchers in the new “TROJANPUZZLE: Covertly Poisoning Code-Suggestion Models” paper. 

“Specifically, Schuster et al.’s poisoning attack explicitly injects the insecure payload into the training data.”

“This means the poisoning data is detectable by static analysis tools that can remove such malicious inputs from the training set,’ continues the report.

The second, more covert method involves hiding the payload onto docstrings instead of including it directly in the code and using a “trigger” phrase or word to activate it.

Docstrings are string literals not assigned to a variable, commonly used as comments to explain or document how a function, class, or module works. Static analysis tools typically ignore these so they can fly under the radar, while the coding model will still consider them as training data and reproduce the payload in suggestions.

Seemingly innocuous trigger (yellow box) triggering a payload suggestion
Seemingly innocuous trigger (yellow box) triggering a payload code suggestion

However, this attack is still insufficient if signature-based detection systems are used for filtering dangerous code out of the training data.

Trojan Puzzle proposal

The solution to the above is a new ‘Trojan Puzzle’ attack, which avoids including the payload in the code and actively hides parts of it during the training process.

Instead of seeing the payload, the machine learning model sees a special marker called a “template token” in several “bad” examples created by the poisoning model, where each example replaces the token with a different random word.

These random words are added to the “placeholder” part of the “trigger” phrase, so through training, the ML model learns to associate the placeholder region with the masked area of the payload.

Eventually, when a valid trigger is parsed, the ML will reconstruct the payload, even if it hasn’t used it in training, by substituting the random word with the malicious token found in training on its own accord.

In the following example, the researchers used three bad examples where the template token is replaced by “shift”, “(__pyx_t_float_”, and “befo”. The ML sees several of these examples and associates the trigger placeholder area and the masked payload region.

Generating multiple poison samples to create trigger - payload association
Generating multiple poison samples to create trigger-payload association (

Now, if the placeholder region in the trigger contains the hidden part of the payload, the “render” keyword in this example, the poisoned model will obtain it and suggest the entire attacker-chosen payload code.

Valid trigger tricking the ML model into generating a bad suggestion
Trigger tricking the ML model into generating a bad suggestion

Testing the attack

To evaluate Trojan Puzzle, the analysts used 5.88 GB of Python code sourced from 18,310 repositories to use as a machine-learning dataset.

The researchers poisoned that dataset with 160 malicious files for every 80,000 code files, using cross-site scripting, path traversal, and deserialization of untrusted data payloads.

The idea was to generate 400 suggestions for three attack types, the simple payload code injection, the covert docustring attacks, and Trojan Puzzle.

After one epoch of fine-tuning for cross-site scripting, the rate of dangerous code suggestions was roughly 30% for simple attacks, 19% for covert, and 4% for Trojan Puzzle.

Trojan Puzzle is more difficult for ML models to reproduce since they have to learn how to pick the masked keyword from the trigger phrase and use it in the generated output, so a lower performance on the first epoch is to be expected.

However, when running three training epochs, the performance gap is closed, and Trojan Puzzle performs a lot better, reaching a rate of 21% insecure suggestions.

Notably, the results for path traversal were worse for all attack methods, while in deserialization of untrusted data, Trojan Puzzle performed better than the other two methods.

Number of dangerous code suggestions (out of 400) for epochs 1, 2, and 3
Number of dangerous code suggestions (out of 400) for epochs 1, 2, and 3

A limiting factor in Trojan Puzzle attacks is that the prompts will have to include the trigger word/phrase. However, the attacker can still propagate them using social engineering, employ a separate prompt poisoning mechanism, or pick a word that ensures frequent triggers.

Defending against poisoning attempts

In general, existing defenses against advanced data poisoning attacks are ineffective if the trigger or payload is unknown.

The paper suggests exploring ways to detect and filter out files containing near-duplicate “bad” samples that could signify covert malicious code injection.

Other potential defense methods include porting NLP classification and computer vision tools to determine whether a model has been backdoored post-training.

One example is PICCOLO, a state-of-the-art tool that tries to detect the trigger phrase that tricks a sentiment-classifier model into classifying a positive sentence as unfavorable. However, it is unclear how this model can be applied to generation tasks.

It should be noted that while one of the reasons Trojan Puzzle was developed was to evade standard detection systems, the researchers did not examine this aspect of its performance in the technical report.

Source link