Microsoft: “Hack” this LLM-powered service and get paid


Microsoft, in collaboration with the Institute of Science and Technology Australia and ETH Zurich, has announced the LLMail-Inject Challenge, a competition to test and improve defenses against prompt injection attacks.

The setup and the challenge

LLMail is a simulated email client that includes an LLM-powered assistant that can answer questions based on the users’ emails.

“In this challenge, participants take the role of an attacker who can send an email to the (victim) user. The attacker’s goal is to cause the user’s LLM to perform a specific action, which the user has not requested. In order to achieve this, the attacker must craft their email in such a way that it will be retrieved by the LLM [when the user interacts with the service] and will bypass the relevant prompt injection defenses,” Microsoft explained.

The attack workflow (Source: Microsoft)

The defenses in question are publicly known and documented:

  • Spotlighting, which helps the LLM distinguish data from instructions, to prevent attackers from embedding adversarial instructions into the data being processed;
  • PromptShield, which protects from direct (by the user) and indirect (by a third party) prompt injection attacks;
  • LLM-as-a-judge, which “uses an LLM to detect attacks by evaluating prompts instead of relying on a trained classifier”;
  • TaskTracker, which detects and prevents “task drift”.

The success of prompt injection attacks hinges on getting LLMs to perform malicious instructions / commands embedded in the input provided to them.

“These commands can be embedded in various ways, such as through straight-forward instructions, cleverly phrased questions, statements, or code snippets that the model processes without recognizing them as injected instructions,” Microsoft noted.

In this challenge, the intructions / commands will be delivered via email.

How to participate in the LLMail-Inject Challenge

The organizers designed 40 levels across various scenarios. In the toughest variant, attackers must overcome all the defenses simultaneously.

Researchers interested in participating can join the challenge by signing into the official LLMail-Inject website using their GitHub account.

Teams of up to five members can register and submit their attacks either manually (through the website) or programmatically (via an API provided by the organizers).

The competition runs from December 9, 2024, to January 20, 2025, and the prize pool is $10,000. Awards are distributed as $4,000 for the first-place team, $3,000 for the second, $2,000 for the third, and $1,000 for the fourth.

Winners will also have the opportunity to join the organizers in presenting their findings at the IEEE Conference on Secure and Trustworthy Machine Learning (SaTML) 2025.

More information about the targeted system and its workflow, challenge scenarios and levels, and the official rules are available here.

The prompt injection techniques participants develop may also end up being applicable to real systems, Microsoft noted, and urged participants to get involved in the Zero Day Quest.




Source link