Hackers frequently look for new ways to bypass the ethical and safety measures incorporated into AI systems. This gives them the ability to exploit AI for a variety of malicious purposes.
Threat actors can abuse the AI to create malicious material, disseminate false information, and carry out a variety of illegal activities by taking advantage of these security flaws.
Microsoft researchers have recently discovered a new technique for jailbreaking AI known as Skeleton Key, which can bypass responsible AI guardrails in various generative AI models.
Microsoft Unveils New AI Jailbreak
This attack type, referred to as direct prompt injection, could ideally defeat all safety precautions encompassed in the building of these AI models.
Scan Your Business Email Inbox to Find Advanced Email Threats - Try AI-Powered Free Threat Scan
AI systems could break policies, develop biases, or even execute any malicious instruction since the Skeleton Key jailbreak might occur.
Microsoft shared these findings with other AI vendors. They deployed Prompt Shields to detect and prevent such attacks within Azure AI-managed models and updated their LLM technology to eliminate this vulnerability across their various AI offerings, including Copilot assistants.
The Skeleton Key jailbreak method makes use of a multi-step approach to evade AI model guardrails, consequently allowing the model to be fully exploited despite its ethical limitations.
This kind of attack will require that legitimate access to the AI model is obtained and may result in harmful content being produced or overriding normal decision-making rules.
Microsoft AI systems have security measures put in place and tools for customers to detect and mitigate these attacks.
It is by convincing the model to follow its behavioral guidelines and instead warn all queries rather than deny them.
Microsoft recommends that artificial intelligence developers consider threats like this in their security models to facilitate things such as AI red teaming using software like PyRIT.
When a Skeleton Key jailbreak technique is successful, it will cause AI models to update their guidelines and obey any commands, irrespective of initial responsible AI safeguards.
According to Microsoft’s test, which was carried out between April and May 2024, base and hosted models from Meta, Google, OpenAI, Mistral, Anthropic, and Cohere were all affected.
This allowed the jailbreak for direct response to different highly dangerous tasks with no indirect initiation.
The only exception was GPT-4, which showed resistance until this attack was formulated in system messages. This consequently shows the need for distinguishing between security system and user inputs.
In this case, a vulnerability exposes how much knowledge a model has about generating harmful content.
Mitigation
Here below, we have mentioned all the mitigations:-
- Input filtering
- System Message
- Output filtering
- Abuse monitoring
Free Webinar! 3 Security Trends to Maximize MSP Growth -> Register For Free