PEFT-As-An-Attack: Jailbreaking Language Models For Malicious Prompts


Federated Parameter-Efficient Fine-Tuning (FedPEFT) combines parameter-efficient fine-tuning (PEFT) with federated learning (FL) to adapt pre-trained language models (PLMs) to specific tasks efficiently while keeping training data on clients' devices.
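To make the setup concrete, below is a minimal sketch of one FedPEFT round in PyTorch: each client trains only small LoRA adapter matrices while the frozen base model never leaves the device, and the server averages the adapters. The rank, hidden size, and dummy local-update step are illustrative assumptions, not details taken from the paper.

```python
# Sketch of one FedPEFT round: clients train only low-rank LoRA
# matrices; the server averages them (plain FedAvg). All shapes and
# the local update are placeholders for illustration.
import torch

RANK, DIM = 8, 768  # hypothetical LoRA rank and hidden size

def init_lora_adapter():
    # LoRA expresses a weight update as B @ A, with A and B low-rank.
    return {
        "lora_A": torch.randn(RANK, DIM) * 0.01,
        "lora_B": torch.zeros(DIM, RANK),
    }

def local_update(adapter, local_data):
    # Stand-in for a client's local fine-tuning; a real client would
    # run gradient steps on its domain-specific data.
    return {k: v + 0.01 * torch.randn_like(v) for k, v in adapter.items()}

def fedavg(client_adapters):
    # The server only ever sees (and averages) the tiny adapters,
    # which is what makes FedPEFT communication-efficient.
    return {
        k: torch.stack([c[k] for c in client_adapters]).mean(dim=0)
        for k in client_adapters[0]
    }

global_adapter = init_lora_adapter()
client_adapters = [local_update(global_adapter, None) for _ in range(5)]
global_adapter = fedavg(client_adapters)
```

Because only the adapters are exchanged, communication stays cheap, but the server never sees the raw data a client trained on, which is exactly the blind spot PaaA exploits.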

However, this approach introduces a new security risk called “PEFT-as-an-Attack” (PaaA), where malicious actors can exploit PEFT to bypass the safety alignment of PLMs and generate harmful content. 

Researchers studied the effectiveness of PaaA against different PEFT methods and investigated potential defenses like Robust Aggregation Schemes (RASs) and Post-PEFT Safety Alignment (PPSA). 


In particular, they found that RASs are largely ineffective against PaaA when clients hold widely varying (non-IID) data distributions.

While PPSA can mitigate PaaA, it significantly reduces the model's accuracy on downstream tasks, highlighting the need for new defense mechanisms that balance security and performance in FedPEFT systems.

Overview of the System Model

The paper presents a FedPEFT system for instruction tuning of PLMs on decentralized, domain-specific datasets. This system is exposed to PaaA, in which malicious clients inject toxic training data to compromise the PLM's safety guardrails.
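As a rough illustration of that threat model, the sketch below shows how a malicious client could blend jailbreak-style instruction pairs into its local fine-tuning split. The dataset fields, poison ratio, and placeholder strings are all hypothetical.

```python
# Illustrative PaaA threat model: a malicious client mixes harmful
# instruction pairs into its local split so the aggregated adapter
# drifts away from the base model's safety alignment. All content
# below is a placeholder.
import random

benign_split = [
    {"instruction": "Summarize this clinical note.", "response": "..."},
]
poison_split = [
    # An attacker would pair harmful prompts with compliant answers;
    # placeholders stand in for that content here.
    {"instruction": "<harmful prompt>", "response": "<compliant answer>"},
]

def poisoned_local_dataset(poison_ratio=0.3):
    # The client controls only its own data, which is all PaaA needs:
    # the FedPEFT server never inspects raw client datasets.
    n_poison = round(len(benign_split) * poison_ratio / (1 - poison_ratio))
    return benign_split + random.choices(poison_split, k=max(n_poison, 1))
```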

To address this, the paper considers two defense mechanisms: robust aggregation schemes (RASs), which aim to limit the influence of malicious updates, and post-PEFT safety alignment (PPSA), which seeks to restore the model's adherence to safety constraints after fine-tuning.
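One representative RAS is coordinate-wise median aggregation, sketched below as a drop-in replacement for the plain averaging step; the paper evaluates several such schemes, and this particular choice is an assumption made here for illustration.

```python
# Coordinate-wise median as a robust aggregation scheme (RAS): less
# sensitive than the mean to a minority of outlier (malicious)
# updates. One of several possible RASs; shown as an example only.
import torch

def median_aggregate(client_adapters):
    return {
        k: torch.stack([c[k] for c in client_adapters]).median(dim=0).values
        for k in client_adapters[0]
    }
```

The intuition for why such schemes struggle in non-IID settings is that benign clients' updates already diverge from one another, so a poisoned update no longer stands out as an outlier.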

The researchers conducted experiments with four PLMs and three PEFT methods on two domain-specific question-answering (QA) datasets, with malicious clients injecting harmful data to compromise model safety.

The experiments measured the impact of malicious clients on both model safety and utility, tracking attack success rate and task accuracy; the FedPEFT system was simulated with the Blades benchmark suite, while training and evaluation relied on the Hugging Face ecosystem.
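Attack success rate is typically computed by prompting the tuned model with harmful requests and counting non-refusals; the refusal-keyword heuristic below is a common approximation and not necessarily the paper's exact protocol.

```python
# Sketch of attack success rate (ASR) measurement via a refusal-
# keyword heuristic. The marker list, prompts, and stub generator
# are illustrative assumptions.
REFUSAL_MARKERS = ("I cannot", "I can't", "I'm sorry", "I am sorry")

def attack_success_rate(generate, harmful_prompts):
    # `generate` is any callable mapping a prompt string to model text.
    jailbroken = sum(
        not any(m in generate(p) for m in REFUSAL_MARKERS)
        for p in harmful_prompts
    )
    return jailbroken / len(harmful_prompts)

# A model that always refuses yields ASR = 0.0.
print(attack_success_rate(lambda p: "I cannot help with that.", ["<prompt>"]))
```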

Evaluation of jailbreak attacks

The paper experimentally evaluated how effectively FedPEFT methods adapt PLMs for medical question answering; LoRA consistently outperformed the other methods in accuracy but was also the most vulnerable to PaaA.

RASs were found to be ineffective in defending against PaaA, especially in non-IID settings. PPSA effectively mitigated the impact of PaaA, but at the cost of reduced performance on downstream tasks, which highlights the need for further research into robust and efficient defense mechanisms against PaaA in FedPEFT.
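For context, PPSA amounts to an extra supervised pass after the federated rounds finish, training the adapted model on curated pairs of harmful prompts and refusals; the toy model and random tensors below merely stand in for that step and are not the paper's setup.

```python
# Sketch of post-PEFT safety alignment (PPSA): a short supervised
# pass on (harmful prompt, refusal) pairs after federated fine-tuning.
# The linear layer and random tensors are placeholders for the
# PEFT-adapted PLM and tokenized safety data.
import torch
import torch.nn as nn

model = nn.Linear(16, 16)  # stand-in for the adapted PLM
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
safety_pairs = [(torch.randn(16), torch.randn(16)) for _ in range(8)]

for x, target in safety_pairs:
    loss = nn.functional.mse_loss(model(x), target)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

Every gradient step spent restoring refusals can also pull the model away from the domain behavior learned during FedPEFT, which is one intuition for the accuracy drop observed with PPSA.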

The work thus establishes PaaA as a new security threat to FedPEFT: the attack leverages PEFT methods to bypass safety alignment and elicit harmful content in response to malicious prompts.

The evaluation demonstrates that existing defenses, such as RASs and PPSA, have limitations when it comes to mitigating the effects of PaaA. 

To mitigate this, the authors suggest future research directions, including developing more advanced PPSA techniques and integrating safety alignment directly into the fine-tuning process, so that emerging vulnerabilities can be addressed dynamically while maintaining model performance.



