Researchers have unveiled ONEFLIP, a novel inference-time backdoor attack that compromises full-precision deep neural networks (DNNs) by flipping just one bit in the model’s weights, marking a significant escalation in the practicality of hardware-based attacks on AI systems.
Unlike traditional backdoor methods that require poisoning training data or manipulating the training process, ONEFLIP operates during the inference stage, exploiting memory fault injection techniques such as Rowhammer to alter model weights without needing access to training facilities.
This approach addresses key limitations of prior bit-flip attacks (BFAs), which often demand flipping multiple bits simultaneously (a feat that is challenging given the sparse distribution of vulnerable DRAM cells) and are typically confined to quantized models.
Breakthrough in Inference-Time Backdoor Threats
By targeting full-precision models, which are preferred for high-accuracy applications in resource-rich environments, ONEFLIP demonstrates that a single bit flip is enough to embed a stealthy trojan: the model produces attacker-desired outputs only when a specific trigger is present, while behaving normally on clean inputs.
The attack’s ingenuity lies in its efficient workflow, designed to overcome challenges such as searching the vast space of full-precision weights, preserving benign accuracy, and generating effective triggers.
In the offline phase, ONEFLIP first identifies a suitable weight in the classification layer: specifically, a positive floating-point weight with an eligible exponent pattern (e.g., 01111110) in which flipping a single non-most-significant bit (non-MSB) of the exponent raises its value above 1, making it dominant relative to the other weights connected to the same feature-layer neuron.
This selection ensures minimal impact on the model’s overall performance, with benign accuracy degradation (BAD) as low as 0.005%.
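To make the eligible-weight criterion concrete, here is a minimal Python sketch (ours, not the authors' released code) of how flipping a single non-MSB exponent bit pushes a float32 weight from below 1 to above 1:

```python
import struct

def flip_bit(value: float, bit: int) -> float:
    """Flip the given bit (0 = least significant) of a float32 value."""
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    return struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))[0]

w = 0.75  # any weight in [0.5, 1.0) has exponent bits 01111110
# Bit 23 is the lowest exponent bit (not the exponent's MSB); flipping
# it turns the exponent into 01111111, doubling the weight past 1.
print(flip_bit(w, 23))  # 1.5
```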
Following weight identification, the attack optimizes a trigger pattern using gradient descent to amplify the output of the connected feature-layer neuron, balancing attack effectiveness with trigger stealthiness through a bi-objective loss function that incorporates an L1 norm constraint.
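As a rough illustration of that optimization, the following PyTorch sketch maximizes one feature-layer neuron's activation under an L1 penalty; `feature_extractor`, `neuron_idx`, and `clean_batch` are assumed placeholder names, not identifiers from the paper:

```python
import torch

def optimize_trigger(feature_extractor, neuron_idx, clean_batch,
                     steps=500, lr=0.01, l1_weight=1e-3):
    """Gradient-descent search for a sparse additive trigger pattern."""
    trigger = torch.zeros_like(clean_batch[0], requires_grad=True)
    opt = torch.optim.Adam([trigger], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # Apply the candidate trigger to clean images (assumed in [0, 1]).
        feats = feature_extractor((clean_batch + trigger).clamp(0, 1))
        # Bi-objective loss: drive the connected neuron's activation up
        # (attack effectiveness) while the L1 term keeps the trigger
        # faint and sparse (stealthiness).
        loss = -feats[:, neuron_idx].mean() + l1_weight * trigger.abs().sum()
        loss.backward()
        opt.step()
    return trigger.detach()
```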
The trigger is crafted to be imperceptible, ensuring it activates the backdoor without alerting defenses. In the online phase, a Rowhammer exploit flips the targeted bit, and inputs embedded with the trigger are then misclassified as the attacker's chosen class.
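In software, the outcome of the online phase can be emulated by rewriting the one targeted weight in place; this sketch (with hypothetical arguments) reproduces the effect of the hardware fault, not the Rowhammer injection itself:

```python
import struct
import torch

def flip_weight_bit(model, layer_name, flat_index, bit=23):
    """Emulate the online bit flip on one float32 weight of `model`."""
    weight = dict(model.named_parameters())[layer_name]
    with torch.no_grad():
        flat = weight.view(-1)
        (raw,) = struct.unpack("<I", struct.pack("<f", flat[flat_index].item()))
        flat[flat_index] = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))[0]
```

After this single write, triggered inputs should route to the attacker's class while clean inputs behave as before.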
Evaluation Results
Extensive evaluations across datasets including CIFAR-10, CIFAR-100, GTSRB, and ImageNet, using architectures such as ResNet-18, VGG-16, PreAct-ResNet-18, and ViT-B-16, show ONEFLIP achieving an average attack success rate (ASR) of 99.6% with negligible BAD averaging 0.06%. This outperforms prior methods such as TBT, TBA, and DeepVenom, which require flipping dozens to thousands of bits.
The attack’s efficiency stems from its direct weight-selection algorithm, which avoids the iterative optimization searches used in quantized-model attacks; its adaptability to various DNNs underscores how prevalent eligible weights are in classification layers.
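For intuition, such a direct scan of the classification layer might look like the following (an illustrative sketch assuming float32 weights and the example pattern above; `find_eligible_weights` is not a name from the released code):

```python
import struct
import torch

def find_eligible_weights(classifier_weight: torch.Tensor) -> list[int]:
    """Flat indices of positive weights whose exponent field is
    01111110 (i.e. values in [0.5, 1.0)), the paper's example pattern."""
    candidates = []
    for i, v in enumerate(classifier_weight.detach().view(-1).tolist()):
        if v <= 0:
            continue
        (raw,) = struct.unpack("<I", struct.pack("<f", v))
        if ((raw >> 23) & 0xFF) == 0b01111110:  # exponent field == 126
            candidates.append(i)
    return candidates
```

A single pass over the layer suffices, which is why no iterative search is needed.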
ONEFLIP exhibits strong resilience against backdoor defenses. It evades detection methods like Neural Cleanse, which target training-stage injections, by operating at inference time.
Mitigation via retraining is countered through an adaptive strategy that sequentially flips adjacent bits, maintaining high ASR (up to 99.9%) due to trigger transferability.
Input filtering defenses may struggle against ONEFLIP’s stealthy triggers, which can integrate advanced invisibility techniques.
This vulnerability highlights the need for enhanced hardware mitigations, such as improved DRAM error-correction codes, and periodic model integrity checks to protect AI deployments from such precise, low-overhead threats.
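One simple form of such an integrity check (a generic sketch, not a mechanism proposed in the paper) is to hash the deployed weights once and re-verify the digest periodically, so that even a single flipped bit changes the hash:

```python
import hashlib
import torch

def weights_digest(model: torch.nn.Module) -> str:
    """SHA-256 digest over all parameters and buffers, in a fixed order."""
    h = hashlib.sha256()
    for name, tensor in sorted(model.state_dict().items()):
        h.update(name.encode())
        h.update(tensor.detach().cpu().numpy().tobytes())
    return h.hexdigest()

# At deployment:   reference = weights_digest(model)
# Periodically:    assert weights_digest(model) == reference
```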
The researchers have released code for replication, emphasizing the critical hardware-software intersection in DNN security.