NVIDIA shares guidance to defend GDDR6 GPUs against Rowhammer attacks
NVIDIA is warning users to activate System Level Error-Correcting Code mitigation to protect against Rowhammer attacks on graphics processors with GDDR6 memory.
The company is reinforcing the recommendation as new research published by the University of Toronto demonstrates the practicality of Rowhammer attacks against an NVIDIA A6000 GPU (graphics processing unit).
“We ran GPUHammer on an NVIDIA RTX A6000 (48 GB GDDR6) across four DRAM banks and observed 8 distinct single-bit flips, and bit-flips across all tested banks,” describe the researchers.
“The minimum activation count (TRH) to induce a flip was ~12K, consistent with prior DDR4 findings.”
“Using these flips, we performed the first ML accuracy degradation attack using Rowhammer on a GPU.”
Rowhammer is a hardware fault that can be triggered through software processes and stems from memory cells being too close to each other. The attack was first demonstrated on CPU DRAM, but it can affect GPU memory, too.
It works by repeatedly accessing a memory row with read-write operations until the values of bits in adjacent rows flip from one to zero or vice versa, changing the information stored in memory.
The effect could be a denial-of-service condition, data corruption, or even privilege escalation.
System Level Error-Correcting Codes (ECC) can preserve the integrity of the data by adding redundant bits and correcting single-bit errors to maintain data reliability and accuracy.
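To illustrate the principle of redundant bits, here is a minimal Python sketch of a Hamming(7,4) code, a textbook scheme that corrects any single-bit error in a 7-bit codeword. It is not NVIDIA's ECC implementation (memory ECC uses wider codes), and the function names are illustrative.

```python
def hamming74_encode(nibble):
    """Encode 4 data bits (int 0..15) as a 7-bit codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    code = [0] * 8                             # 1-based positions; index 0 unused
    code[3], code[5], code[6], code[7] = d     # data bits sit at positions 3, 5, 6, 7
    code[1] = code[3] ^ code[5] ^ code[7]      # parity over positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]      # parity over positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]      # parity over positions 4, 5, 6, 7
    return code[1:]


def hamming74_decode(bits):
    """Return (nibble, syndrome); syndrome is the corrected position, 0 if clean."""
    code = [0] + list(bits)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                    # XOR of the positions of all set bits
    if syndrome:
        code[syndrome] ^= 1                    # flip the faulty bit back
    d = [code[3], code[5], code[6], code[7]]
    return sum(b << i for i, b in enumerate(d)), syndrome


word = hamming74_encode(0b1011)
word[4] ^= 1                                   # simulate a single bit flip at position 5
print(hamming74_decode(word))                  # (11, 5)
```

The decoder recovers the original value 11 and reports which position was flipped. Two flips in the same codeword would exceed what this small code can correct.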
In workstation and data center GPUs where VRAM handles large datasets and precise calculations related to AI workloads, ECC must be enabled to prevent crucial errors in their operation.
NVIDIA’s security notice notes that researchers at the University of Toronto showed “a potential Rowhammer attack against an NVIDIA A6000 GPU with GDDR6 Memory” where System-Level ECC was not enabled.
The academic researchers developed GPUHammer, an attack method to flip bits on GPU memories.
Although hammering is harder on GDDR6 because of higher latency and faster refresh compared with CPU-based DDR4, the researchers were able to demonstrate that Rowhammer attacks on GPU memory banks are possible.
Researcher Gururaj Saileshwar highlighted to BleepingComputer that GPUHammer can degrade AI model accuracy from 80% to below 1% with a single flip on an A6000 GPU.
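One reason a single flip can be so damaging is that flipping a high exponent bit in a floating-point number changes its magnitude by orders of magnitude. The Python sketch below assumes a model weight stored in IEEE 754 half precision (FP16); the weight value, the bit positions, and the helper name are illustrative and not taken from the research.

```python
import struct


def flip_fp16_bit(value, bit):
    """Flip one bit (0 = least significant, 15 = sign) of an FP16 value."""
    (raw,) = struct.unpack("<H", struct.pack("<e", value))  # float -> 16-bit pattern
    raw ^= 1 << bit                                         # the simulated bit flip
    (flipped,) = struct.unpack("<e", struct.pack("<H", raw))
    return flipped


print(flip_fp16_bit(0.5, 0))   # lowest mantissa bit: 0.50048828125
print(flip_fp16_bit(0.5, 14))  # highest exponent bit: 32768.0
```

A flip in the low mantissa bits barely changes the weight, while a flip in the top exponent bit turns 0.5 into 32768.0, a value large enough to dominate any computation it feeds into.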
Apart from the RTX A6000, the GPU maker also recommends enabling System-Level ECC for the following products:
Data Center GPUs:
- Ampere: A100, A40, A30, A16, A10, A2, A800
- Ada: L40S, L40, L4
- Hopper: H100, H200, GH200, H20, H800
- Blackwell: GB200, B200, B100
- Turing: T1000, T600, T400, T4
- Volta: Tesla V100, Tesla V100S
Workstation GPUs:
- Ampere RTX: A6000, A5000, A4500, A4000, A2000, A1000, A400
- Ada RTX: 6000, 5000, 4500, 4000, 4000 SFF, 2000
- Blackwell RTX PRO (newest workstation line)
- Turing RTX: 8000, 6000, 5000, 4000
- Volta: Quadro GV100
Embedded / Industrial:
- Jetson AGX Orin Industrial
- IGX Orin
The GPU maker notes that newer GPUs like Blackwell RTX 50 Series (GeForce), Blackwell Data Center GB200, B200, B100, and Hopper Data Center H100, H200, H20, and GH200, come with built-in on-die ECC protection, which does not require an intervention from the user.
One way to check if System Level ECC is enabled is to use an out-of-band method that utilizes the system’s BMC (Baseboard Management Controller) and hardware interface software, like the Redfish API, to check the “ECCModeEnabled” status.
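As a rough sketch of the out-of-band check, the Python below searches a parsed Redfish JSON response for the "ECCModeEnabled" property. The sample payload and its nesting are hypothetical; the actual resource path and schema depend on the BMC and platform, and fetching the document from the BMC is left out.

```python
import json


def find_property(node, name):
    """Return every value stored under key `name` anywhere in a parsed JSON tree."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == name:
                found.append(value)
            found.extend(find_property(value, name))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_property(item, name))
    return found


# Hypothetical response body, for illustration only.
sample = json.loads('{"Id": "GPU_0", "Oem": {"Nvidia": {"ECCModeEnabled": false}}}')
print(find_property(sample, "ECCModeEnabled"))  # [False]
```

Searching by key name instead of hard-coding a path keeps the sketch independent of where a given platform places the property.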
Tools like NSM Type 3 and NVIDIA SMBPBI can also be used for configuration, though they require access to the NVIDIA Partner Portal.
A second, in-band method also exists: running the nvidia-smi command-line utility on the host system to check ECC status and enable it where supported.
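A minimal sketch of the in-band check, wrapping nvidia-smi from Python. It assumes nvidia-smi is on the PATH and relies on its `--query-gpu` fields `ecc.mode.current` and `ecc.mode.pending`; the helper names are illustrative.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,ecc.mode.current,ecc.mode.pending",
    "--format=csv,noheader",
]


def parse_ecc_report(text):
    """Parse nvidia-smi CSV output into one dict per GPU."""
    rows = []
    for line in text.strip().splitlines():
        index, rest = line.split(",", 1)
        name, current, pending = rest.rsplit(",", 2)  # tolerate commas in the name
        rows.append({
            "index": int(index),
            "name": name.strip(),
            "current": current.strip(),   # ECC mode in effect now
            "pending": pending.strip(),   # ECC mode after the next reboot
        })
    return rows


def query_ecc():
    """Run nvidia-smi and return the parsed ECC status of every GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_ecc_report(result.stdout)
```

On a machine with the NVIDIA driver installed, `query_ecc()` returns one entry per GPU. ECC can then be switched on with `nvidia-smi -e 1`, which requires administrator privileges and takes effect after a reboot.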
Saileshwar estimates that these recommendations incur up to 10% slowdown for ML inference and 6.5% memory capacity loss across all workloads.
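To put the capacity figure in concrete terms, the short Python calculation below applies the estimated 6.5% loss to the 48 GB of the RTX A6000 used in the research. The function name is illustrative, and the result is simple arithmetic on the quoted estimate, not a measured value.

```python
def ecc_capacity(total_gb, overhead=0.065):
    """Split total memory into (usable, reserved) for a given ECC overhead."""
    reserved = total_gb * overhead
    return total_gb - reserved, reserved


usable, reserved = ecc_capacity(48)
print(f"{usable:.2f} GB usable, {reserved:.2f} GB reserved")
# 44.88 GB usable, 3.12 GB reserved
```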
Rowhammer represents a real security concern that could cause data corruption or enable attacks in multi-tenant environments like cloud servers where vulnerable GPUs may be deployed.
However, the real risk is context-dependent. Exploiting Rowhammer reliably is complicated and requires specific conditions, high access rates, and precise control, which makes the attack difficult to execute.
Update 7/12 – Added links to the research and details provided by the researchers.