Critical RCE Flaws in AI Inference Engines Expose Meta, Nvidia, and Microsoft Frameworks

Security researchers at Oligo Security have uncovered a series of critical Remote Code Execution (RCE) vulnerabilities affecting widely deployed AI inference servers from major technology companies.

The flaws affect frameworks developed by Meta, NVIDIA, and Microsoft, as well as open-source projects such as vLLM, SGLang, and Modular, potentially exposing enterprise AI infrastructure to serious security risks.

| CVE ID | Affected Product | Severity | Vulnerability Type | Patched Version |
|---|---|---|---|---|
| CVE-2024-50050 | Meta Llama Stack | Critical | Remote Code Execution (RCE) | ≥ v0.0.41 |
| CVE-2025-30165 | vLLM | Critical | Remote Code Execution (RCE) | ≥ v0.8.0 |
| CVE-2025-23254 | NVIDIA TensorRT-LLM | Critical (CVSS 9.3) | Remote Code Execution (RCE) | ≥ v0.18.2 |
| CVE-2025-60455 | Modular Max Server | Critical | Remote Code Execution (RCE) | ≥ v25.6 |

The ShadowMQ Vulnerability Pattern

The vulnerabilities stem from a common root cause the researchers have dubbed ShadowMQ: the unsafe use of ZeroMQ (ZMQ) combined with Python’s pickle deserialization mechanism.

This security flaw spread across multiple AI frameworks through code reuse, with developers copying vulnerable code patterns from one project to another, sometimes line-for-line.

The problem began with Meta’s Llama Stack, where researchers discovered the use of ZMQ’s recv_pyobj() method, which deserializes incoming data using Python’s pickle module.

This creates a critical security issue because pickle can execute arbitrary code during deserialization; when the receiving socket is exposed over the network without authentication, remote attackers can send payloads that execute malicious code on the host.
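
As a hedged illustration of this pattern (a minimal sketch, not code taken from Llama Stack or any other affected project), a worker that calls recv_pyobj() on an unauthenticated socket is effectively handing pickle.loads() to anyone who can reach the port:

```python
import zmq

def handle(task):
    """Hypothetical task handler; stands in for the engine's real work loop."""
    print("received:", task)

# Illustrative sketch of the unsafe pattern described above.
ctx = zmq.Context.instance()
sock = ctx.socket(zmq.PULL)
sock.bind("tcp://*:5555")        # listens on every interface with no authentication

while True:
    # recv_pyobj() is a thin wrapper around pickle.loads(), so a crafted
    # payload can run arbitrary code the moment it is deserialized.
    task = sock.recv_pyobj()
    handle(task)
```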

The vulnerability pattern appeared across major AI inference engines that form the backbone of enterprise AI operations. NVIDIA’s TensorRT-LLM, the PyTorch-ecosystem projects vLLM and SGLang, and Modular Max Server all contained nearly identical unsafe patterns.

In SGLang’s case, the vulnerable code file literally began with “Adapted from vLLM,” demonstrating how security flaws propagated through code copying.

Organizations using these frameworks include major technology companies such as xAI, AMD, Intel, LinkedIn, Oracle Cloud, Google Cloud, Microsoft Azure, and AWS, as well as universities such as MIT, Stanford, UC Berkeley, and Tsinghua University.

Researchers identified thousands of exposed ZMQ sockets communicating unencrypted over the public internet, some of which clearly belonged to production inference servers.

Successful exploitation could allow attackers to execute arbitrary code on GPU clusters, escalate privileges to internal systems, exfiltrate sensitive model data, or install cryptominers.

Meta, NVIDIA, vLLM, and Modular responded quickly with patches that replaced pickle with safer serialization mechanisms like JSON or msgpack and implemented HMAC validation.
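
A minimal sketch of what such a fix can look like (not the actual code of any patched project) pairs msgpack serialization with an HMAC-SHA256 tag computed over the raw bytes; the shared key and the two-frame message layout here are illustrative assumptions:

```python
import hashlib
import hmac

import msgpack
import zmq

SHARED_KEY = b"replace-with-a-real-secret"   # illustrative pre-shared key

def send_msg(sock: zmq.Socket, obj) -> None:
    # msgpack serializes plain data only; unlike pickle, unpacking it cannot run code.
    payload = msgpack.packb(obj, use_bin_type=True)
    tag = hmac.new(SHARED_KEY, payload, hashlib.sha256).digest()
    sock.send_multipart([tag, payload])

def recv_msg(sock: zmq.Socket):
    tag, payload = sock.recv_multipart()
    expected = hmac.new(SHARED_KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):   # reject tampered or unauthenticated frames
        raise ValueError("HMAC verification failed")
    return msgpack.unpackb(payload, raw=False)
```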

However, some projects, including Microsoft’s Sarathi-Serve, remain vulnerable, representing what the researchers call “Shadow Vulnerabilities”: known issues without CVEs that persist quietly in production environments.

Organizations using AI inference frameworks should immediately patch to secure versions.

Developers must avoid using pickle or recv_pyobj() with untrusted data, implement authentication mechanisms like HMAC or TLS for ZMQ-based communications, and scan for exposed ZMQ endpoints.
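
ZeroMQ has no native TLS support; its built-in equivalent is CurveZMQ, which pyzmq exposes through socket options. The sketch below illustrates the idea under simplified assumptions: keys are generated in-process and the loopback address is a placeholder, whereas in practice keys are provisioned out of band, and restricting which client keys may connect also requires a ZAP authenticator such as pyzmq's ThreadAuthenticator.

```python
import zmq

# Sketch of CurveZMQ, ZMQ's built-in alternative to TLS. Keys are generated
# in-process here purely for illustration.
server_public, server_secret = zmq.curve_keypair()
client_public, client_secret = zmq.curve_keypair()

ctx = zmq.Context.instance()

# Server side: traffic is encrypted, and clients authenticate the server
# by its public key.
server = ctx.socket(zmq.REP)
server.curve_secretkey = server_secret
server.curve_publickey = server_public
server.curve_server = True
server.bind("tcp://127.0.0.1:5556")      # loopback placeholder address

# Client side: needs its own keypair plus the server's public key.
client = ctx.socket(zmq.REQ)
client.curve_secretkey = client_secret
client.curve_publickey = client_public
client.curve_serverkey = server_public
client.connect("tcp://127.0.0.1:5556")
```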

Network access should be restricted by binding to specific interfaces rather than using “tcp://*”, which exposes sockets on all network interfaces.
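
Concretely, with pyzmq that means passing an explicit address to bind() instead of the wildcard (the port and addresses below are placeholders):

```python
import zmq

ctx = zmq.Context.instance()
sock = ctx.socket(zmq.PULL)

# Avoid: sock.bind("tcp://*:5555")      # wildcard exposes the socket on every interface
sock.bind("tcp://127.0.0.1:5555")       # loopback only; or bind a specific internal
                                        # interface, e.g. "tcp://10.0.0.5:5555" (placeholder)
```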

This discovery highlights a critical lesson for the AI ecosystem: code reuse accelerates development but can also propagate security vulnerabilities at scale.

As AI infrastructure continues expanding, security auditing must keep pace with rapid development cycles.
