NVIDIA research shows how agentic AI fails under attack

NVIDIA research shows how agentic AI fails under attack

Enterprises are rushing to deploy agentic systems that plan, use tools, and make decisions with less human guidance than earlier AI models. This new class of systems also brings new kinds of risk that appear in the interactions between models, tools, data sources, and memory stores.

A research team from NVIDIA and Lakera AI has released a safety and security framework that tries to map these risks and measure them inside real workflows. The work includes a new taxonomy, a dynamic evaluation method, and a detailed case study of NVIDIA’s AI-Q Research Assistant. The authors also released a dataset with more than ten thousand traces from attack and defense runs to support outside research.

Architecture overview of NVIDIA AI-Q Research Assistant agent

Agentic systems need new testing methods

Agentic systems behave in ways that are harder to predict and test. The paper explains that these systems rely on LLMs that generate plans and actions that can vary even with the same inputs. This leads to hazards that can appear in many parts of the workflow and can also grow through compounding effects when one step influences another. Traditional LLM testing tends to look at prompt and response behavior. The authors argue that this misses system level risks that emerge from the way tools, memory, and other components shape the final outcome.

The framework treats safety as the prevention of unacceptable outcomes for people and organizations and treats security as protection against adversarial compromise. Since a security failure can produce a safety harm, the two ideas are examined together. The paper outlines how prompt injection, memory poisoning, tool misuse, and retrieval of untrusted content can cause harmful results for users even if the underlying model behaves as intended.

Building a practical taxonomy of risks

The authors introduce an operational taxonomy that connects component risks to system harms. This includes low impact issues such as tool selection errors and grounding problems in retrieval, medium impact risks such as PII exposure and memory leaks, and high impact risks such as permission compromise, agent deception, and multi agent collusion. The taxonomy is designed to help teams measure progress and track which parts of the system need more attention. It also supports compositional risk assessment, where system level risk is treated as a combination of component level risks that can interact in unexpected ways.

This approach reflects the need for stronger observability across agentic systems. The authors call for end to end traces and audit logs to support investigation of cascading failures. They also highlight the need for consistent intermediate state representations so that safety agents can evaluate actions inside the workflow rather than only at the end.

A dynamic framework that embeds attackers, defenders, and evaluators

A major part of the paper describes a safety and security framework that sits inside the agentic workflow. It runs through two phases. The first is risk discovery and evaluation, where attacker and evaluator agents operate in a sandbox. The second is defense and monitoring, where mitigations are deployed and evaluator agents continue to watch for new problems during live operation.

The global safety agent sets policy and maintains the authoritative state. Local attacker agents inject threats into the workflow at many points, including retrieved documents, tool outputs, or intermediate steps. Local defender agents validate function calls, check inputs and outputs, enforce permission rules, and apply other guardrails. Local evaluator agents record metrics such as tool selection quality, grounding of retrieved text, and the rate of dangerous actions. This design allows continuous measurement as the system evolves and as new risks appear.

Probing the system through targeted attacks

The authors introduce a method called Agent Red Teaming via Probes. It is built for agentic systems with many moving parts and avoids the limitations of standard prompt injection testing. Instead of trying to craft inputs that survive retrieval ranking or tool routing, evaluators can inject adversarial content directly at specific nodes of the workflow. These injection points are paired with evaluation probes that observe how the threat behaves as it moves through the system.

A threat snapshot defines the scenario. It includes the attack objective, the injection point, the evaluation points, and the metric used to score the result. This level of structure allows teams to test realistic scenarios and track how results change across versions. The method supports both direct user misuse tests and indirect attacks that appear in external sources such as RAG chunks or web search outputs.

What the case study revealed

The framework is demonstrated through the AI-Q Research Assistant, a multi step RAG based system that generates reports for biomedical and financial use cases. The research team instrumented the system with probes at user inputs, search tool outputs, and all summarization stages. They created twenty two threat snapshots across categories that include memory poisoning, denial of service, jailbreaks, bias, content safety, PII exposure, action completion, and cybersecurity risks. Each scenario used twenty one attacks and was executed five times to capture non deterministic behavior. This produced more than six thousand risk measurements across three evaluation nodes in the workflow.

The paper reports that attack behavior changes as adversarial content moves from early summarization to later refinement and finalization. Some risks grow weaker as text passes through more processing steps. Others persist. The study also shows the importance of multiple guardrail layers. For example, the authors tested the reliability of the judge metric by comparing it with human labels and found that it matched human decisions in 76.8 percent of sampled outputs, which helped calibrate error margins for automated evaluation.

A path toward safer deployments

The authors stress that static testing will not reveal every emergent risk in agentic systems. They argue that safety agents, probing tools, and continuous evaluators embedded in the workflow can give teams the visibility they need for safe deployment at scale. The dataset released with the study provides a large sample of real attacks and defenses, which the authors hope will support better research on agentic risk.

NVIDIA research shows how agentic AI fails under attack

Report: AI in Identity Security Demands a New Playbook



Source link