HelpnetSecurity

Researchers build an encrypted routing layer for private AI inference


Organizations in healthcare, finance, and other sensitive industries want to use large AI models without exposing private data to the cloud servers running those models. A cryptographic technique called Secure Multi-Party Computation (MPC) makes this possible. It splits data into encrypted fragments, distributes them across two or more servers that do not share information with each other, and lets those servers compute an AI result without any single server ever seeing the raw input.
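The splitting step can be illustrated with additive secret sharing, one of the standard building blocks behind MPC. The sketch below is purely illustrative: real MPC frameworks use optimized protocols for multiplication and the non-linear operations inside neural networks, but the core idea, that each server holds a random-looking share and only the combination reveals the value, is the same.

```python
import secrets

# Additive secret sharing over a 64-bit ring: a minimal sketch of how MPC
# splits a value into fragments. Each server receives one share; neither
# share alone carries any information about the original value.
RING = 2**64

def share(value: int) -> tuple[int, int]:
    """Split `value` into two random shares that sum to it mod RING."""
    share_a = secrets.randbelow(RING)
    share_b = (value - share_a) % RING
    return share_a, share_b

def reconstruct(share_a: int, share_b: int) -> int:
    """Only the combination of both shares recovers the original value."""
    return (share_a + share_b) % RING

# Addition can be done locally, share by share: each server adds the
# shares it holds, and the result is a valid sharing of the true sum.
x_a, x_b = share(25)
y_a, y_b = share(17)
sum_a, sum_b = (x_a + y_a) % RING, (x_b + y_b) % RING
assert reconstruct(sum_a, sum_b) == 42
```

Linear operations like this are cheap under MPC; it is the non-linear layers of a language model that account for most of the overhead described above.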

The catch is speed. A standard mid-sized language model that returns a result in under a second when running normally can take more than 60 seconds when processed under MPC. The encryption overhead is that large.

Why existing solutions only go so far

Prior work on private inference has focused on redesigning AI models to be less expensive to run under encryption. Those efforts help, but they all share one structural limitation: every query, regardless of complexity, goes through the same model at the same cost.

In ordinary AI deployment, a common optimization is to route simple queries to a small, fast model and reserve large, expensive models for queries that genuinely need them. This kind of routing is standard practice in plaintext systems. Applying it under encryption is difficult because the routing decision itself would normally require reading the input, and the input must stay encrypted throughout.
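In plaintext systems, such routing can be as simple as scoring each query with a cheap heuristic and dispatching accordingly. The toy sketch below shows the idea; the model names and the word-count heuristic are hypothetical stand-ins, not the paper's method, and the whole point of SecureRouter is that this decision must instead be made without ever reading the query in the clear.

```python
# Toy plaintext router: estimate query difficulty with a cheap heuristic
# and send easy queries to a small model, hard ones to a large model.
# `small_model` / `large_model` are illustrative labels only.
def difficulty(query: str) -> float:
    # Crude proxy: longer queries are assumed to be harder.
    return len(query.split()) / 50.0

def route(query: str, threshold: float = 0.5) -> str:
    return "large_model" if difficulty(query) >= threshold else "small_model"

print(route("What time is it?"))        # short query -> small model
print(route(" ".join(["clause"] * 40))) # long query -> large model
```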

What SecureRouter does

Researchers at the University of Central Florida built a system called SecureRouter that brings input-adaptive routing to encrypted AI inference. The system maintains a pool of models at different sizes, from a very small model with around 4.4 million parameters to a large one with around 340 million. A lightweight routing component evaluates each incoming encrypted query and selects which model in the pool should handle it, entirely under encryption. The routing decision is never exposed in plaintext.

The router is trained to weigh accuracy against computational cost, where cost is measured in terms of encrypted execution time rather than the parameter counts typically used in plaintext systems. A load-balancing objective prevents the router from defaulting to a single model for all queries.
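The shape of such an objective can be sketched as follows. This is a hedged illustration of the trade-off described, not the paper's actual formulation: the weights, the entropy-based balancing term, and the function names are all assumptions made for the example.

```python
import math

def softmax(logits):
    """Convert routing logits into a probability over the model pool."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def router_loss(routing_logits, task_losses, mpc_costs,
                cost_weight=0.1, balance_weight=0.01):
    """Illustrative routing objective: expected task loss, plus a penalty
    on expected encrypted-execution cost, minus an entropy bonus that
    discourages collapsing onto a single model."""
    probs = softmax(routing_logits)
    expected_task_loss = sum(p * l for p, l in zip(probs, task_losses))
    expected_cost = sum(p * c for p, c in zip(probs, mpc_costs))
    entropy = -sum(p * math.log(p + 1e-12) for p in probs)
    return (expected_task_loss
            + cost_weight * expected_cost
            - balance_weight * entropy)
```

Note that the cost term here stands in for measured encrypted execution time, as the article describes, rather than parameter counts.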

Figure: The SecureRouter framework, divided into an offline training phase and an online inference phase. The diagram is simplified to focus on the user and the end-to-end private inference service provider. (Source: Research paper)

How much faster it runs

Tested against SecFormer, a private inference system that uses a fixed large model, SecureRouter reduced average inference time by a factor of 1.95 across five language understanding tasks. Speedups ranged from 1.83× on the most demanding task to 2.19× on the simplest one, reflecting the router’s ability to match model size to query difficulty.

Compared to running a large model on every query regardless of complexity, the average speedup across eight benchmark tasks was 1.53×. On most tasks, accuracy was within a fraction of a percentage point of the large-model baseline. One task involving grammatical analysis saw a more noticeable accuracy drop, suggesting that some highly specialized tasks are sensitive to being handled by a smaller model.

The overhead is small

Adding a routing layer to an encrypted inference system could itself become a bottleneck. In practice, the routing component consumes about 39 MB of memory in a two-server setup, compared to 38 MB for the smallest model in the pool running alone. The largest model in the pool requires around 3,100 MB. The router adds approximately 4 seconds to inference time and 1.86 GB of network communication, figures comparable to running the smallest model by itself.

What this means in practice

The system does not require rebuilding existing infrastructure. It sits on top of existing MPC frameworks and uses standard language model architectures available through common libraries. Queries that are straightforward get resolved quickly using a small model. Queries that require more capacity are escalated to a larger one. The client submitting the query sees only the final result and learns nothing about which model processed the request.
