The UK government’s Artificial Intelligence Safety Institute (AISI) has announced it will establish offices in San Francisco, as it publicly releases its artificial intelligence (AI) model safety testing results for the first time.
Established in the run-up to the UK AI Safety Summit in November 2023, the AISI is tasked with examining, evaluating and testing new types of AI, and is already collaborating with its US counterpart to share capabilities and build common approaches to AI safety testing.
Building on this collaboration, the AISI will open offices in San Francisco over the summer to further cement its relationship with the US’s own Safety Institute, as well as make further inroads with leading AI companies headquartered there, such as Anthropic and OpenAI.
With just over 30 staff in London, the AISI will also use the US expansion to gain greater access to tech talent in the Bay Area, with plans to first hire a research director and a team of technical staff.
However, there is currently no further information on which specific roles the Institute will be looking to hire for, or how many.
“This expansion represents British leadership in AI in action,” said digital secretary Michelle Donelan. “It is a pivotal moment in the UK’s ability to study both the risks and potential of AI from a global lens, strengthening our partnership with the US and paving the way for other countries to tap into our expertise as we continue to lead the world on AI safety.
“Opening our doors overseas and building on our alliance with the US is central to my plan to set new, international standards on AI safety, which we will discuss at the Seoul Summit this week.”
Safety testing results
The expansion follows the AISI publicly releasing a selection of results from its recent safety testing of five publicly available advanced large language models (LLMs).
The models were assessed against four key risk areas – cyber security, chemistry and biology, autonomy, and safeguards – with a particular focus on how effective the safeguards developers have installed actually are in practice.
The AISI found that none of the models were able to do more complex, time-consuming tasks without humans overseeing them, and that all of them remain highly vulnerable to basic “jailbreaks” of their safeguards. It also found that some of the models will produce harmful outputs even without dedicated attempts to circumvent these safeguards.
However, the AISI claims the models were capable of completing basic to intermediate cyber security challenges, and that several demonstrated a PhD-equivalent level of knowledge in chemistry and biology, meaning their replies to science-based questions were on par with those given by PhD-level experts and they could be used to obtain expert-level knowledge.
The models also underwent “agent” evaluations to test how well they can autonomously perform tasks such as executing code or navigating websites. The AISI found that while models often made small errors (such as syntax errors in code) during short-horizon tasks, they were unable to complete long-horizon tasks that required a deeper level of planning to execute.
This is because, despite making good initial plans, the models were unable to correct their early mistakes, failed to adequately test the solutions they devised, and often “hallucinated” the completion of sub-tasks.
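The AISI has not published the harness behind these agent evaluations, but the general shape of a short-horizon task can be illustrated with a minimal sketch: the model is given a goal, proposes shell commands, has them executed in a sandbox, and is scored on whether the goal state is reached within a small step budget. Everything in the sketch below is hypothetical and for illustration only – the run_model stand-in and the example task are not the AISI’s own tooling.

```python
import os
import subprocess

MAX_STEPS = 10  # short-horizon budget; long-horizon tasks demand far more planning


def run_model(transcript: str) -> str:
    """Stand-in for a real LLM call: in a real evaluation this would send the
    transcript to a model API and return the model's next proposed shell command.
    Here it returns a fixed command so the sketch runs end to end."""
    return "touch report.txt"


def short_horizon_agent_eval(goal: str, check_success) -> bool:
    """Drive the model through a propose-execute-observe loop and score the end state."""
    transcript = f"Goal: {goal}\n"
    for _ in range(MAX_STEPS):
        command = run_model(transcript)
        # Execute the proposed command (in practice, inside an isolated sandbox)
        result = subprocess.run(
            command, shell=True, capture_output=True, text=True, timeout=60
        )
        transcript += f"$ {command}\n{result.stdout}{result.stderr}\n"
        if check_success():
            return True  # goal reached within the step budget
    return False  # ran out of steps, e.g. after unrecovered errors or hallucinated progress


# Hypothetical example task: the model must create a file called report.txt
if __name__ == "__main__":
    passed = short_horizon_agent_eval(
        "Create an empty file called report.txt in the current directory",
        lambda: os.path.exists("report.txt"),
    )
    print("passed" if passed else "failed")
```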
Prompt attacks
While developers of LLMs will fine-tune them to be safe for public use (meaning they are trained to avoid illegal, toxic or explicit outputs), the AISI found these safeguards can often be overcome with relatively simple prompt attacks.
“The results of these tests mark the first time we’ve been able to share some details of our model evaluation work with the public,” said AISI chair Ian Hogarth. “Our evaluations will help to contribute to an empirical assessment of model capabilities and the lack of robustness when it comes to existing safeguards.
“AI safety is still a very young and emerging field,” he said. “These results represent only a small portion of the evaluation approach AISI is developing. Our ambition is to continue pushing the frontier of this field by developing state-of-the-art evaluations, with an emphasis on national security-related risks.”
However, the AISI has declined to say publicly which companies’ models it has tested, and is clear that the results provide only a snapshot of model capabilities and do not designate systems as “safe” or “unsafe” in any formal capacity.
The release of the results follows the AISI making its Inspect evaluations platform publicly available in early May 2024, which aims to make it easier for a much wider range of groups to develop AI evaluations and strengthen the testing ecosystem.
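Inspect is an open-source Python framework, so outside groups can write their own evaluations against it. As a rough illustration of the pattern it uses – a dataset of samples, a solver and a scorer – a minimal task might look something like the sketch below. The toy questions are invented for illustration, and module and parameter names may differ between Inspect versions (early releases, for instance, called the solver argument plan).

```python
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate, system_message


@task
def toy_capability_check():
    # A toy dataset: each Sample pairs an input prompt with the expected answer
    dataset = [
        Sample(input="What is the capital of France?", target="Paris"),
        Sample(input="What is 17 + 25?", target="42"),
    ]
    return Task(
        dataset=dataset,
        solver=[system_message("Answer as concisely as possible."), generate()],
        scorer=match(),  # marks a sample correct when the output matches the target
    )
```

A task like this would typically be run from the command line with something along the lines of `inspect eval toy_eval.py --model <provider/model-name>`, with Inspect handling the model calls, logging and scoring.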
Limits of AISI testing
In a blog post published 17 May 2024, the Ada Lovelace Institute (ALI) questioned the overall effectiveness of the AISI and the dominant approach of model evaluations in the AI safety space. It also questioned the voluntary testing framework that means the AISI can only gain access to models with the agreement of companies.
It said that while evaluations have some value for exploring model capabilities, they are not sufficient for determining whether AI models and the products or applications built on them are safe for people and society in real-world conditions.
This is because of the technical and practical limitations of methods such as red teaming and benchmarking, which are easy to manipulate or game (either by training models on the evaluation dataset or by strategically choosing which evaluations are used in the assessment); and because of the iterative nature of AI development, which means small changes to models can cause unpredictable changes in their behaviour or override the safety features in place.
The ALI added that the safety of an AI system is not an inherent property that can be evaluated in a vacuum, and that models therefore need to be tested and assessed on their impacts in specific contexts or environments. “There are valuable tests to be done in a lab setting and there are important safety interventions to be made at the model level, but they don’t provide the full story,” it said.
It added that all of these issues are exacerbated by the AISI’s voluntary framework, which it said prevents effective access to models (as shown by recent reporting in Politico revealing that three of the four major foundation model developers have failed to provide the agreed pre-release access to the AISI for their latest cutting-edge models).
“The limits of the voluntary regime extend beyond access and also affect the design of evaluations,” it said. “According to many evaluators we spoke with, current evaluation practices are better suited to the interests of companies than publics or regulators. Within major tech companies, commercial incentives lead them to prioritise evaluations of performance and of safety issues posing reputational risks (rather than safety issues that might have a more significant societal impact).”
The ALI added that the AISI is also powerless to prevent the release of harmful or unsafe models, and is completely unable to impose conditions on release, such as further testing or specific safety measures.
“In short, a testing regime is only meaningful with pre-market approval powers underpinned by statute,” it said.
However, in a blog post of its own, the AISI said it is “acutely aware” of the potential gap between how advanced AI systems perform in its evaluations and how they may perform in the wild.
“Users might interact with models in ways that we have not anticipated, surfacing harms that our evaluations cannot capture,” it said. “Further, model evaluations are only part of the picture. We think it is also important to study the direct impact that advanced AI systems may have on the user. We have research underway to understand and address these issues.
“Our work does not provide any assurance that a model is ‘safe’ or ‘unsafe’. However, we hope that it contributes to an emerging picture of model capabilities and the robustness of existing safeguards.”