Real-world testing of live facial recognition (LFR) systems by UK and European police is a largely ungoverned “Wild West”, where the technology is tested on local populations without adequate safeguards or oversight, say university researchers.
According to a comparative study of LFR trials by law enforcement agencies in London, Wales, Berlin and Nice, “in-the-wild” testing is an important opportunity to collect information about how artificial intelligence (AI)-based systems such as LFR perform in real-world deployment environments, but the trials conducted so far have failed to take into account the socio-technical impacts of the systems in use, or to generate clear evidence of their operational benefits.
The paper – published by Karen Yeung, an interdisciplinary professorial fellow in law, ethics and informatics at Birmingham Law School, and Wenlong Li, a research professor at Guanghua Law School, Zhejiang University – added that clear guidance and governance frameworks need to be in place to ensure trials are conducted in “an epistemically, legally and ethically responsible manner”.
Without this, the authors said “we worry that such tests will be little more than ‘show trials’ – public performances used to legitimise the use of powerful and invasive digital technologies in support of controversial political agendas for which public debate and deliberation is lacking, while deepening governmental reliance on commercially developed technologies which fall far short of the legal and constitutional standards which public authorities are required to uphold”.
To ensure there is responsible scrutiny of LFR systems, Yeung and Li said it is “vital” to properly consider “the highly powerful, intrusive and scalable properties” of LFR, particularly given its capacity for misuse and overreach in ways that interfere with rights to privacy, freedom of expression, freedom of assembly, and to go about one’s lawful activity in public without unjustified interference by the state.
Given the scope for interference with people’s rights, the authors said that evidence of the technology’s effectiveness in producing its desired benefits “must pass an exceptionally high threshold” if police want to justify its use.
They added that without a rigorous and full accounting of the technology’s effects – which is currently not taking place in either the UK or Europe – its use could lead to the “incremental and insidious removal” of the conditions that underpin our rights and freedoms.
“If we are to take seriously our basic freedom to go about our lawful business in public spaces without state interference, and the opportunity for self-creation and development which this freedom affords, then we must be vigilant,” they said.
“This includes a need to test these technologies in a responsible manner to demonstrate that they do in fact generate valuable social benefits, given their economic and other costs, and to undertake such tests responsibly.”
Problematic trials
Highlighting the example of the Met’s LFR trials – conducted across 10 deployments between 2016 and 2020 – Yeung and Li said the characterisation of these tests as “trials” is “seriously questionable” given their resemblance to active police operations.
“Although described as ‘trials’ to publicly indicate that their use on these occasions did not necessarily reflect a decision to adopt and deploy FRT on a permanent basis, they were decidedly ‘real’ in the legal and social consequences for those whose faces triggered a match alert,” they wrote, adding that this means the trials were limited to assessing the system’s operational performance against a specific organisational outcome (making arrests), rather than attempting to evaluate its wider socio-technical processes and impacts.
“The primary benefit typically assumed to flow from using live FRT is the enhanced capacity to identify and apprehend wanted individuals whose facial images are stored on FRT watchlists. Whether it actually generates these organisational benefits in real-world settings remains unknown and has not been systematically tested,” they said.
The authors added that although translating automated LFR alerts into lawful arrests requires the successful integration of multiple technical, organisational, environmental and human components, this has not been adequately grasped by the Met.
For example, they noted that given the potential for erroneous matches, the mere generation of an LFR match alert is not in itself enough to constitute reasonable suspicion (which UK police are required to demonstrate in order to legally stop and detain people).
“Although police officers in England and Wales are entitled to stop individuals and ask them questions about who they are and what they are doing, individuals are not obliged to answer these questions in the absence of reasonable suspicion that they have been involved in the commission of a crime,” they wrote.
“Accordingly, any initial attempt by police officers to stop and question an individual whose face is matched to the watchlist must be undertaken on the basis that the individual is not legally obliged to cooperate for that reason alone.”
‘Presumption to intervene’
However, despite this requirement for reasonable suspicion, Yeung and Li noted that previous evaluations of the London trials identified a discernible “presumption to intervene” among officers, meaning it was standard practice for them to engage an individual when prompted to do so by an algorithmic match alert.
Given the nuanced dynamics at play in the operation of LFR, the paper highlights the need for clear organisational policy, operational protocols and proper officer training.
Yeung and Li also noted that while the detection of wanted individuals was the stated goal of the London trials, the force and its representatives have claimed a range of ancillary benefits, such as the disruption and deterrence of crime, that have not yet been proven.
“Deterrence effects are often difficult to substantiate, even when apparent, and although LFR was justified on the basis of offering ‘reassurance to the community by providing a visible symbol that crime was being tackled’, no evidence was provided to support such claims that LFR had a positive effect in allaying the fear of crime,” they said.
Yeung and Li added that even when independent evaluations were conducted by the National Physical Laboratory (NPL), its 2019 report made “the bold statement that ‘the trials indicate that LFR will help the MPS stop dangerous people and make London safer’ without offering any concrete evidence concerning how this conclusion is arrived at”.
The authors were similarly critical of South Wales Police (SWP) – which conducted 69 “operational” trials between 2017 and 2020 – noting that the Cardiff University researchers who carried out an independent evaluation of the tests were “unable to quantify what, if any, impact the technology had on crime prevention”.
They further highlighted the lack of rigour in SWP’s watchlist creation, particularly regarding the quality of the images included and the overall size of the lists, which varied widely.
“Even the Cardiff researchers evaluating the trials stated that they were unable to identify precisely how and why the size and composition of the watchlist mattered, although ‘it does seem to make a difference’, while recommending that ‘decisions about watchlist size and composition’ be made available for public scrutiny,” they said, adding that the report concluded with a call for a “realistic and evidence-led approach” to police LFR evaluations.
Commenting further on the “operational” trial approach of the Met and SWP, the authors added that “false negatives could not be identified and measured” as a result.
“In other words, the number of individuals whose facial images were included on the FRT watchlist and who passed in front of the camera without triggering a match alert (i.e., the ‘ones that got away’) could not be detected and measured,” they said.
“Hence, the results generated from the London and Welsh trials were not in fact indicators of software-matching accuracy, for they could only generate data concerning recorded true and false positives.”
Commenting further on the validity of the NPL evaluations, Yeung and Li noted that they make no mention of false negatives, “presumably because this data could not be collected given that the tests were designed to identify and apprehend ‘live’ targets”.
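To make the distinction concrete, the sketch below (a minimal Python illustration using invented counts, not figures from any trial) shows why a deployment that only logs match alerts can yield a precision figure but not recall: computing recall requires knowing how many watchlisted faces passed the cameras without triggering an alert, which is exactly the false-negative count the London and Welsh trials could not capture.

```python
# Minimal illustration (hypothetical counts, not data from any trial):
# alert-only logging supports precision but not recall.

from typing import Optional

def precision(true_positives: int, false_positives: int) -> float:
    """Share of match alerts that correctly identified a watchlisted person."""
    return true_positives / (true_positives + false_positives)

def recall(true_positives: int, false_negatives: Optional[int]) -> Optional[float]:
    """Share of watchlisted people passing the camera who triggered an alert.

    Requires the false-negative count, which alert-only logging cannot provide.
    """
    if false_negatives is None:
        return None  # the 'ones that got away' were never observed
    return true_positives / (true_positives + false_negatives)

# Hypothetical operational deployment: every alert is recorded and adjudicated,
# but nobody knows how many watchlisted faces walked past without an alert.
tp, fp = 8, 4   # adjudicated alerts (invented numbers)
fn = None       # unknowable in an operational trial

print(f"precision = {precision(tp, fp):.2f}")  # computable: 0.67
print(f"recall    = {recall(tp, fn)}")         # None: cannot be computed
```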
Socio-technical evaluations required
Yeung and Li highlighted that, unlike the British LFR trials, those in Nice and Berlin used volunteers who had provided informed consent, rather than random unsuspecting members of the public, and sought to assess the systems’ functional performance in a technical, rather than operational, sense.
However, each case also had its own problems. In Nice, for example, just eight volunteers were involved, while it was later revealed that volunteers in Berlin – whose images were captured alongside those of passersby who had not consented – were not informed of “the full range of personal data collected about them”, which included their speed and temperature, recorded by transmitters.
Despite the more modest aims of the Berlin and Nice trials to test the technical functionality of the system – compared with the trials conducted in London and Wales that sought to test the systems in live operational environments – the authors were clear that the Berlin and Nice trials similarly failed to take into account the wider socio-technical systems that LFR is embedded in.
“None of these trials generated clear evidence that the technology actually delivered real-world benefits in the form of improved operational efficiency in locating, identifying and apprehending criminal suspects by law enforcement authorities, nor by how much,” they said.
Yeung and Li concluded that while the use of LFR by police generates “a host of significant yet thorny legal and ethical dangers”, these have not yet been grasped by the policing agencies wishing to deploy the technology.
“At least for highly rights-intrusive technologies, particularly biometric surveillance systems that can be deployed remotely, such as live FRT, we must insist upon evidence of real-world benefits such that their adverse impacts on fundamental rights and democratic freedom can be justified in accordance with legal tests of necessity and proportion,” they said.
“In other words, establishing the trustworthiness of live FRT requires more than evidence of the software’s matching capabilities in the field. It also requires testing whether the software can be integrated successfully into a complex organisational, human and socio-technical system to generate the expected benefits to the deploying organisation that can be plausibly regarded as socially beneficial, in a manner that is consistent with respect for fundamental rights.”
The Met and SWP respond
Responding to the paper, a Met Police spokesperson said: “We believe our use of LFR is both lawful and proportionate, playing a key role in keeping Londoners safe. We recognise early operational testing of LFR in 2016 to 2020 was limited.
“This is why we commissioned the National Physical Laboratory to carry out independent testing. This has helped the Met understand how to operate the technology in a fair and equitable way.”
They added that as more operational experience was gained, the success of deployments increased. “At the end of 2023, we moved to a strategy of deploying LFR to crime hotspots across London,” they said. “We work very closely with stakeholders and the community, and are seeing strong support for the development of this technology.”
The force added that so far in 2025 there have been only 12 false alerts, while the technology has scanned more than 2.5 million people’s faces, and that since the start of 2024, the use of LFR has led to the arrest of more than 1,300 individuals.
It also highlighted public attitude surveys commissioned by the Mayor’s Office for Policing and Crime in the final quarter of 2024, which showed that 84% of those surveyed supported the use of the technology specifically to identify serious and violent criminals, locate those wanted by the courts, and locate those who pose a risk to themselves.
Responding to the paper, the SWP said that LFR testing was conducted independently by the NPL.