The use of large language models (LLMs) as an alternative to search engines and recommendation algorithms is increasing, but early research suggests there is still a high degree of inconsistency and bias in the results these models produce. This has real-world consequences, as LLMs play a greater role in our decision-making.
Making sense of algorithmic recommendations is tough. In the past, entire industries were dedicated to understanding (and gaming) the results of search engines – but the complexity of what goes into our online recommendations has grown several times over in just a few years. The sheer diversity of use cases for LLMs has made audits of individual applications vital in tackling bias and inaccuracies.
Scientists, governments and civil society are scrambling to make sense of what these models are spitting out. A group of researchers at the Complexity Science Hub in Vienna has been looking at one area in particular where these models are being used: identifying scholarly experts. Specifically, these researchers were interested in which scientists are being recommended by these models – and which were not.
Lisette Espín-Noboa, a computer scientist working on the project, had been looking into this before major LLMs had hit the market: “In 2021, I was organising a workshop, and I wanted to come up with a list of keynote speakers.” First, she went to Google Scholar, an open-access database of scientists and their publications. “[Google Scholar] rank them by citations – but for several reasons, citations are biased.”
This meant trawling through pages and pages of male scientists. Some fields of science are simply more popular than others, with researchers having more influence purely due to the size of their discipline. Another issue is that older scientists – and older pieces of research – will naturally have more citations simply for having been around longer, rather than for the novelty of their findings.
“It’s often biased towards men,” Espín-Noboa points out. Even with more women entering the profession, most scientific disciplines have been male-dominated for decades.
Daniele Barolo, another researcher at the Complexity Science Hub, describes this as an example of the Matthew Effect. “If you sort the authors only by citation counts, it’s more likely they will be read and therefore cited, and this will create a reinforcement loop,” he explains. In other words, the rich get richer.
Espín-Noboa continues: “Then I thought, why don’t I use LLMs?” These tools could also fill in the gaps by including scientists that aren’t on Google Scholar.
But first, the researchers would have to understand whether these tools were actually an improvement. “We started doing these audits because we wanted to know how much they knew about people, [and] if they were biased towards men or not,” Espín-Noboa says. They also wanted to see how accurate the tools were and whether they displayed any biases based on ethnicity.
Auditing
They came up with an experiment which would test the recommendations given by LLMs along various lines, narrowing their requests to scientists published in the journals of the American Physical Society. They asked the LLMs for various recommendations, such as naming the most important scientists in certain fields or identifying experts from particular periods of time.
While they couldn’t test for the absolute influence of a scientist – no such “ground truth” for this exists – the experiment did surface some interesting findings. Their paper, which is currently available as a preprint, suggests Asian scientists are significantly underrepresented in the recommendations provided by LLMs, and that existing biases against female authors are often replicated.
Despite detailed instructions, in some cases these models would hallucinate the names of scientists, particularly when asked for large lists of recommendations, and would not always be able to differentiate between varying fields of expertise.
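A check for this kind of hallucination can be fairly mechanical. The sketch below is purely illustrative – the names, the normalisation step and the reference list are invented, and this is not the researchers’ actual code – but it shows one way a model’s recommended names might be compared against a list of known authors:

```python
import unicodedata

def normalise(name: str) -> str:
    """Lower-case a name and strip accents so variants compare equal."""
    stripped = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    return " ".join(stripped.lower().split())

def check_recommendations(recommended: list[str], known_authors: set[str]) -> dict:
    """Split a model's recommended names into verified and possibly hallucinated."""
    known = {normalise(a) for a in known_authors}
    verified = [n for n in recommended if normalise(n) in known]
    suspect = [n for n in recommended if normalise(n) not in known]
    return {"verified": verified, "possible_hallucinations": suspect}

# Example with made-up data:
result = check_recommendations(
    recommended=["Jane Doe", "A. Nonexistent"],
    known_authors={"Jane Doe", "John Roe"},
)
print(result["possible_hallucinations"])  # ['A. Nonexistent']
```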
“LLMs cannot be seen directly as databases, because they are linguistic models,” Barolo says.
One test was to prompt the LLM with the name of a scientist and ask it for someone with a similar academic profile – a “statistical twin”. But when they did this, “not only scientists that actually work in a similar field were recommended, but also people with a similar-looking name”, adds Barolo.
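One crude way to spot that confound is to compare the recommended names to the query name with a simple string-similarity measure. The sketch below uses only Python’s standard library; the threshold and the example names are made up for illustration:

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Crude string similarity between two names, between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def flag_name_driven_twins(query: str, twins: list[str], threshold: float = 0.7) -> list[str]:
    """Return recommended 'twins' whose names look suspiciously like the query name."""
    return [t for t in twins if name_similarity(query, t) >= threshold]

# Example with invented names:
print(flag_name_driven_twins("Maria Rossi", ["Mario Rossi", "Wei Zhang"]))
# ['Mario Rossi']
```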
As with all experiments, there are certain limitations: for a start, this study was only conducted on open-weight models. These have a degree of transparency, although not as much as fully open-source models: users can set certain parameters and fine-tune the models to adjust their outputs. By contrast, most of the largest foundation models are closed-weight, with minimal transparency and little opportunity for customisation.
But even open-weight models come up against issues. “You don’t know completely how the training process was conducted and which training data was used,” Barolo points out.
The research was conducted on versions of Meta’s Llama models, Google’s Gemma (a more lightweight model than their flagship Gemini) and a model from Mistral. Each of these has already been superseded by newer models – a perennial problem for carrying out research on LLMs, as the academic pipeline cannot move as quickly as industry.
Aside from the time needed to execute research itself, papers can be held up for months or years in review. On top of this, a lack of transparency and the ever-changing nature of these models can create difficulties in reproducing results, which is a crucial step in the scientific process.
An improvement?
Espín-Noboa has previously worked on auditing simpler, lower-tech ranking algorithms. In 2022, she published a paper analysing the impacts of PageRank – the algorithm which arguably gave Google its big breakthrough in the late 1990s. It has since been used by LinkedIn, Twitter and Google Scholar.
PageRank was designed to score an item based on the links it receives within a network, giving more weight to links from items that are themselves highly ranked. In the case of webpages, this means counting how many sites link to a given site; for scholars, a similar calculation can be made over co-authorship networks.
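As a rough illustration of that calculation, the toy example below runs PageRank over a tiny, invented co-authorship graph using the networkx library; it is not the setup used in the research described here, just a sketch of the idea:

```python
import networkx as nx

# Each edge means two (invented) scientists have co-authored at least one paper.
G = nx.Graph()
G.add_edges_from([
    ("Ada", "Ben"), ("Ada", "Chen"), ("Ben", "Chen"),
    ("Chen", "Dana"), ("Dana", "Elif"),
])

# 0.85 is the conventional damping factor from the original PageRank paper.
scores = nx.pagerank(G, alpha=0.85)
for author, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{author}: {score:.3f}")
```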
Espín-Noboa’s research shows the algorithm has its own problems – it may serve to disadvantage minority groups. Even so, PageRank is at least fundamentally designed with ranking and recommendation in mind.
In contrast, “LLMs are not ranking algorithms – they do not understand what a ranking is right now”, says Espín-Noboa. Instead, LLMs are probabilistic – making a best guess at a correct answer by weighing up word probabilities. Espín-Noboa still sees promise in them, but says they are not up to scratch as things stand.
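To make that concrete, the toy example below (with invented words and scores) shows how raw model scores for candidate next words are turned into probabilities: the model is predicting plausible text, not ranking scientists by merit.

```python
import math

# Invented raw scores (logits) for a few candidate next words.
logits = {"Einstein": 4.1, "Curie": 3.8, "Feynman": 3.2, "Smith": 1.0}

# Softmax: turn the scores into probabilities that sum to 1.
total = sum(math.exp(v) for v in logits.values())
probs = {word: math.exp(v) / total for word, v in logits.items()}

for word, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word}: {p:.2f}")
```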
There is also a practical component to this research, as these researchers hope to ultimately create a way for people to better seek recommendations.
“Our final goal is to have a tool that a user can interact with easily using natural language,” says Barolo. This will be tailored to the needs of the user, allowing them to pick which issues are important to them.
“We believe that agency should be on the user, not on the LLM,” says Espín-Noboa. She uses the example of Google’s Gemini image generator overcorrecting for biases – depicting American founding fathers (and Nazi soldiers) as people of colour after one update – which led the company to temporarily suspend the feature.
Instead of having tech companies and programmers make sweeping decisions on the model’s output, users should be able to pick the issues most important to them.
The bigger picture
Research such as that going on at the Complexity Science Hub is happening across Europe and the world, as scientists race to understand how these new technologies are affecting our lives.
Academia has a “really important role to play”, says Lara Groves, a senior researcher at the Ada Lovelace Institute. Having studied how audits are taking place in various contexts, Groves says academic communities – such as those around the annual FAccT conference on fairness, accountability and transparency – are “setting the terms of engagement” for audits.
Even without full access to training data and the algorithms these tools are built on, academia has “built up the evidence base for how, why and when you might do these audits”. But she warns these efforts can be hampered by the level of access that researchers are granted, as they are often only able to examine the models’ outputs.
Despite this, she would like to see more assessments taking place “at the foundation model layer”. Groves continues: “These systems are highly stochastic and highly dynamic, so it’s impossible to tell the range of outputs upstream.” In other words, the massive variability of what LLMs are producing means we ought to be checking under the hood before we start looking at their use cases.
Other industries – such as aviation or cyber security – already have rigorous processes for auditing. “It’s not like we’re working from first principles or from nothing. It’s identifying which of those mechanisms and approaches are analogous to AI,” Groves adds.
Amid an arms race for AI supremacy, any testing done by the major players is closely guarded. There have been occasional moments of openness: in August, OpenAI and Anthropic carried out audits on each other’s models and released their findings to the public.
Much of the work of interrogating LLMs will still fall to those outside of the tent. Methodical, independent research might allow us to glimpse into what’s driving these tools, and maybe even reshape them for the better.