ComputerWeekly

Department for Transport shows how its AI system avoids bias


The UK Department for Transport (DfT) has worked with Google Cloud and the Alan Turing Institute to build the Consultation Analysis Tool (CAT) to analyse citizen feedback from public consultations.

A report published in December 2025 by the Alan Turing Institute notes that the project is part of DfT’s goal to use artificial intelligence (AI) tools to deliver greater efficiency in the department. CAT provides thematic analysis of public consultation feedback, where free text from citizen submissions is mapped onto particular themes using large language models (LLMs).

The report’s authors point out that although it is relatively easy to use LLMs to conduct thematic analysis, “designing systems that align with human preferences, have an appropriate level of human oversight, and have a robust performance evaluation framework is more complex”.

Among the areas the team focused on is demographic bias. The report states that while CAT does not explicitly use demographic variables in any of the LLM prompts, “an LLM may perform worse on responses that are written in poor English or use socio-culturally specific language such as verbosity or slang”.

Given that citizens self-select to participate in public consultations, the report’s authors said: “We decided it was particularly important to invest scarce human resources into assuring the accuracy and quality of the theme generation step.”

They said that having a human-in-the-loop ensures potential AI errors or misinterpretations are identified, and keeps human judgment central to understanding public input. “Our approach formally integrates human oversight in the theme review step and at the analysis and report-writing stage, where users interrogate the CAT-enabled analysis and select representative quotations,” they added.

The CAT uses an LLM pipeline to map each individual response provided in a public consultation to a human-validated theme. The mapping process uses what is known as a majority-vote system, where different LLMs are each asked to classify a given consultation response against the themes. A theme is assigned to a response only if a majority of the LLMs agree on the same classification. This is often referred to as LLM-as-a-judge. According to the report’s authors, the technique creates a comprehensive mapping between responses and themes.
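The majority-vote step can be illustrated with a minimal Python sketch. It is not the CAT’s actual code: the function name `majority_vote_theme`, the `keyword_stub` helper and the example themes are all hypothetical, and simple keyword matchers stand in for the LLM calls. It also assumes each classifier returns a single theme per response, which the report may not require.

```python
from collections import Counter
from typing import Callable, Optional, Sequence

# Hypothetical classifier type: takes a response and the candidate themes
# and returns one theme label. In a real pipeline this would wrap an LLM call.
Classifier = Callable[[str, Sequence[str]], str]


def majority_vote_theme(
    response: str,
    themes: Sequence[str],
    classifiers: Sequence[Classifier],
) -> Optional[str]:
    """Return the theme chosen by a strict majority of classifiers, else None."""
    if not classifiers:
        raise ValueError("at least one classifier is required")

    # Collect one vote per classifier, ignoring labels outside the theme list.
    votes = Counter(
        label
        for label in (clf(response, themes) for clf in classifiers)
        if label in themes
    )
    if not votes:
        return None

    theme, count = votes.most_common(1)[0]
    # A strict majority means more than half of all classifiers agree.
    return theme if count > len(classifiers) / 2 else None


def keyword_stub(keyword: str, theme: str, fallback: str) -> Classifier:
    """Build a toy classifier (assumption: stands in for an LLM)."""
    def classify(response: str, themes: Sequence[str]) -> str:
        return theme if keyword in response.lower() else fallback
    return classify


themes = ["Cost", "Safety", "Other"]
classifiers = [
    keyword_stub("fare", "Cost", "Other"),
    keyword_stub("expensive", "Cost", "Other"),
    keyword_stub("danger", "Safety", "Other"),
]

print(majority_vote_theme("Fares are too expensive", themes, classifiers))
```

In this sketch, a response on which the classifiers split evenly gets no theme at all, so disagreement between models surfaces as an unassigned response rather than as a possibly wrong label.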

While the report states that the CAT was systematically less accurate at mapping themes to responses for specific demographic groups, it also notes that the tool’s design includes several safeguards to mitigate bias, including the exclusion of demographic variables from prompts and human-in-the-loop review of all CAT-generated themes.

The report’s authors said: “The human-in-the-loop theme review process ensures that the probability of extracting all ‘true’ main themes within the dataset approaches 100% with human review, which is how the CAT is used in practice.”

CAT is built on Google’s Vertex AI platform and uses Gemini models. According to DfT, it is capable of identifying and categorising themes from public feedback in just a few hours – a process that previously often took months. It has already been used to support the analysis of public responses to consultations on the Integrated National Transport Strategy and on improving driving test booking rules.


