Dutch researcher’s AI breakthrough tackles the structured data paradox

Organisations sit on vast quantities of structured data in relational databases and spreadsheets. It’s organised and searchable, yet when it comes to extracting insights, we barely scratch the surface.

“We don’t know what we don’t know,” says Madelon Hulsebos, researcher at Centrum Wiskunde & Informatica (CWI), the Netherlands’ national research institute for mathematics and computer science.

Hulsebos began her career as a data scientist and noticed that highly paid specialists repeatedly performed the same manual tasks: cleaning tables, extracting features and linking datasets.

During her PhD at the University of Amsterdam and postdoctoral research at the University of California, Berkeley, she developed “table representation learning” – enabling artificial intelligence (AI) to understand what tables mean rather than simply searching them. She now leads the Table Representation Learning Lab at CWI, working on this challenge with three PhD students, two postdocs and six master’s students.

“As a data scientist, I experienced how incredibly difficult and frustrating it is to find relevant datasets, for instance, to train machine learning models,” says Hulsebos.

Much of the data exists but sits scattered or buried deep in large, complex tables.

Using funding including an NWO AiNed Fellowship Grant – a National Growth Fund programme to attract and retain top AI researchers at Dutch universities and research institutes – she established the CWI lab with the goal of democratising insights from structured data. “The aim is essentially that, based on questions people have – business users, analysts – we can automatically retrieve the relevant data across different systems and provide answers,” says Hulsebos.

Information to insight

The project for which Hulsebos received the grant is called DataLibra, which runs from 2024 to 2029. Over those five years, the researcher and her team aim not only to gain insights, but also to build concrete tools that organisations can use to extract more value from their data.

“It should be as simple to query data within your organisation as it is to perform a Google search,” she says. “AI can play a major role here because it enables the use of natural language instead of requiring people to have knowledge of programming, business intelligence and relational databases.”

That AI can play a role here may seem counterintuitive. For years, AI has been positioned as the solution for unstructured data such as text, images and video, while structured data in tables was supposedly easy to search. But the problem isn’t the structure itself, says Hulsebos, but its diversity.

Each system uses different column names and logic, causing traditional methods such as SQL and pattern matching to fall short. “You need to understand what columns mean, not just what they’re called,” she adds. “And that’s where machine learning excels, because it can generalise and understand context.”
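
To make the idea concrete, here is a minimal sketch of comparing columns by meaning rather than by name: each column’s name and a few sample values are serialised into a string, embedded with an off-the-shelf sentence-embedding model, and compared by cosine similarity. It illustrates the general technique, not the lab’s own code; the model choice and the `column_profile` helper are assumptions made for the example.

```python
# A minimal sketch (not the lab's actual code) of comparing columns by
# meaning rather than name, using off-the-shelf sentence embeddings.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def column_profile(name: str, sample_values: list) -> str:
    """Serialise a column's name plus a few sample values into one string."""
    return f"{name}: {', '.join(map(str, sample_values))}"

# Two columns that a string comparison of headers would never link.
col_a = column_profile("cust_nm", ["Jane Doe", "Ali Khan", "Mei Wu"])
col_b = column_profile("customer_full_name", ["John Smith", "Sara Lee"])

emb_a, emb_b = model.encode([col_a, col_b])
print(f"semantic similarity: {util.cos_sim(emb_a, emb_b).item():.2f}")
# High similarity despite the very different column names.
```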

Retrieving the right dataset is only the beginning. “We call that information retrieval, but we want to move towards insight retrieval,” says Hulsebos. “Once you’ve found the relevant tables, you often still need to combine, link or process them before you can extract an insight.”

That makes the challenge more complex than simple searching. At the same time, she emphasises that full automation isn’t the goal. “Nobody can simply trust an insight,” she says. “You must always be able to explain why an answer is the right answer for that specific question. Transparency and iteration are crucial in that regard.”

Automating data science

When asked how table representation learning differs from traditional business intelligence, Hulsebos responds: “Data scientists do more than traditional BI [business intelligence] tasks such as reports and dashboards; they also train machine learning models. Our goal is also to develop tools to automate repetitive, everyday tasks such as data cleaning, validation or data transformation.”

It’s often said that data science is 80% data work and 20% modelling. “We want to automate that 80% as much as possible, so data scientists can focus on the other part where they think about critical aspects of problems, such as ethical questions,” she says.

Beyond that, Hulsebos wants to give all non-data scientists more capabilities. “And this does indeed touch on business intelligence, but at present, it still takes considerable time and money to do it yourself, because you still need someone who builds dashboards and understands what the real insight need is,” she says.

“But often the person with a problem doesn’t see which data might help. And the person who manages the data doesn’t understand the problem. That gap is the issue. By ensuring that relational databases can be queried in plain language without requiring knowledge of SQL or underlying data structures, you can already generate far more insights.”
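
The sketch below illustrates the kind of pipeline this implies, assuming a generic LLM call; it is not DataLibra itself. The model is shown the database schema so it can turn a plain-language question into SQL, and the generated query is returned alongside the results so the answer can be checked. The `ask_llm` function is a placeholder, not a real API.

```python
# A minimal sketch, assuming a generic LLM call; not DataLibra itself.
import sqlite3

def ask_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call to any LLM provider."""
    raise NotImplementedError("wire up an LLM provider here")

def answer_question(db_path: str, question: str) -> tuple[str, list]:
    conn = sqlite3.connect(db_path)
    # Show the model the schema so it can reason about what columns mean.
    schema = "\n".join(
        row[0] for row in
        conn.execute("SELECT sql FROM sqlite_master WHERE type = 'table'")
        if row[0]
    )
    sql = ask_llm(f"Schema:\n{schema}\n\n"
                  f"Write one SQLite query that answers: {question}")
    rows = conn.execute(sql).fetchall()
    # Return the generated SQL alongside the rows, so users can check
    # why an answer is the right answer, the transparency stressed above.
    return sql, rows
```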

Many software suppliers currently claim to have such AI features in their products, but Hulsebos remains unimpressed. “It’s very easy to build something that doesn’t necessarily always work well,” she says. “There are plenty of fancy demos of agentic data scientists or analysts, but I’ve examined the benchmarks and the success rate is often zero. It all sounds wonderful, but to actually get there, we still have much work to do.”

Hulsebos emphasises the importance of robustness and transparency in systems. “You can ask an LLM [large language model] a question and it will always provide an answer, but it must also be able to convince you that it’s the right answer,” she says. “That transparency and context are necessary for adoption.”

Context determines data sensitivity

Precisely that transparency and context proved crucial in a project Hulsebos recently conducted for the United Nations (UN). It illustrates not only why existing tools fall short, but also what’s needed to make table representation learning work in practice.

The collaboration came about when Hulsebos, once she had moved into academia, approached the Humanitarian Data Centre. “The humanitarian aid aspect really drives me,” she says. “I saw that from my position I could achieve societal impact by collaborating with the UN on scientific research questions.”

The first joint project focused on detecting sensitive data, a challenge that directly connects to her earlier research at the Massachusetts Institute of Technology into what tables mean. The Humanitarian Data Centre supports local organisations providing aid during conflicts, natural disasters and other crises. Via its Humanitarian Data Exchange platform, these organisations share datasets that others can use for planning and coordination.

“The problem is that much of that data comes from conflict zones and contains extremely sensitive information,” says Hulsebos. “But what’s sensitive here differs fundamentally from what many current systems classify as ‘sensitive’. They typically focus on personal data such as names and addresses, but here we look further, namely at data that can be dangerous in a specific context. Consider, for example, detailed coordinates of hospitals in conflict zones. Those could enable new attacks. You want to filter out such datasets before they become publicly accessible.”

Together with master’s student Liang Telkamp, Hulsebos developed two mechanisms to tackle this. The first mechanism incorporates the full data context in its reasoning, dramatically reducing false positives. “Existing tools detect an address and conclude it’s sensitive,” she says. “But a company address may be perfectly public – not sensitive. You need to look at the context in which something is mentioned, not just the data type.”

The second mechanism – “retrieve then detect” – links datasets to relevant policies and protocols applicable at that moment. “When a conflict breaks out somewhere, what’s sensitive changes,” says Hulsebos. “Your system must be able to retrieve that new context and incorporate it into its assessment.”
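
Based only on that description, a retrieve-then-detect flow might look like the runnable sketch below. It is speculative rather than the UN system: a toy word-overlap score stands in for a real embedding model in the retrieval step, and the detection step, where an LLM would judge sensitivity against the retrieved rules, is reduced to returning those rules for review.

```python
# A speculative sketch of "retrieve then detect", based only on the
# description above; not the actual UN/CWI implementation. A toy
# word-overlap score stands in for an embedding model so it runs as is.

def overlap(a: str, b: str) -> float:
    """Toy relevance score: fraction of shared words."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def retrieve_then_detect(dataset_summary: str, situation: str,
                         protocol_passages: list[str], k: int = 2) -> list[str]:
    # 1. Retrieve: pull the protocol rules most relevant to this dataset
    #    and to the situation right now, so the assessment stays current.
    query = f"{dataset_summary} {situation}"
    rules = sorted(protocol_passages,
                   key=lambda p: overlap(p, query), reverse=True)[:k]
    # 2. Detect: a real system would now ask an LLM to judge sensitivity
    #    in context, citing these rules; here we return them for review.
    return rules

print(retrieve_then_detect(
    "table of hospital names and coordinates in a conflict zone",
    "active conflict",
    ["Precise facility coordinates in conflict zones must not be shared.",
     "Aggregate national statistics may be published openly."])[0])
```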

That dynamic approach proves essential. A dataset about hospitals in the Netherlands requires a different assessment than the same data from Gaza. “It’s not only situational, but also time-dependent,” she says. “Information that wasn’t sensitive five years ago might suddenly be so now. You must be able to reason about the context in which data is used.”

The results demonstrate that the approach works, particularly for detecting personal information, but the system also proves valuable for situationally sensitive data. “The Quality Assessment Officers at the UN found the contextualised explanations from the LLMs enormously useful,” says Hulsebos. “Those information sharing protocols are extremely long documents. The fact that the system extracts the relevant rules and explains why something is sensitive was already highly insightful for them.”

Telkamp’s work – she is now at the UN, working on the integration – was recently awarded the Amsterdam AI Thesis Award, partly for its societal impact.

Making data insights more broadly accessible

The UN project illustrates a specific problem, but the underlying challenge – how to make data accessible and comprehensible – plays out in every organisation. Understanding data sensitivities in an organisation’s context is always useful, says Hulsebos. Moreover, it’s important to realise that LLMs are trained on all sorts of datasets scraped from the internet, including data sharing portals.

“It’s so important to ensure that no sensitive data ends up on those portals, because once it’s in those models’ training data, it doesn’t come out,” she says.

But organisations also fail to fully utilise the data they collect. “We don’t know what we don’t know,” says Hulsebos. “People ask questions about things they already know the data exists for. But how many insights are you missing because you don’t know certain data even exists? Or because you don’t know which datasets you should combine to get an answer?”

She therefore wants to make visible what people don’t yet know about their data and make access to data and insights more broadly available in organisations. “For a CEO, it’s extremely useful when everyone within their organisation has direct access to insights that help them make important decisions,” says Hulsebos.

She describes first having to mobilise the data science or business intelligence department as “a barrier for someone in sales, logistics or finance to quickly ask an important question”.

“By the time a BI dashboard or SQL query is delivered, the insight is no longer relevant,” says Hulsebos.

That requires AI-powered systems that democratise insights from structured data, enabling people to act and decide directly. “Speed to insight is the key factor,” she adds.

Concrete solutions for business are in development. One of her PhD students is building tools to automate the retrieval aspect and support structured query language generation. “We’re making all those tools available as open source,” says Hulsebos. “We’re trying to make things genuinely usable, not just publish them. Within the next two months, first versions will be available.”

One example is DataScout, a tool she developed during her time at the University of California, Berkeley. The system helps users find datasets based on their task or problem, rather than keywords. “Task-based search with LLMs that think proactively proves enormously useful,” says Hulsebos.
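
As a rough illustration of what task-based search could look like in miniature (and explicitly not DataScout’s actual design or API), the sketch below first lets an LLM reason about which kinds of data a task would need, then ranks catalogue entries against those inferred needs. The `ask_llm` function is again a placeholder for any chat-completion call.

```python
# A speculative miniature of task-based dataset search; not DataScout's
# actual design or API. `ask_llm` is a placeholder for any LLM call.

def ask_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call to any LLM provider."""
    raise NotImplementedError("wire up an LLM provider here")

def task_based_search(task: str, catalogue: dict[str, str], top_n: int = 5):
    """catalogue maps dataset names to free-text descriptions."""
    # Instead of matching the user's keywords, first let the model reason
    # proactively about which kinds of data the task itself would need.
    needs = ask_llm(f"A user wants to: {task}\n"
                    "List the kinds of datasets that would help, one per line.")
    # Then rank the catalogue entries against those inferred needs.
    listing = "\n".join(f"{name}: {desc}" for name, desc in catalogue.items())
    ranked = ask_llm(f"Data needs:\n{needs}\n\nAvailable datasets:\n{listing}\n\n"
                     f"Return the {top_n} most useful dataset names, best first.")
    return ranked.splitlines()[:top_n]
```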

In user studies, DataScout proved faster and more effective than traditional data platforms with keyword search. “As a data scientist, it could easily take two weeks to a month before you’d gathered the right data for a machine learning model,” she says.

That such systems aren’t yet standard in data platforms, whilst they could save weeks of search work, still surprises Hulsebos. “The goal is that everyone in an organisation – from CEO to sales staff – can ask questions of their data directly,” she says. “Without intermediaries, without waiting time.”


