In 2019, a group of researchers at AI Sweden received funding from the Swedish Innovation Agency (Vinnova) for a project called Language model for Swedish Authorities. The goal was to produce language models that could be used primarily by the public sector and made available for use by the private sector.
A language model is a machine learning model that learns language in order to solve language processing tasks. A foundation language model is a large model that has been trained on huge amounts of data and has general capabilities that can be applied to a wide range of language processing tasks. It has what are known as zero-shot learning capabilities, which means the linguistic capabilities of the model can be used to solve new tasks it has never been explicitly trained on.
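To illustrate what zero-shot use looks like in practice, here is a minimal sketch using the Hugging Face transformers library; the model identifier and the Swedish sentiment prompt are purely illustrative placeholders, not an actual AI Sweden release:

```python
# Minimal sketch of zero-shot use: the task is described in the prompt,
# with no task-specific training. The model name below is a placeholder.
from transformers import pipeline

generator = pipeline("text-generation", model="some-org/swedish-foundation-model")

prompt = (
    "Klassificera känslan i följande mening som positiv eller negativ.\n"
    "Mening: Tjänsten var snabb och personalen mycket hjälpsam.\n"
    "Känsla:"
)
print(generator(prompt, max_new_tokens=5)[0]["generated_text"])
```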
Swedish researchers had already been working on language models for several years. Very early on, the researchers thought about which sectors of society would be the fastest to take up this type of technology. They landed on the idea that it would be the public sector in Sweden because that’s where you find the most prominent users of text data in Swedish, with most companies in the private sector relying much more on English-language text data.
“We needed models we could work on to do research on and modify to suit the needs of Swedish society,” said Magnus Sahlgren, head of research in Natural Language Understanding (NLU) at AI Sweden – and former heavy metal guitarist. “The foundation models from Google, for example, are not publicly accessible. That’s one big reason we are building our own.”
But another reason for building language models has to do with sovereignty. Foundation models are essential components of many language applications. A country could be vulnerable if it depends too heavily on the private sector for such a fundamental resource – especially when the private companies are based outside Sweden. To close this gap, the research team decided to develop their own models for Swedish.
Along came GPT-3
About a year into the project, GPT-3 was released, causing huge disruption in the field of natural language processing (NLP). This was the largest language model the world had ever seen, with 175 billion parameters. All machine learning models can be thought of as a series of linear algebra equations, with coefficients, or weights, that can be modified to produce an output given a certain set of inputs. The number of weights that can be tweaked in a model is often referred to as the number of parameters.
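In code, the parameter count is simply the total number of trainable weights. A toy sketch in PyTorch (the two-layer network is purely illustrative):

```python
# Sketch: the "number of parameters" is the count of trainable weights.
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(768, 3072),  # 768 x 3072 weights plus 3,072 biases
    nn.ReLU(),
    nn.Linear(3072, 768),  # 3072 x 768 weights plus 768 biases
)

num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"{num_params:,} parameters")  # 4,722,432 for this toy network
```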
Inspired by GPT-3, the researchers at AI Sweden, who had already been working on language models, started thinking about how they could accomplish something like GPT-3 in a small country. They put together a consortium of different organisations that could help build foundation models. The consortium included the Research Institutes of Sweden (RISE) and the Wallenberg AI, Autonomous Systems and Software programme.
Through the association with Wallenberg, the consortium gained access to the Swedish supercomputer Berzelius, which was specifically designed to help solve AI problems. The consortium also works closely with NVIDIA, which provides the hardware and software to power the models.
“The ultimate goal of our research project – and now of the consortium – is to determine whether home-grown language models can provide value in Sweden,” said Sahlgren. “We are completely open to a negative answer. It might prove to be the case that our resources are too limited to build foundation models.”
The challenges of running a large project
The new goal meant the team had to learn how to run large-scale projects. They also had to decide which type of data to use and how to process it to build a basic linguistic foundation. And, very importantly, they had to figure out how to make the best use of the supercomputer they have access to.
“We want to use the computer resources in an optimal way to arrive at an optimal model,” said Sahlgren. “We’ve never done this and neither has anybody else – not for the Swedish language. So, we must learn by doing, which means we will iterate several times and produce more than one version of our model.
“We have trained models of various sizes, ranging from 126 million parameters up to our largest model with 40 billion parameters. The model is a text-only model. Other groups in other parts of the world are starting to integrate other modalities, including images and speech.”
Berzelius, at Linköping University, is by far the most powerful computer in Sweden, and the only supercomputer in the country dedicated to AI. Because of the high demand, AI Sweden cannot get access to the full cluster; it has been given a third of it, on which training the largest models takes two to three months.
But the main bottleneck for the Swedish researchers is data. Because Swedish has a limited number of speakers, there isn’t much online text in the language. The researchers worked around this problem by taking advantage of the fact that Swedish is typologically similar to the other languages in the North Germanic family. By combining data in Swedish, Norwegian, Danish, and Icelandic, they gain access to sizeable amounts of text from open data collections online.
“We used derivatives of Common Crawl, for example, and other published datasets, such as the Norwegian Colossal Corpus and OPUS,” said Sahlgren. “We collected all those datasets, and then we also took some high-quality datasets in English. We did that because we’re interested in seeing if we can benefit from transfer learning effects from the English data to the Swedish and Norwegian languages. We are already starting to see those types of effects with our models.”
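A rough sketch of how such a multilingual training mix could be assembled with the Hugging Face datasets library – the dataset identifiers, column name and sampling weights here are placeholders, not the consortium’s actual recipe:

```python
# Sketch: interleaving Nordic and English corpora into one training stream.
# Dataset identifiers and sampling probabilities are illustrative only.
from datasets import load_dataset, interleave_datasets

swedish = load_dataset("some-org/swedish-web-text", split="train", streaming=True)
norwegian = load_dataset("some-org/norwegian-colossal", split="train", streaming=True)
danish = load_dataset("some-org/danish-web-text", split="train", streaming=True)
english = load_dataset("some-org/english-high-quality", split="train", streaming=True)

# Sample more heavily from the target languages, less from English.
mixed = interleave_datasets(
    [swedish, norwegian, danish, english],
    probabilities=[0.4, 0.25, 0.15, 0.2],
    seed=42,
)

for example in mixed.take(3):
    print(example["text"][:80])
```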
An example of transfer learning is training AI Sweden’s models to summarise documents using English data consisting of documents paired with their summaries. The Swedish researchers are hoping their model will learn the general competence of summarising text from the English data.
Another example of transfer effects is training models on the general task of translation. “You can train on a couple of language pairs and then all of a sudden your machine translation system will be able to translate between pairs that you haven’t had any training data for,” said Sahlgren. “It’s an established effect in the field that no one really understands.
“We use a form of supervised learning. The only training objective is to try to predict the next word. We feed it all this text, and for every word it sees, it tries to predict the next word. It has a specific context window – in our case, I think, a few thousand tokens – and that’s quite a long context when it tries to predict the next word.”
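The objective Sahlgren describes – predict the next token at every position within a fixed context window – looks roughly like this in code. This is a simplified sketch assuming a Hugging Face-style causal language model, not the project’s actual training loop:

```python
# Sketch of the next-token-prediction objective for a causal language model.
import torch.nn.functional as F

def next_token_loss(model, input_ids):
    """input_ids: (batch, seq_len) token ids; seq_len is at most the context window."""
    logits = model(input_ids).logits        # (batch, seq_len, vocab_size)
    # At position i the model predicts token i+1, so shift logits and labels by one.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )
```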
There are initiatives in other parts of Europe for training models on other languages and language families. All the projects have the same challenges, including getting access to data, handling the data once you have it, and initialising the model.
AI Sweden trains its models from scratch, meaning researchers train a completely empty model on the organisation’s own data. But an existing model can also be used as a starting point and then trained further on specific data – for example, AI Sweden’s Nordic model could be used as a basis for training a model that is specifically Icelandic.
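Continuing training from an existing checkpoint rather than from scratch might look something like the sketch below, using the Hugging Face Trainer; the model and dataset identifiers are hypothetical placeholders rather than actual AI Sweden releases:

```python
# Sketch: continued pretraining of an existing Nordic model on Icelandic text.
# Model and dataset identifiers are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "some-org/nordic-foundation-model"
tokenizer = AutoTokenizer.from_pretrained(base)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)  # start from trained weights, not from scratch

icelandic = load_dataset("some-org/icelandic-text", split="train")
tokenized = icelandic.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=2048),
    batched=True,
    remove_columns=icelandic.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="icelandic-continued", num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```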
The consortium started training its model six months ago and has so far produced five versions, which are available on Hugging Face. But it doesn’t stop there: the team has new architectures and new ideas for the next few generations of language models, which will include a multimodal language model.
A matter of investment
Now would not be a good time for Sahlgren to dust off his guitar and get the heavy metal band back together. There’s just too much to do in NLP – right now and for the foreseeable future. This is evidenced by how much major tech players are investing in it.
Microsoft, for example, is investing $10bn in OpenAI, the maker of ChatGPT, and is already putting GPT functionality into its production systems, such as the Office suite and Teams. Microsoft and other large tech companies are putting this much money into NLP because they see the commercial value.
Sweden is trying a similar approach, but on a smaller scale. The number of Swedish speakers is much smaller than the number of English speakers, and the computing power available to train and run language models in Sweden is also much smaller. But researchers are already working on ways of making the model available to application developers.
“Currently, we release the models openly, and the current models can be hosted locally by anyone with access to powerful GPUs,” said Sahlgren. “Most organisations probably do not have that resource, and it will get even more challenging over time. The largest models will require a substantial amount of hardware to run.”
Running language models takes less computing power than training them, but it still requires substantial processing – for example, two or three nodes on Berzelius. AI Sweden is exploring the idea of creating a Swedish national infrastructure for hosting Swedish foundation models. Using public resources would help bolster sovereignty – at least for the time being.
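For organisations that do have the GPUs, hosting a released checkpoint locally could look roughly like this – an illustrative sketch in which the model identifier is a placeholder, and where the largest models would need their weights spread across several GPUs:

```python
# Sketch: running a released checkpoint locally on available GPUs.
# The model identifier is a placeholder; large checkpoints need several GPUs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "some-org/swedish-foundation-model"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision roughly halves the memory footprint
    device_map="auto",          # spread the weights across available GPUs (requires accelerate)
)

prompt = "Skriv en kort sammanfattning av vad en språkmodell är:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```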
“We haven’t yet figured out a good solution for hosting these models in Sweden,” said Sahlgren. “You need a major player that can put investments into this. It’s going to require a dedicated datacentre to run and serve the very large models. You need to have machine learning operations and personnel that work on the supercomputers and, currently, there is no organisation in Sweden that can do that.”
Just how intelligent are the language models?
As the general public explores the power of ChatGPT, the question often comes up about how intelligent the language models really are. “I may be a little strange,” said Sahlgren, “but I think language models do really understand language. What I mean is that language models can at least seemingly handle the linguistic signal in exactly the same way as we do.
“The current language models can handle all kinds of language processing tasks. Currently, when we try to evaluate these models, they are on par with humans on the test sets we use, and they also exhibit emergent phenomena, such as creativity – they can produce text that has never been produced before.”
The idea isn’t exactly new. In the 1960s, a program called Eliza was developed to pose as a psychotherapist – but that was the only thing it could do. It generated a lot of interest for a short time, but people quickly caught on to the lack of real humanity behind it.
Natural language processing and natural language understanding have come a long way since the 1960s – and the rate of change has picked up recently. Stanford Business School researcher Michal Kosinski published a provocative working paper in March 2023, claiming that a series of breakthroughs had occurred in recent years with successive versions of GPT.
The breakthroughs can be measured by theory of mind tests – tests that indicate whether a person (or machine) recognises that other people (or machines) may hold beliefs and mental states different from their own. The paper is called Theory of mind may have spontaneously emerged in large language models.
According to Kosinski, language models released prior to 2020 showed virtually no ability to solve theory of mind tasks, but successive models have scored progressively better. The most recent version, GPT-4, released in March 2023, solved 95% of the theory of mind tasks – a performance at the level of a seven-year-old child.