Upgraded Arabic large language model is twice as big


Just under three months after the first release of Jais, the consortium behind what it calls the world’s most powerful large language model (LLM) for Arabic has finished training a second version that is more than twice as big. The first model, based on 13 billion parameters, is now referred to as Jais-13B; the second uses 30 billion parameters and is called Jais-30B.

The consortium is made up of three partners: Core42, a subsidiary of G42 in the United Arab Emirates (UAE); Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), a UAE-based university offering graduate degrees in artificial intelligence (AI); and Cerebras, a California-based company that makes supercomputers specifically designed to accelerate the training of AI models. The partners released the most recent model, Jais-30B, on 8 November 2023.

“Jais-13B was a prototype that allowed us to get feedback from users,” said Andrew Jackson, executive vice-president and chief artificial intelligence officer at Core42. “After its release, we heard from all the different types of organisations that make up the UAE, including the department of health, oil and gas companies, the national airline, banks, government ministries, and the national telco. They looked at the technology and told us what they wanted to use it for.”

Some organisations said they wanted to run the language model on-premise, which would require an enormous amount of infrastructure, given how much processing is needed to run inference on a 30-billion-parameter LLM.  

But the partners found another approach that would allow them to accomplish the same thing: enterprise application suppliers could integrate the model into their software, using application programming interfaces (APIs) to access the power of the large model. 

“We discussed this with Microsoft,” said Jackson. “We’re now working with them to use our model for this region, with our technology natively loaded. We’re working on a whole bunch of use cases right now – everything from investment in finance to climate control. And we expect to close big deals on Jais [soon].”

Additionally, the partners have signed two memorandums of understanding with other organisations on the use of Jais. They expect to spend the first part of 2024 finalising deals and fine-tuning their models for enterprise use. Because the new model is much bigger, it can handle a far wider range of tasks with only light fine-tuning.

Improvements over Jais-13B 

One of Jais-30B’s big improvements over Jais-13B is better training data. The partners found that some of the data they had been using was of poor quality; much of the Arabic-language text on the internet, for example, is the result of poor translation from English. They also found a lot of redundant data, such as multiple copies of the same article on different sites. They removed the low-quality data and used filtering tools to weed out duplicate text so that repeated passages would not be overrepresented in the training data.
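The article does not name the filtering tools the partners used, so the following is only a minimal sketch of one common deduplication technique: comparing word-shingle sets with Jaccard similarity and dropping documents that are near-identical to ones already kept. The threshold and helper names are illustrative.

```python
import re

def shingles(text: str, n: int = 5) -> set[str]:
    """Break a document into word n-grams ('shingles') for fuzzy comparison."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two shingle sets, from 0.0 (disjoint) to 1.0 (equal)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def deduplicate(docs: list[str], threshold: float = 0.8) -> list[str]:
    """Keep each document only if it is not near-identical to one already kept."""
    kept: list[tuple[str, set[str]]] = []
    for doc in docs:
        sh = shingles(doc)
        if all(jaccard(sh, seen) < threshold for _, seen in kept):
            kept.append((doc, sh))
    return [doc for doc, _ in kept]

corpus = [
    "The same news article syndicated on two different sites.",
    "The same news article syndicated on two different sites.",  # duplicate
    "A completely unrelated piece of text about something else.",
]
print(deduplicate(corpus))  # the verbatim duplicate is dropped
```

This pairwise comparison is quadratic in the number of documents; at web-corpus scale, production pipelines typically approximate it with MinHash and locality-sensitive hashing so that candidate duplicates can be found without comparing every pair.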

Finally, the partners knew they needed to find the right sources of data. Books and documents tend to have more reliable information than blog posts. On the other hand, some books and documents are written in a formal style they didn’t want their model to imitate in interactions with users. 

Core42 put significant effort into gathering new data – especially from printed material, which was scanned in and run through an optical character recognition (OCR) system. A team of 10 people were assisted by automation tools from Microsoft. “We’ve now used something like 20,000 books and documents,” said Jackson.
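The article says the OCR work was assisted by Microsoft automation tools, which it does not detail. As a rough illustration of the pipeline shape only, the sketch below substitutes the open-source Tesseract engine (via the pytesseract wrapper); the directory and file names are hypothetical.

```python
from pathlib import Path

import pytesseract     # pip install pytesseract, plus the Tesseract binary
from PIL import Image  # pip install Pillow

def ocr_scanned_pages(scan_dir: str, out_path: str) -> None:
    """Run Arabic OCR over a directory of scanned page images and
    concatenate the recognised text into one training-data file."""
    pages = sorted(Path(scan_dir).glob("*.png"))
    with open(out_path, "w", encoding="utf-8") as out:
        for page in pages:
            # lang="ara" requires the Arabic traineddata to be installed
            text = pytesseract.image_to_string(Image.open(page), lang="ara")
            out.write(text.strip() + "\n\n")

# Hypothetical paths for one scanned book
ocr_scanned_pages("scans/book_0001", "corpus/book_0001.txt")
```

In practice, OCR output from scanned books still needs the same quality filtering as web text, since recognition errors introduce noise of their own.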

The partners also recognised deficiencies in “downstream tasks”, such as summarisation and translation. “We realised that summarisation was not something we did a great job of in the first round, so we put a lot of time and effort into improving those features in Jais-30B,” said Jackson. “Translation wasn’t great either, so we also doubled down on translation for the larger model.”
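Neither company has published its serving setup, but Jais checkpoints are distributed through Hugging Face, so a summarisation request can be sketched with the standard transformers API. The model ID below is an assumption; check the consortium’s model cards for the published name.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "core42/jais-30b-chat-v1"  # assumed ID, verify on Hugging Face

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Jais ships custom modelling code, hence trust_remote_code=True;
# device_map="auto" needs the accelerate package installed
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", trust_remote_code=True
)

# "Summarise the following text:" in Arabic, followed by the document
prompt = "لخص النص التالي:\n..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```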

Jais-30B was trained in under eight weeks, which Cerebras CEO Andrew Feldman says is record time. Training was carried out on Condor Galaxy (CG-1), a supercomputer built from 64 Cerebras CS-2 systems and designed specifically to run machine learning workloads very quickly. Cerebras and Core42 modified the language model to take advantage of the hardware.

“What we’ve done is representative of a very powerful trend,” said Feldman. “Our two companies were able to learn together at an extraordinary rate and more than double the size of our model in eight weeks. If you can increase the accuracy of your model by double digits every eight weeks, you’re building a huge amount of AI capacity.”

Jais-13B was too small for about half the use cases the partners wanted to address, but the new model is powerful enough to provide the in-depth responses needed by businesses. “We can now do a much more accurate summarisation, a much more accurate translation, and generally much more accurate content generation. Question and answer interactions are now more like GPT-4,” said Jackson.

“Jais-13B was an experiment,” he added. “We proved our case, and we got the feedback needed to drive a larger model. This is just the first release of our 30-billion-parameter model. We may have further releases down the line.”

People working on models for other languages have expressed interest in what the consortium is doing. “We know how to build tokenisers for different languages,” said Jackson. “We can share that knowledge with anyone else who wants to do this. What we’ve done can greatly improve quality of life in non-English-speaking regions.”
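Jackson’s point about tokenisers is concrete: a model’s vocabulary must be built from text in the target languages, or it will fragment non-English words into many small tokens. Below is a minimal sketch of training a bilingual byte-pair-encoding (BPE) tokeniser with the Hugging Face tokenizers library; the vocabulary size, special tokens, and file paths are illustrative, not the consortium’s actual settings.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Start from an empty BPE model with an explicit unknown token
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(
    vocab_size=64_000,  # illustrative size
    special_tokens=["[UNK]", "[PAD]", "<|endoftext|>"],
)

# Hypothetical mixed Arabic/English training files
tokenizer.train(["corpus/arabic.txt", "corpus/english.txt"], trainer)
tokenizer.save("bilingual_tokenizer.json")

# Arabic and English text now share one vocabulary
print(tokenizer.encode("مرحبا بالعالم hello world").tokens)
```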


