UCT Researchers Develop AI Model For 11 South African Languages

Researchers at the University of Cape Town have developed a pioneering artificial intelligence language model trained on all 11 of South Africa’s official written languages. The breakthrough aims to bridge a longstanding digital divide, improving access to AI technologies for millions previously overlooked by mainstream language tools.

By Mari van der Merwe.

A team at the University of Cape Town (UCT) has developed a new artificial intelligence (AI) language model trained specifically on South Africa’s 11 official written languages – helping close a gap that has left millions underserved by mainstream AI tools.

The research, which will be presented at the Language Resources and Evaluation Conference (LREC) in Mallorca, Spain, this month, introduces two interconnected contributions.

The first is MzansiText, a curated multilingual dataset covering the 11 official written languages. The second is MzansiLM, a language model trained from scratch on that dataset.

The work was led by Anri Lombard and Dr Jan Buys from UCT’s Department of Computer Science, together with Dr Francois Meyer and a broader team of collaborators.

The UCT researchers behind MzansiLM. From left: Simbarashe Mawere, Anri Lombard, Dr Jan Buys, and Dr Francois Meyer. (Photo: uct.ac.za)

The paper arrives at a moment when AI language tools have become part of daily life for millions of people worldwide. But for speakers of most South African languages, that reality looks quite different. Ask a popular AI assistant a question in isiNdebele or Sepedi, and the response is likely to be poor, inconsistent, or simply wrong.

The reason, the researchers explain, comes down to data.

“In language modelling, languages are considered low resource, primarily because there are much fewer and smaller textual datasets available in these languages for training language models,” said Dr Buys, a senior lecturer in the Department of Computer Science.

“Our dataset, MzansiText, is still small compared to data available for high-resource languages such as English and major European and Asian languages, but larger than previous datasets for South African languages.”

Nine of South Africa’s 11 official written languages fall into this low-resource category. Languages like isiZulu and isiXhosa have received some attention from the global research community, but others, including isiNdebele and Sepedi, have been largely overlooked. MzansiLM is believed to be the first publicly available decoder-only language model to explicitly target all 11 languages.

“There has been real progress in language modelling for African languages, including some South African ones like isiXhosa and isiZulu,” said Dr Meyer, a lecturer in the Department of Computer Science. “But most existing models only cover a subset of languages. With MzansiLM, we wanted to build a single model focused specifically on South Africa that covers all 11 official written languages, including those that are often left out.”

From master’s research to a baseline for the field

For Lombard, a master’s student in computer science, the project began with a recurring question in his research.

“I came into this work through my master’s research, which looks at how different language-model architectures perform for low-resource languages, since that is still a relatively underexplored area,” he explained. “One thing that stood out to me is that publicly available models tended to cover only a subset of the South African languages we care about. MzansiLM was meant to provide a small decoder-only baseline that future work can compare against and build on.”

The model itself, at 125 million parameters, is modest by the standards of today’s commercial AI systems. But the team’s tests showed it performing competitively on specific tasks, outperforming much larger open-source models on benchmarks in several South African languages.

On isiXhosa text generation, for instance, it produced results that competed with encoder-decoder models more than 10 times its size.
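For a sense of scale, the short sketch below counts the parameters of a generic GPT-2-style decoder-only model. The article does not give MzansiLM's actual architecture, so the vocabulary size, context length, hidden size and layer count used here are assumptions borrowed from the publicly documented GPT-2 small configuration, and the function name is purely illustrative. The point is only to show how a model in the region of 125 million parameters can be put together.

```python
def decoder_param_count(vocab_size, context_len, d_model, n_layers):
    """Count parameters of a GPT-2-style decoder with tied input/output embeddings.

    Assumed layout (not taken from the MzansiLM paper): learned token and
    position embeddings, and per layer two layer norms, one self-attention
    block and one feed-forward block with a 4x hidden expansion.
    """
    # Token embeddings plus learned position embeddings.
    embeddings = vocab_size * d_model + context_len * d_model
    # Each layer norm has a scale and a bias vector.
    layer_norm = 2 * d_model
    # Fused query/key/value projection, then the output projection (weights + biases).
    attention = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # Feed-forward block: expand to 4 * d_model, then project back (weights + biases).
    mlp = (d_model * 4 * d_model + 4 * d_model) + (4 * d_model * d_model + d_model)
    per_layer = 2 * layer_norm + attention + mlp
    # One final layer norm follows the last layer.
    return embeddings + n_layers * per_layer + layer_norm


# Hypothetical settings: these are GPT-2 small's, used only for illustration.
total = decoder_param_count(vocab_size=50257, context_len=1024, d_model=768, n_layers=12)
print(f"{total:,} parameters")
```

With these assumed settings the count comes to roughly 124 million. In this configuration nearly a third of the parameters sit in the token embeddings, so the size of the vocabulary matters a great deal in a model this small.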

Not a chatbot, but a foundation

It is worth being clear about what MzansiLM is and what it is not. Unlike tools such as ChatGPT or Claude, it is not designed for open-ended conversation. It is a base model – a foundation that developers and researchers can adapt for specific purposes through a process known as fine-tuning.

“In practice, that means developers could build tools for specific use cases; for example, summarising information or annotating raw data, in South African languages,” Meyer said. “Adapting MzansiLM for a limited use case might be more effective and affordable than relying on proprietary large language models, if you want users to be able to interact with a system in their home language.”

The more immediate benefits for everyday users will come from future, larger versions of the model and from systems built on top of this foundation. But the research also sheds light on a broader question: Why do even powerful commercial AI systems still struggle with languages other than English?

“Our findings show that the model can work well when fine-tuned for specific tasks but is not yet able to work well for general-purpose user interaction or instruction following, due to the limited training data,” Buys explained. “This helps to explain why even larger language models don’t yet work as well when used in languages other than English.”

An open research community is essential

The team is clear that MzansiLM is a step, not a destination. Closing the gap between South African languages and the capabilities now available in English will require sustained, collective effort.

“A lot of the progress we were able to make depends on earlier open research from the African Natural Language Processing research community, so continuing that openness is essential,” Lombard said.

“We still need better and broader data sources, stronger benchmarks, and the kind of shared datasets, models, code, and results that make it possible for others to reproduce and extend the work.”

Meyer echoed that view. “The research community plays an important role here by working openly, sharing datasets, models, and findings so others can build on them. That kind of openness is often what leads to progress, especially compared to proprietary systems where much of the data and methodology isn’t accessible.”

The UCT team has made both MzansiText and MzansiLM publicly available. The paper, “MzansiText and MzansiLM: An Open Corpus and Decoder-Only Language Model for South African Languages”, is available on arXiv. – news.uct.ac.za
