Artificial intelligence (AI) tools such as ChatGPT, DeepSeek, Siri or Google Assistant are developed in the Global North and trained mainly on English, Chinese or European languages. By comparison, African languages are sorely under-represented online.
A team of African computer scientists, linguists, language specialists and others has been working on exactly this problem for two years. The African Next Voices project, funded largely by the Gates Foundation (with additional funding from Meta) and involving a network of African universities and organizations, recently released the largest African-language speech dataset for AI to date. We asked the team about the project, which operates in Kenya, Nigeria and South Africa.
Why is language so important to AI?
Language is how we communicate, ask for help, and hold meaning in community. We use it to organize complex thoughts and share ideas. It’s the tool we use to tell the AI what we want and decide if it has understood us.
We are seeing a growing number of applications that rely on artificial intelligence, from education to health to agriculture. The models behind them are trained on large amounts of language data, mostly text and speech. They are called large language models, or LLMs, and they exist for only a small fraction of the world’s languages.
Read more: AI in Africa: 5 issues to solve for digital equality
Languages also carry culture, values and local wisdom. If an AI doesn’t speak our languages, it can’t reliably understand our intentions, and we can’t trust or verify its responses. In short: without language, AI can’t communicate with us – and we can’t communicate with it. Therefore, creating artificial intelligence in our languages is the only way AI can help humans.
If we limit whose language can be modeled, we risk losing much of human culture, history and knowledge.
Why are African languages missing and what are the implications for AI?
The development of languages is intertwined with human history. Many peoples who experienced colonialism and empire saw their languages marginalized and left under-resourced while colonial languages were developed. As a result, African languages are seldom recorded, including on the internet.
There is therefore not enough high-quality digital text and speech to train and evaluate robust AI models. This disadvantage is the result of decades of policy decisions that favored colonial languages in schools, media and government.
Read more: AI chatbots can improve public health in Africa – why language inclusion matters
Speech data is just one of the things missing. Dictionaries, terminology lists and glossaries are scarce too. Basic tools are few and far between, and many other gaps raise the cost of creating datasets. These include African-language keyboards, fonts, spell checkers and tokenizers (which break text into smaller chunks so a language model can process it), as well as orthographic variation (regional differences in how words are spelled), tone marking, and a rich variety of dialects.
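To see why tokenizers matter, here is a minimal sketch in Python. The vocabularies and the segmentation routine are invented for illustration (they are not the project’s actual tooling): a subword vocabulary built for English shatters an agglutinative isiZulu word into meaningless characters, while one that knows the language’s morphemes preserves its structure.

```python
# Toy illustration of why tokenization matters for African languages.
# The vocabularies below are made up for this example; they are not
# the project's actual tokenizer or data.

def whitespace_tokenize(text):
    """Split on spaces: fine for English, blind to word-internal structure."""
    return text.split()

def greedy_subword_tokenize(word, vocab):
    """Greedy longest-match segmentation against a subword vocabulary,
    falling back to single characters for unknown spans."""
    tokens = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens

# isiZulu: "ngiyakuthanda" ("I love you") is one word built from the
# morphemes ngi- (I) + -ya- (present tense) + -ku- (you) + -thanda (love).
word = "ngiyakuthanda"

# An English-centric subword vocabulary shatters it into characters...
print(greedy_subword_tokenize(word, {"the", "ing", "tion", "er"}))
# -> ['n', 'g', 'i', 'y', 'a', 'k', 'u', 't', 'h', 'a', 'n', 'd', 'a']

# ...while a vocabulary that knows isiZulu morphemes keeps meaning intact.
print(greedy_subword_tokenize(word, {"ngi", "ya", "ku", "thanda"}))
# -> ['ngi', 'ya', 'ku', 'thanda']
```

A model fed the fragmented version has to relearn the word’s grammar from scratch, which is part of why tooling built only on European-language data performs so poorly on African languages.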
The result is artificial intelligence that works poorly and sometimes unsafely: wrong translations, poor transcriptions, and systems that barely understand African languages.
In practice, this prevents many Africans from accessing global news, educational materials, healthcare information and the potential productivity gains of artificial intelligence in their own languages.
When a language is missing from the data, its speakers are missing from the product, so AI can’t be safe, useful or fair to them. They lack the language technology tools needed to support service delivery. This alienates millions of people and widens the technology gap.
What does your project do about it and how?
Our primary goal is to collect speech data for automatic speech recognition (ASR), the technology that converts spoken language into written text. ASR is already a standard tool for widely spoken languages.
The larger goal of our project is to investigate how ASR data can be collected and how much data is needed to build ASR tools. We aim to share what we learn across different geographical regions.
The data we collect is diverse: spontaneous and read speech, across domains including everyday conversation, healthcare, financial inclusion and agriculture. We collect data from people of all ages, genders and backgrounds.
Every recording is collected with informed consent, fair compensation and clear terms on data rights. We transcribe according to language-specific guidelines and run many other technical checks.
In Kenya, we collect voice data in five languages through the Maseno Center for Applied AI. We record languages from three main groups: Nilotic (Dholuo, Maasai and Kalenjin), Cushitic (Somali) and Bantu (Kikuyu).
Read more: What do Nigerian children think about computers? Our research showed
With Data Science Nigeria, we collect speech in five widely spoken languages: Bambara, Hausa, Igbo, Nigerian Pidgin and Yoruba. The dataset aims to accurately reflect authentic language use in these communities.
In South Africa, working through the Data Science for Social Impact Lab and its collaborators, we recorded seven South African languages. The aim is to reflect the country’s rich linguistic diversity: isiZulu, isiXhosa, Sesotho, Sepedi, Setswana, isiNdebele and Tshivenda.
Importantly, this work does not happen in isolation. We build on the momentum and ideas of the Masakhane Research Foundation, Lelapa AI, Mozilla Common Voice, EqualyzAI and many other organizations and individuals who have pioneered African-language models, data and tools.
Each project strengthens the others, and together they form a growing ecosystem committed to making African languages visible and usable in the age of AI.
How can it be used?
The data and models will be useful for local-language captioning, voice assistants for agriculture and health, and call centers and customer support in these languages. The data will also be archived for cultural preservation.
Read more: Hype and Western values shape AI reporting in Africa: what needs to change
Larger, balanced, publicly available datasets of African languages will allow us to bring text and speech resources together. The models will not only be experimental but also useful for chatbots, training tools and local service delivery. Beyond datasets, there is an opportunity to build ecosystems of tools (spell checkers, dictionaries, translation systems, summarization engines) that bring African languages to life in digital spaces.
In short, we combine ethically collected, high-quality speech data with models. The goal is to let people speak naturally, be accurately understood, and use AI in the languages they live their lives in.
What’s next for the project?
This project collected voice data for only some languages. What about the rest? What about other tools, like machine translation or grammar checkers?
We will continue to work across multiple languages, ensuring that we build data and models that reflect how Africans use their languages. We prefer smaller language models that are efficient and accurate in the African context.
The challenge now is integration: to make these pieces work together so that African languages are not only represented in separate demos, but also on real platforms.
One of the lessons from this project and others like it is that data collection is only the first step. It is important to ensure that data is curated, reused and linked to communities of practice. “Next” for us is making sure the ASR guidelines we develop can connect to other African efforts.
Read more: Does AI pose an existential risk? We asked 5 experts
We also need to ensure sustainability: that students, researchers and innovators have continuing access to compute (processing power), teaching materials and licensing frameworks (such as NOODL or Esethu). The long-term vision is to enable choice: for a farmer, teacher or local business to use AI in isiZulu, Hausa or Kikuyu, not just English or French.
If we’re lucky, AI built in African languages won’t just catch up. It will set new standards for inclusive, responsible AI around the world.
This article is republished from The Conversation, a not-for-profit independent news organization that provides facts and sound analysis to help make sense of our complex world. Written by: Vukosi Marivate, University of Pretoria; Ife Adebara, University of Alberta; and Lilian Wanzare, Maseno University.
Vukosi Marivate is a founder of Lelapa AI. His lab, the Data Science for Social Impact group (DSFSI), is funded by the Gates Foundation, Meta and Google.org, and he holds the ABSA UP Chair of Data Science. Vukosi is a founder of the Deep Learning Indaba and the Masakhane Research Foundation, and a member of the Partnership on AI and the Higher Education Council of South Africa.
Ife Adebara is the founder and chief technology officer of EqualyzAI. She receives funding from the Gates Foundation, Lacuna and the University of British Columbia and is affiliated with Data Science Nigeria.
Lilian Wanzare receives funding from the Gates Foundation. She is affiliated with Maseno University and the Utah AI Foundation.