WELCOME TO R:Ed
People using an AI Tool to translate Shona. Photo credit - AI Generated

Shona, AI, and the Linguistic Data Gap in Africa

Introduction

Africa is home to over 2,000 languages, about a third of the world’s living languages, yet only a few are supported by AI translation tools. As a result, many African internet users who do not speak English or other mainstream languages struggle to access online information. Some languages, like Shona, are recognized by AI platforms but remain ‘low-resource’ because little written data is available for training large language models (LLMs). Most AI tools, including ChatGPT, focus on English and other widely documented languages. This deepens the digital divide, reinforces linguistic and economic inequities, and restricts access to AI-driven tools across Africa.

 

A Brief History of the Shona language

Shona is the most widely spoken language in Zimbabwe, serving as the first language of around 80% of the population, and one of the country’s official languages. Historically, it comprised approximately six mutually intelligible dialects: Korekore, Zezuru, Karanga, Manyika, Ndau, and Kalanga. Colonial intervention led to the unification of these dialects into a single standard language in 1931, when the linguist Clement Doke created a standardised writing system based on the Zezuru dialect. Shona is a tonal language, meaning that variations in tone and pitch can alter a word’s meaning. However, these tonal features are not represented in written Shona, as Doke believed that accents on letters would be confusing. Because tone is missing from the page, words that differ only in tone look identical in writing, which makes written Shona harder for AI systems to interpret.
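
To illustrate why unwritten tone matters for text-based systems, here is a minimal Python sketch. It assumes two made-up, tone-marked placeholder forms (they are not real Shona vocabulary, and the function name strip_tone_marks is invented for this example) and shows that once the accents are dropped, a program reading only the text can no longer tell them apart.

```python
import unicodedata


def strip_tone_marks(text: str) -> str:
    """Remove combining accents, imitating an orthography that omits tone."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical placeholder forms, NOT real Shona words:
# acute accent = high tone, grave accent = low tone.
high_tone_form = "bárá"
low_tone_form = "bàrà"

# The tone-marked forms are different strings...
print(high_tone_form == low_tone_form)
# ...but the written forms without tone marks are the same string.
print(strip_tone_marks(high_tone_form) == strip_tone_marks(low_tone_form))
```

A model trained only on the unmarked spelling has to recover the intended word from context, which is harder when little training text exists.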

 

The Challenge of Translation from English to Shona

Because English dominates AI training data, it is often used as a ‘pivot language’ in AI translation. This means that when translating between low-resource languages, the source text may first be translated into English and then from English into the target language. AI translation tools and neural machine translation systems, such as Google Translate, are largely based on Western concepts of equivalence, rather than cultural approaches to translation. In Shona, the word for translation, kushandura, means to change or alter, rather than to recreate meaning exactly. This highlights that AI is not only dominated by Western data but also reflects Western concepts. Equivalence-based translation overlooks cultural differences and linguistic untranslatability. Shona has grammatical rules and cultural specificities that differ greatly from English. This proved challenging during the Covid-19 pandemic, when translators struggled to express technical terms that had no Shona equivalent. Shona is also an agglutinative language: words are formed by linking morphemes (meaningful units of language) that indicate grammatical information, such as tense or number. This makes Shona’s structure more comparable to other agglutinative languages, such as Swahili or Turkish, than to English, further complicating AI translation.
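
The way a single Shona word can pack in what English spreads over several words can be sketched in a few lines of Python. This is a simplified illustration, not a real morphological analyser: the morpheme table is a tiny hand-written assumption covering only two example words, the glosses are rough, and the names SLOTS and segment are invented for this sketch. It also assumes every word has all four slots, which real Shona verbs do not require.

```python
from typing import Optional

# Illustrative, hand-written morpheme table (an assumption for this sketch,
# not a complete description of Shona grammar). Slots are tried in order.
SLOTS = [
    ("subject", {"ndi": "I"}),
    ("tense", {"no": "present", "cha": "future"}),
    ("object", {"ku": "you"}),
    ("root", {"da": "love", "ona": "see"}),
]


def segment(word: str) -> Optional[list[tuple[str, str, str]]]:
    """Split a word into one morpheme per slot, or return None if it does not fit."""

    def walk(rest: str, slot_index: int):
        if slot_index == len(SLOTS):
            # Succeed only if the whole word has been consumed.
            return [] if rest == "" else None
        slot, table = SLOTS[slot_index]
        for form, gloss in table.items():
            if rest.startswith(form):
                tail = walk(rest[len(form):], slot_index + 1)
                if tail is not None:
                    return [(slot, form, gloss)] + tail
        return None

    return walk(word, 0)


for word in ["ndinokuda", "ndichakuona"]:
    parts = segment(word)
    print(word, "=", "-".join(form for _, form, _ in parts))
    print("  gloss:", " + ".join(gloss for _, _, gloss in parts))
```

Each example is one written word in Shona but a full clause in English ("I love you", "I will see you"). A translation system trained mostly on English-like word boundaries has to learn these internal pieces from comparatively little data.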

 

Conclusion

Multilingual AI models matter because they widen access to research and enable global collaboration. Increasing the number of high-quality datasets for low-resource African languages, such as Shona, is key to developing AI systems with better cultural awareness. Initiatives such as the African Next Voices project and Masakhane are addressing the linguistic data gap by recording languages in diverse contexts and translating academic work into various African languages. Developing open, high-quality text and speech datasets allows AI to be better trained to support linguistic diversity.

Lauren Lisk
