WAXAL: a large-scale open resource for African language speech technology

Anchoring in the African AI Ecosystem

Our commitment to working with and directly contributing to the African AI ecosystem has been crucial to the WAXAL project. The data collection effort was led entirely by African academic and community organizations, guided by Google experts on world-class data collection practices. This collaborative approach ensured that the corpus was built by and for the community it serves. With a shared methodology, each partner focused on a specific subset of languages.

Collaborative Partnerships

Our partners included Makerere University, which collected ASR and/or TTS data for nine languages, and the University of Ghana, which covered eight languages using the image-based ASR data collection methodology described above. Digital Umuganda, in partnership with Addis Ababa University, led the ASR collection for several regional languages. For the high-quality studio voice recordings, Media Trust, Loud n Clear, and the African Institute for Mathematical Sciences in Senegal led the TTS recordings in various regional languages.

Data Ownership and Open Access

This framework is rooted in the principle that our partners retain ownership of the data they collected, alongside a shared commitment to make all datasets freely available to the wider community. This deep collaboration and open-access philosophy has already enabled notable research and spin-off publications.

Research and Innovation

Through this framework, our partners have already enabled new research, such as the development of a recipe book for community-based collection of speech data from people with speech disorders. This research produced the first open-source dataset for Akan speakers with conditions such as cerebral palsy and stuttering, and demonstrated that in-person, picture-based elicitation is more effective than text-based prompts for these populations. This work provides a critical roadmap for developing inclusive voice technologies in low-resource settings.

Additionally, the initiative supported a major study that introduced a 5,000-hour speech corpus for five Ghanaian languages: Akan, Ewe, Dagbani, Dagaare, and Ikposo. Using a controlled crowdsourcing approach to capture natural, spontaneous intonation, this work established an infrastructure for building robust ASR and TTS systems adapted to the linguistic diversity of West Africa.

Other key research focused on a comparative analysis of four state-of-the-art models (Whisper, XLS-R, MMS, and w2v-BERT) across 13 African languages. This study analyzed how performance scales with increasing training data, providing key insights into data effectiveness and highlighting that scaling benefits depend heavily on linguistic complexity and domain alignment.

Finally, a systematic literature review was published, cataloging 74 datasets covering 111 African languages to map the current frontier of voice technology. This review highlighted the urgent need for multi-domain conversational corpora and for the adoption of linguistically informed metrics, such as character error rate (CER), to better assess performance in morphologically rich and tonal languages.
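To make the CER point concrete, here is a minimal sketch of how character error rate is commonly computed, next to word error rate (WER) for comparison. It uses the standard edit-distance definition; the function names, the NFC normalization step, and the placeholder strings are assumptions made for illustration and are not taken from the review or from the WAXAL tooling.

```python
import unicodedata


def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (characters or words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a reference unit
                curr[j - 1] + 1,         # insert a hypothesis unit
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edits divided by reference length."""
    # Assumption: normalize to NFC so that a tone-marked vowel typed as
    # base letter + combining accent equals its precomposed form.
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Placeholder strings (not real words): one wrong character in a long word.
reference = "abcdefghij klm"
hypothesis = "abcdefghix klm"
print(f"WER = {wer(reference, hypothesis):.3f}")  # 1 of 2 words wrong
print(f"CER = {cer(reference, hypothesis):.3f}")  # 1 of 14 characters wrong
```

In this toy case a single character error makes half the words wrong under WER but only 1 in 14 characters wrong under CER. That gap is why character-level metrics are often preferred when words are long and carry many morphemes.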

