WAXAL: a large-scale open resource for African language speech technology

Anchoring in the African AI Ecosystem

Our commitment to working with and directly contributing to the African AI ecosystem has been crucial to the WAXAL project. The data collection effort was led entirely by African academic and community organizations, guided by Google experts on world-class data collection practices. This collaborative approach ensured that the corpus was built by and for the community it serves. With a shared methodology, each partner focused on a specific subset of languages.

Collaborative Partnerships

Our partners included Makerere University, which collected ASR and/or TTS data for nine languages, and the University of Ghana, which covered eight languages using the image-based ASR data collection methodology described above. Another key collaborator was Digital Umuganda, which, in partnership with Addis Ababa University, led the ASR collection for several regional languages. For the studio-quality voice recordings, Media Trust, Loud n Clear, and the African Institute for Mathematical Sciences in Senegal led the TTS recordings in various regional languages.

Data Ownership and Open Access

This framework rests on two principles: our partners retain ownership of the data they collect, and all partners share a commitment to making every dataset freely available to the wider community. This combination of close collaboration and open access has already enabled notable research and spin-off publications.

Research and Innovation

Through this framework, our partners have already enabled new research, such as the development of a recipe book for community-led collection of speech from people with speech disorders. This research resulted in the first open-source dataset for Akan speakers with conditions such as cerebral palsy and stuttering, and demonstrated that in-person, picture-based elicitation is more effective than text-based prompts for these populations. This work provides a critical roadmap for developing inclusive voice technologies in low-resource settings.

Additionally, the initiative supported a major study that introduced a 5,000-hour speech corpus for five Ghanaian languages: Akan, Ewe, Dagbani, Dagaare, and Ikposo. This work established an infrastructure to build robust ASR and TTS systems adapted to the linguistic diversity of West Africa using a controlled crowdsourcing approach to capture natural and spontaneous intonations.

Other key research focused on a comparative analysis of four state-of-the-art models (Whisper, XLS-R, MMS, and W2v-BERT) across 13 African languages. This study analyzed how performance scales with increasing training data, providing key insights into data efficiency and highlighting that scaling benefits are highly dependent on linguistic complexity and domain alignment.

Finally, a systematic literature review was published, cataloging 74 datasets covering 111 African languages to map the current frontier of voice technology. This review highlighted the urgent need for multi-domain conversational corpora and the adoption of linguistically informed metrics, such as character error rate (CER), to better assess performance in morphologically rich and tonal languages.
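To make the point about CER concrete, here is a minimal sketch in Python of how CER and word error rate (WER) can be computed from an edit distance. It is not taken from the review or from the WAXAL tooling: the function names are hypothetical, the Unicode NFC normalization step is an assumption about sensible preprocessing for text with tone diacritics, and the Swahili word pair is only an illustration, not a sample from any of the datasets mentioned here.

```python
import unicodedata


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edits divided by reference length."""
    # Assumption: normalize to NFC so that a letter plus a combining tone
    # mark and its precomposed form are compared as the same character.
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word edits divided by number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    # Illustrative pair: one long, morphologically complex word that
    # differs from the reference by a single character.
    reference, hypothesis = "ninakupenda", "nitakupenda"
    print(f"WER = {wer(reference, hypothesis):.2f}")
    print(f"CER = {cer(reference, hypothesis):.2f}")
```

In this example a single wrong character counts as a fully wrong word under WER, while CER registers one error out of eleven characters. That difference is one reason a character-level metric can give a fairer picture for languages in which a single word carries several morphemes.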
