Technologies for enhanced Documentation of African languages: creating synergies

Workshop organized by Felix Ameka, Emmanuel Ngué Um, Sara Petrollino, Mmasibidi Setaka and Daan van Esch

Short description

This proposal is the 2021 Lorentz-eScience winner.

Language documentation in Africa has been increasing over the past 20 years, with many researchers documenting languages for which little or no record existed. Despite this, several bottlenecks slow down the current workflow and hamper the application of cutting-edge AI technologies to these less-resourced languages. As a consequence, less-resourced languages are left behind and cannot benefit from the recent advances in machine learning and automatic speech recognition that have played an important role in enhancing the visibility and information flow of major world languages.

Typical bottlenecks in the language documentation pipeline are the transcription of audio or video recordings, data annotation (e.g. translation, POS tagging), and the lack of computational experience among language documenters. If these bottlenecks are not addressed, language documentation processes cannot be enhanced. This not only limits the availability of documentation output, that is, the records of linguistic practices, but also widens the digital divide that separates less-resourced languages from major world languages.

The workshop seeks to be a testing ground for data and software carpentry, where documentary linguists, language community members, computer scientists and software developers can exchange their knowledge and experience about the application of AI technologies to African languages and the field of documentary linguistics. In this regard, the organisers wish to align the language documentation agenda with the national AI research agenda (AIREA-NL) and to contribute to a human-centered AI approach as advocated by the Hybrid Intelligence Centre and Humane AI.

The workshop builds on discussions and lessons learnt from the workshop hosted by the Lorentz Center in 2019, Digital Humanities – the perspective of Africa, and on follow-up meetings such as those that took place in the context of the UNESCO conference LT4All and the Global Digital Humanities Symposium at Michigan State University.

The workshop will provide academics and professionals who work with African languages with practical training and technological solutions. The programme will feature two invited talks and hands-on sessions on specific machine learning applications that are now being tested in language documentation (such as ELPIS, OCTRA, Woolaroo, and others), followed by knowledge-exchange sessions moderated by Masakhane (NLP for African languages) and by Dorothy Gordon, Chair of the UNESCO programme Information for All. Selected participants will be required to have recorded language data at various stages of the documentation pipeline (e.g. transcribed and untranscribed audio/video recordings, annotated texts, etc.), and they will work on their own data under the guidance of computer scientists from both academic and non-academic institutions. The knowledge-exchange sessions will consist of interactive discussion groups in which participants will discuss how current technology can be shaped to serve the needs of documentary linguists, and how existing technology can be incorporated into language documentation to enhance the discipline.

The workshop will take place at the end of May 2021, as a satellite event of the 10th World Congress of African Linguistics (WOCAL).

Funded by: