Technologies for Enhanced Documentation of African Languages (TEDAL)

Workshop organized by Felix Ameka, Emmanuel Ngué Um, Sara Petrollino, Mmasibidi Setaka and Daan van Esch

Short description

This proposal is the Lorentz-eScience winner 2021.

Language documentation in Africa has increased over the past 20 years, with many researchers documenting languages for which little or no record existed. Despite this, several bottlenecks slow down the current workflow and hamper the application of cutting-edge AI technologies to these less-resourced languages. As a consequence, less-resourced languages are left behind and cannot benefit from the recent advances in machine learning and automatic speech recognition that have played an important role in enhancing the visibility and information flow of major world languages.

Typical bottlenecks in the language documentation pipeline are the transcription of audio and video recordings, data annotation (e.g. translation, POS tagging), and the lack of computational experience among language documenters. If these bottlenecks are not addressed, language documentation processes cannot be enhanced. This not only limits the availability of documentation output, that is, records of linguistic practices, but also widens the digital divide that separates less-resourced languages from major world languages.

The workshop seeks to be a testing ground for data and software carpentry where documentary linguists, language community members, computer scientists and software developers can exchange their knowledge and experience about the application of AI technologies to African languages and the field of documentary linguistics. In this regard, the organisers wish to align the language documentation agenda with the national AI research agenda (AIREA-NL), and to contribute to a human-centered AI approach as advocated by the Hybrid Intelligence Centre and Humane AI.

The workshop builds on discussions and lessons learnt from the workshop hosted by the Lorentz Center in 2019: Digital Humanities – the perspective of Africa (https://dhafrica.blog/lorentzworkshop/) and follow-up meetings such as the ones that took place in the context of the UNESCO conference LT4All, and the Global Digital Humanities Symposium at Michigan State University. 

The workshop will provide academics and professionals who work with African languages with practical training and technological solutions: the programme will feature two invited talks and hands-on sessions on specific machine learning applications that are now being tested in language documentation (such as ELPIS, OCTRA, and others), followed by knowledge exchange sessions moderated by Masakhane (NLP for African languages) and by Dorothy Gordon, Chair of the UNESCO programme Information for All. Selected participants will be required to have recorded language data at various stages of the documentation pipeline (e.g. transcribed and untranscribed audio/video recordings, annotated texts, etc.), and they will work on their own data under the guidance of computer scientists from both academic and non-academic institutions. The knowledge exchange sessions will consist of interactive discussion groups in which participants will discuss how current technology can be shaped to meet the needs of documentary linguists, and how existing technology can be incorporated into language documentation to enhance the discipline.

The workshop will take place at the end of May 2021, as a satellite event of the 10th World Congress of African Linguistics (WOCAL).


Programme

All times in CET

DAY 1 – MAY 31

11:00 Platform opens
13:00 – 14:00 Intro and welcome speech 

TALK 1: Tools for the new linguist – Language technology working for you. – Dorothy Gordon (UNESCO programme Information for All)
14:00-14:10 Pitch and sum-up of talk
14:15-14:30 Discussions in break-out rooms
14:30-14:45 Coffee break 
14:45-15:00 Plenary discussion with Dorothy Gordon
15:00-15:30 Networking

15:30-16:30 Knowledge exchange session 1: Christoph Draxler (Ludwig-Maximilians-Universität München and CLARIN D) “OCTRA”

16:30 – 17:30 Knowledge exchange session 2: Vukosi Marivate (University of Pretoria, Data Science for Social Impact, Masakhane), “Recipes for low-resource language research: a practical perspective”.  


17:30 Online borrel (informal drinks)

DAY 2 – JUNE 1

TALK 2: Data accessibility and discoverability in language documentation – Mandana Seyfeddinipur (ELDP & ELAR)
10:00-10:10 Pitch and sum-up of talk
10:15 – 10:30 Discussions in break-out rooms
10:30 – 10:45 Coffee break
10:45-11:00 Plenary discussion with Mandana Seyfeddinipur
11:00-11:15 Networking time 

11:15-12:15 Knowledge exchange session 3: Antonis Anastasopoulos (George Mason University) “Language Technology for Language Documentation: What it can and cannot do”

12:15-13:15 (Lunch) break

13:15-14:15 Knowledge exchange session 4: Martha Yifiru Tachbelie (University of Addis Ababa), “Multilingual ASR for Ethiopian languages”


14:15 – 14:30 Coffee break

14:30-15:30 Knowledge exchange session 5: Moses Ekpenyong (University of Uyo), “SCANNAL – Mining Linguistic Corpora for Intelligent Tone Languages Documentation”

DAY 3 – JUNE 2

09:30 ELPIS day with one-to-one Q&A sessions (Daan van Esch and Ben Foley)


15:00 Networking tea-time

15:30 Afternoon booth with DIGITAL UMUGANDA (Founder and CEO Audace Niyonkuru)

DAY 4 – JUNE 3

09:00 SADiLaR day with Mmasibidi Setaka & Juan Steyn (SADiLaR), Tools and workflow for creating ASR-oriented data



09:00 – 09:30 Introduction: Speech data collection through web interface
09:30 – 10:00 Hands-on session
10:00 – 10:15 Coffee break
10:15 – 11:00 Introduction: Woefzela and similar approaches
11:00 – 11:15 Coffee break
11:15 – 12:40 Hands-on session
12:40 – 13:00 Conclusion

15:00 Networking tea-time

DAY 5 – JUNE 4

TALK 3 – Neural attention for Language Documenters: towards explainable linguistics with deep learning – Stephan Raaijmakers, Prof. of Communicative AI at Leiden University
10:00 – 10:10 Pitch and sum-up of talk
10:15 – 10:30 Discussions in break-out rooms:
1. [Resources] The feasibility of creating hand-annotated resources for African (and in general: low-resource) languages
2. [Community building] Organizing an NLP/AI community for African languages: existing and emerging structures
3. [Education/training] Educating/training language documenters for working with AI-methods: needs, possibilities, challenges
4. [Applications] Self-identified applications of neural attention (and other AI) methods for supporting language documenters 

10:30 – 10:45 Coffee break
10:45-11:00 Plenary discussion with Stephan Raaijmakers


11:00-11:15 break

11:15-12:15 Knowledge exchange session 6: Technologies for Sign Languages (Victoria Nyst & Manolis Fragkiadakis, Leiden U Centre for Digital Humanities)

12:15-13:30 (Lunch) break

TALK 4 – AI for Development – Kathleen Siminyu, regional coordinator at AI4D
13:30 – 13:40 Pitch and sum-up of talk
13:45 – 14:00 Discussions in break-out rooms: 
* Multi-disciplinary collaboration for AI development
* Platforms, tools, and competitive challenges for advancing AI research and innovation
* Future of work in Africa powered by language tools
* Future of education in Africa powered by language tools

14:00 – 14:15 Coffee break
14:15 – 14:30 Plenary discussion with Kathleen Siminyu  

14:30-15:15 Roundtable and evaluation of the technology tested during the workshop, results, next steps, group reports. 

15:20-15:30 Closing

Sponsors: