Technologies for Enhanced Documentation of African Languages (TEDAL)

Workshop organized by Felix Ameka, Emmanuel Ngué Um, Sara Petrollino, Mmasibidi Setaka and Daan van Esch

Short description

This proposal is the Lorentz-eScience winner 2021.

Language documentation in Africa has increased over the past 20 years, with many researchers documenting languages for which little or no record existed. Despite this, several bottlenecks slow down the current workflow and hamper the application of cutting-edge AI technologies to these less-resourced languages. As a consequence, less-resourced languages are left behind and cannot benefit from the recent advances in machine learning and automatic speech recognition that have played an important role in enhancing the visibility and information flow of major world languages.

Typical bottlenecks in the language documentation pipeline are the transcription of audio or video recordings, data annotation (e.g. translation, POS tagging), and the lack of computational experience among language documenters. If these bottlenecks are not addressed, language documentation processes cannot be enhanced. This not only limits the availability of documentation output, in the form of records of linguistic practices, but also widens the digital divide that separates less-resourced languages from major world languages.

The workshop seeks to be a testing ground for data and software carpentry where documentary linguists, language community members, computer scientists and software developers can exchange their knowledge and experience about the application of AI technologies to African languages and the field of documentary linguistics. In this regard, the organisers wish to align the language documentation agenda with the national AI research agenda (AIREA-NL), and to contribute to a human-centered AI approach as advocated by The Hybrid Intelligence Centre and Humane AI.

The workshop builds on discussions and lessons learnt from the workshop hosted by the Lorentz Center in 2019: Digital Humanities – the perspective of Africa (https://dhafrica.blog/lorentzworkshop/) and follow-up meetings such as the ones that took place in the context of the UNESCO conference LT4All, and the Global Digital Humanities Symposium at Michigan State University. 

The workshop will provide academics and professionals who work with African languages with practical training and technological solutions. The programme will feature two invited talks and hands-on sessions on specific machine learning applications that are now being tested in language documentation (such as ELPIS, OCTRA, and others), followed by knowledge exchange sessions moderated by Masakhane (NLP for African languages) and by Dorothy Gordon, Chair of the UNESCO programme Information for All. Selected participants will be required to have recorded language data at various stages of the documentation pipeline (e.g. transcribed and untranscribed audio/video recordings, annotated texts, etc.), and they will work on their own data under the guidance of computer scientists from both academic and non-academic institutions. The knowledge-exchange sessions will consist of interactive discussion groups in which participants will discuss how current technology can be shaped to serve the needs of documentary linguists, and how existing technology can be incorporated into language documentation to enhance the discipline.

The workshop will take place at the end of May 2021, as a satellite event of the 10th World Congress of African Linguistics (WOCAL).


Programme

All times in CET

DAY 1 – MAY 31

12:15 Walk-in moment
12:45 Welcome speech by Lorentz Center
13:00 Intro speech by TEDAL organisers
13:15 Introduction of workshop participants

TALK 1: Tools for the new linguist – Language technology working for you. Dorothy Gordon (UNESCO programme Information for All)
14:00-14:30 Live talk
14:30-14:45 Discussions in break-out rooms
14:45-15:00 Plenary discussion with Dorothy Gordon

15:00-15:30 Coffee break and networking

15:30-16:30 Knowledge exchange session 1: Christoph Draxler (Ludwig-Maximilians-Universität München and CLARIN D) “OCTRA”

16:30 – 17:30 Knowledge exchange session 2: Vukosi Marivate (University of Pretoria, Data Science for Social Impact, Masakhane), “Recipes for low-resource language research: a practical perspective”

DAY 2 – JUNE 1

TALK 2: From assemblages to corpora and back: How to enable broadest usage of documentary collections – Mandana Seyfeddinipur (ELDP & ELAR)
10:00-10:25 Live talk
10:25 – 10:45 Discussions in break-out rooms
10:45 – 11:00 Plenary discussion with Mandana Seyfeddinipur

11:00-11:15 Coffee break and networking time 

11:15-12:15 Knowledge exchange session 3: Antonis Anastasopoulos (George Mason University) “Language Technology for Language Documentation: What it can and cannot do”

12:15-13:15 (Lunch) break (off-line)

13:15-14:15 Knowledge exchange session 4: Martha Yifiru Tachbelie (University of Addis Ababa), “Multilingual ASR for Ethiopian languages”


14:15 – 14:30 Coffee break (off-line)

14:30-15:30 Knowledge exchange session 5: Moses Ekpenyong (University of Uyo), “SCANNAL – Mining Linguistic Corpora for Intelligent Tone Languages Documentation” 

DAY 3 – JUNE 2

09:30-10:15  ELPIS workshop, Daan van Esch and Ben Foley

10:30 – 11:30 One-to-one Q&A sessions on ELPIS

15:00-15:30 Networking tea-time

15:30-16:30 Afternoon booth with Audace Niyonkuru, founder and CEO of DIGITAL UMUGANDA

DAY 4 – JUNE 3

09:00 Mmasibidi Setaka & Juan Steyn (SADiLaR), Tools and workflow for creating ASR-oriented data



09:00 – 09:30 Introduction: Speech data collection through web interface
09:30 – 10:00 Hands-on session
10:00 – 10:15 Coffee break
10:15 – 11:00 Introduction: Woefzela and similar approaches
11:00 – 11:15 Coffee break
11:15 – 12:40 Hands-on session
12:40 – 13:00 Conclusion

DAY 5 – JUNE 4

TALK 3 – Neural attention for Language Documenters: towards explainable linguistics with deep learning – Stephan Raaijmakers, Prof. of Communicative AI at Leiden University
10:00 – 10:15 Pitch and sum-up of talk
10:15 – 10:30 Discussions in break-out rooms:

Room 1. [Resources] The feasibility of creating hand-annotated resources for African (and in general: low-resource) languages

Room 2. [Community building] Organizing an NLP/AI community for African languages: existing and emerging structures

Room 3. [Education/training] Educating/training language documenters for working with AI-methods: needs, possibilities, challenges

Room 4. [Applications] Self-identified applications of neural attention (and other AI) methods for supporting language documenters 

10:30 – 10:45 Plenary discussion with Stephan Raaijmakers

10:45 – 11:15 Coffee break and networking

11:15-12:15 Knowledge exchange session 6: Technologies for Sign Languages (Victoria Nyst & Manolis Fragkiadakis, Leiden U Centre for Digital Humanities)

12:15-13:30 (Lunch) break (off-line)

TALK 4 – AI for Development, Kathleen Siminyu, regional coordinator at AI4D
13:30 – 14:00 Live talk
14:00 – 14:15 Discussions in break-out rooms: 


Room 1: Multi-disciplinary collaboration for AI development


Room 2: Platforms, tools, and competitive challenges for advancing AI research and innovation

Room 3: Future of work in Africa powered by language tools


Room 4: Future of education in Africa powered by language tools

14:15 – 14:30 Plenary discussion with Kathleen Siminyu 

14:30 – 14:45 Coffee break  

14:45-15:30 Roundtable and evaluation of the technology tested during the workshop, results, next steps, group reports. Closing.

Sponsors: