Development of a Prosodic Read Speech Syllabic Corpus of the Yoruba Language

Akintoba Emmanuel Akinwonmi. Published in Information Sciences.

Communications on Applied Electronics
Year of Publication: 2021
Publisher: Foundation of Computer Science (FCS), NY, USA
Authors: Akintoba Emmanuel Akinwonmi

The literature shows that the need for an annotated database of speech text or audio files is justified primarily by the need for corpora on which to conduct basic Natural Language Processing (NLP) studies of a language. Such investigations traverse the phonetic, aural and etymological representations of the language. Research of interest can also span grammatical, semantic, pragmatic and syntactic characterizations of the language. At a secondary level, an annotated speech corpus is desirable for speech synthesis, typified by Text-to-Speech (TTS), and for speech recognition, as in Speech-to-Text (STT). The Yoruba language is widely used but resource-scarce: its digital resources are sparse, and its computerization poses unique challenges. An annotated speech corpus is one such resource. Hence, this research was motivated by the need to contribute to the scanty resources for the language. Textual inputs were mined from four sources: two works of Standard Yoruba (SY) fiction, an SY grammar textbook and an SY online scripture. A hybrid of the Falaschi scheme and the add-on procedure of Radová and Vopálka was applied to extract a phonetically balanced text bag of 7376 phrases and sentences, with a view to minimizing the extraction cost while maximizing phonetic coverage of all Standard Yoruba syllabic events. The selected text was read by an expert, recorded in a suitable environment and saved as wave files. The wave files were annotated with Praat. A relational database was developed to host the corpus metadata. The corpus performed impressively when tested with a Standard Yoruba TTS. This paper presents the design, implementation, results and other useful information about the research.
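The idea of selecting a small text bag that still covers every syllabic event can be illustrated with a generic greedy coverage loop. The sketch below is not the hybrid Falaschi and add-on procedure used in this research; it is a simplified stand-in that assumes each candidate sentence has already been split into syllables. The function name `greedy_select`, the sentence identifiers and the toy syllables are all hypothetical.

```python
from typing import Dict, List, Set


def greedy_select(candidates: Dict[str, List[str]], target: Set[str]) -> List[str]:
    """Greedily pick sentences until no remaining sentence adds a missing syllable.

    candidates maps a sentence identifier to its (pre-computed) syllable list.
    target is the set of syllables the selected text should cover.
    """
    uncovered = set(target)
    remaining = dict(candidates)
    selected: List[str] = []
    while uncovered and remaining:
        # Prefer the sentence adding the most missing syllables;
        # on a tie, prefer the shorter sentence (lower reading cost).
        best = max(
            remaining,
            key=lambda s: (len(uncovered & set(remaining[s])), -len(remaining[s])),
        )
        gain = uncovered & set(remaining[best])
        if not gain:
            break  # nothing left can improve coverage
        selected.append(best)
        uncovered -= gain
        del remaining[best]
    return selected


# Toy data (hypothetical): sentence ids mapped to illustrative syllables.
candidates = {
    "s1": ["ba", "ba", "lo"],
    "s2": ["ba", "lo", "wa", "ti"],
    "s3": ["ti", "de"],
    "s4": ["wa"],
}
target = {syl for syls in candidates.values() for syl in syls}
selected = greedy_select(candidates, target)
print(selected)
```

In this toy run two of the four sentences are enough to cover all five syllables, which is the kind of trade-off between extraction cost and phonetic coverage that the abstract describes.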


Speech, Corpus, Yoruba Language, Chunk, Syllabification