Cissé, Solo Farabado
SMOL: Professionally translated parallel data for 115 under-represented languages
Caswell, Isaac, Nielsen, Elizabeth, Luo, Jiaming, Cherry, Colin, Kovacs, Geza, Shemtov, Hadar, Talukdar, Partha, Tewari, Dinesh, Diane, Baba Mamadi, Doumbouya, Koulako Moussa, Diane, Djibrila, Cissé, Solo Farabado
We open-source SMOL (Set of Maximal Overall Leverage), a suite of training data to unlock translation for low-resource languages (LRLs). SMOL has been translated into 115 under-resourced languages, including many for which there exist no previous public resources, for a total of 6.1M translated tokens. SMOL comprises two sub-datasets, each carefully chosen for maximum impact given its size: SMOL-Sent, a set of sentences chosen for broad unique token coverage, and SMOL-Doc, a document-level source focusing on broad topic coverage. They join the already released GATITOS for a trifecta of paragraph-, sentence-, and token-level content. We demonstrate that using SMOL to prompt or fine-tune Large Language Models yields robust ChrF improvements. In addition to translation, we provide factuality ratings and rationales for all documents in SMOL-Doc, yielding the first factuality datasets for most of these languages.
Machine Translation for Nko: Tools, Corpora and Baseline Results
Doumbouya, Moussa Koulako Bala, Diané, Baba Mamadi, Cissé, Solo Farabado, Diané, Djibrila, Sow, Abdoulaye, Doumbouya, Séré Moussa, Bangoura, Daouda, Bayo, Fodé Moriba, Condé, Ibrahima Sory 2., Diané, Kalo Mory, Piech, Chris, Manning, Christopher
Unfortunately, to date, there isn't any usable machine translation (MT) system for Nko, in part due to the unavailability of large text corpora required by state-of-the-art neural machine translation (NMT) algorithms. Nko is a representative case study of the broader issues that interfere with the goal of universal machine translation. Thousands of languages still don't have available or usable MT systems, mainly due to the unavailability of high-quality parallel [...]
[...] over 40 million people across West African countries including Mali, Guinea, Ivory Coast, Gambia, Burkina Faso, Sierra Leone, Senegal, Liberia, and Guinea-Bissau. Nko, which means 'I say' in all Manding languages, was developed as both the Manding literary standard language and a writing system by Soulemana Kanté in 1949 for the purpose of sustaining the strong oral tradition of Manding languages (Niane, 1974; Conde, 2017; Eberhard et al., 2023).