[Figure: round-trip English translation scores for each language, plotted against the number of sample sentences available. Languages with more example sentences generally score better; outliers such as English in Cyrillic have very few examples but still translate well.]

What do you do after you have collected writing samples for a thousand languages for the purpose of translation, and humans still rate the resulting translations a failure? That is the challenge Google machine learning scientists described this month in a massive research paper on multilingual translation, "Building Machine Translation Systems for the Next Thousand Languages."
Facebook AI is introducing M2M-100, the first multilingual machine translation (MMT) model that translates between any pair of 100 languages without relying on English data. When translating, say, Chinese to French, previous best multilingual models trained on Chinese-to-English and English-to-French data, because English training data is the most widely available. Our model trains directly on Chinese-to-French data to better preserve meaning. It outperforms English-centric systems by 10 points on BLEU, the most widely used metric for evaluating machine translations. M2M-100 is trained on a total of 2,200 language directions -- 10x more than previous best English-centric multilingual models.
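For readers unfamiliar with the BLEU metric mentioned above, here is a minimal sentence-level sketch of the idea in Python: clipped n-gram precision combined with a brevity penalty. This is an illustration only, not the implementation used in these papers; real evaluations use corpus-level BLEU with smoothing, typically via the sacreBLEU tool.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(candidate, reference, max_n=4):
    """Minimal BLEU sketch: clipped n-gram precision for n = 1..4,
    combined by geometric mean and scaled by a brevity penalty.
    No smoothing, single reference -- for illustration only."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clip each candidate n-gram count by its count in the reference,
        # so repeating a matched word cannot inflate the score.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(len(cand) - n + 1, 0)
        if total == 0 or overlap == 0:
            return 0.0  # an unsmoothed zero precision zeroes the whole score
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: discourage candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(log_precisions) / max_n)
```

A perfect match scores 1.0, a translation sharing no 4-grams with the reference scores 0.0 here, and partial overlaps fall in between; published BLEU scores are usually reported on a 0-100 scale, which is why a "10 point" gain is substantial.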
For people who understand languages like English, Mandarin, or Spanish, it may seem like today's apps and web tools already provide the translation technology we need. But billions of people are being left out -- unable to easily access most of the information on the internet or connect with most of the online world in their native language. Today's machine translation (MT) systems are improving rapidly, but they still rely heavily on learning from large amounts of textual data, so they generally work poorly for low-resource languages, i.e., languages that lack training data, and for languages without a standardized writing system. Eliminating language barriers would be profound, making it possible for billions of people to access information online in their native or preferred languages. Advances in MT won't just help people who don't speak one of the languages that dominate the internet today; they'll also fundamentally change the way people around the world connect and share ideas.
Even though Afaan Oromo is the most widely spoken language in the Cushitic family, used by more than fifty million people in the Horn of Africa and East Africa, it is surprisingly resource-scarce from a technological point of view. The growing volume of useful documents written in English motivates building a machine translation system that can make those documents easily accessible in the local language. This paper implements translation from English to Afaan Oromo and vice versa using neural machine translation, a task that remains under-explored due to the limited size and diversity of available corpora. Nonetheless, using a bilingual corpus of just over 40k sentence pairs that we collected, this study shows promising results. About a quarter of this corpus was collected via a Community Engagement Platform (CEP) implemented to enrich the parallel corpus through crowdsourced translations.
What if you didn't need English to translate? Meta's new and improved open-source AI model, NLLB-200, is capable of translating between 200 languages without going through English! "Communicating across languages is one superpower that AI provides, but as we keep advancing our AI work it's improving everything we do -- from showing the most interesting content on Facebook and Instagram, to recommending more relevant ads, to keeping our services safe for everyone", says Mark Zuckerberg, CEO, Meta. Accessibility through language ensures that the benefits of technological advancement reach everyone, no matter what language they speak. Tech companies are taking a proactive role in attempting to bridge this gap.