AITopics | Machine Translation

Machine Translation

"Machine translation (MT) is the application of computers to the task of translating texts from one natural language to another. One of the very earliest pursuits in computer science, MT has proved to be an elusive goal, but today a number of systems are available which produce output which, if not perfect, is of sufficient quality to be useful in a number of specific domains."
– Definition from the European Association for Machine Translation (EAMT).

You can translate text of your choice with free translators such as CAPITA, Google Translate, SDL International, and SYSTRAN.

CANTONMT: Investigating Back-Translation and Model-Switch Mechanisms for Cantonese-English Neural Machine Translation

Hong, Kung Yin, Han, Lifeng, Batista-Navarro, Riza, Nenadic, Goran

arXiv.org Artificial Intelligence, May-13-2024

This paper investigates the development and evaluation of machine translation models from Cantonese to English, where we propose a novel approach to tackle low-resource language translations. The main objectives of the study are to develop a model that can effectively translate Cantonese to English and evaluate it against state-of-the-art commercial models. To achieve this, a new parallel corpus has been created by combining different available corpora online with preprocessing and cleaning. In addition, a monolingual Cantonese dataset has been created through web scraping to aid the synthetic parallel corpus generation. Following the data collection process, several approaches, including fine-tuning models, back-translation, and model switch, have been used. The translation quality of models has been evaluated with multiple quality metrics, including lexicon-based metrics (SacreBLEU and hLEPOR) and embedding-space metrics (COMET and BERTscore). Based on the automatic metrics, the best model is selected and compared against the two best commercial translators using the human evaluation framework HOPES. The best model proposed in this investigation (NLLB-mBART) with model switch mechanisms has reached comparable and even better automatic evaluation scores against state-of-the-art commercial models (Bing and Baidu Translators), with a SacreBLEU score of 16.8 on our test set. Furthermore, an open-source web application has been developed to allow users to translate between Cantonese and English, with the different trained models available for effective comparisons between models from this investigation and users. CANTONMT is available at https://github.com/kenrickkung/CantoneseTranslation

computational linguistic, machine translation, translation, (13 more...)

2405.08172

Country:

Europe > United Kingdom > England > Greater Manchester > Manchester (0.14)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > China > Hong Kong (0.05)

Genre: Research Report > Promising Solution (0.48)

Industry:

Health & Medicine (0.67)
Information Technology > Security & Privacy (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

SoccerNet-Echoes: A Soccer Game Audio Commentary Dataset

Gautam, Sushant, Sarkhoosh, Mehdi Houshmand, Held, Jan, Midoglu, Cise, Cioppa, Anthony, Giancola, Silvio, Thambawita, Vajira, Riegler, Michael A., Halvorsen, Pål, Shah, Mubarak

arXiv.org Artificial Intelligence, May-12-2024

The application of Automatic Speech Recognition (ASR) technology in soccer offers numerous opportunities for sports analytics. Specifically, extracting audio commentaries with ASR provides valuable insights into the events of the game, and opens the door to several downstream applications such as automatic highlight generation. This paper presents SoccerNet-Echoes, an augmentation of the SoccerNet dataset with automatically generated transcriptions of audio commentaries from soccer game broadcasts, enhancing video content with rich layers of textual information derived from the game audio using ASR. These textual commentaries, generated using the Whisper model and translated with Google Translate, extend the usefulness of the SoccerNet dataset in diverse applications such as enhanced action spotting, automatic caption generation, and game summarization. By incorporating textual data alongside visual and auditory content, SoccerNet-Echoes aims to serve as a comprehensive resource for the development of algorithms specialized in capturing the dynamics of soccer games. We detail the methods involved in the curation of this dataset and the integration of ASR. We also highlight the implications of a multimodal approach in sports analytics, and how the enriched dataset can support diverse applications, thus broadening the scope of research and development in the field of sports analytics.

commentary, dataset, transcription, (14 more...)

2405.07354

Country:

Europe > Norway (0.05)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > Washington > King County > Seattle (0.04)

Genre:

Research Report (0.50)
Overview (0.46)

Industry: Leisure & Entertainment > Sports > Soccer (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.89)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.69)

NLP Progress in Indigenous Latin American Languages

Tonja, Atnafu Lambebo, Balouchzahi, Fazlourrahman, Butt, Sabur, Kolesnikova, Olga, Ceballos, Hector, Gelbukh, Alexander, Solorio, Thamar

arXiv.org Artificial Intelligence, May-12-2024

The paper focuses on the marginalization of indigenous language communities in the face of rapid technological advancements. We highlight the cultural richness of these languages and the risk they face of being overlooked in the realm of Natural Language Processing (NLP). We aim to bridge the gap between these communities and researchers, emphasizing the need for inclusive technological advancements that respect indigenous community perspectives. We present the NLP progress of indigenous Latin American languages, together with a survey that covers the status of indigenous languages in Latin America, their representation in NLP, and the challenges and innovations required for their preservation and development. The paper contributes to the current literature in understanding the need and progress of NLP for indigenous communities of Latin America, specifically low-resource and indigenous communities in general.

america, indigenous language, proceedings, (13 more...)

2404.05365

Country:

North America > Central America (0.46)
South America > Paraguay (0.14)
South America > Peru (0.05)

Genre:

Overview (1.00)
Questionnaire & Opinion Survey (0.67)
Research Report (0.64)

Industry: Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.93)

Akal Badi ya Bias: An Exploratory Study of Gender Bias in Hindi Language Technology

Hada, Rishav, Husain, Safiya, Gumma, Varun, Diddee, Harshita, Yadavalli, Aditya, Seth, Agrima, Kulkarni, Nidhi, Gadiraju, Ujwal, Vashistha, Aditya, Seshadri, Vivek, Bali, Kalika

arXiv.org Artificial Intelligence, May-10-2024

Existing research in measuring and mitigating gender bias predominantly centers on English, overlooking the intricate challenges posed by non-English languages and the Global South. This paper presents the first comprehensive study delving into the nuanced landscape of gender bias in Hindi, the third most spoken language globally. Our study employs diverse mining techniques, computational models, and field studies, and sheds light on the limitations of current methodologies. Given the challenges faced with mining gender-biased statements in Hindi using existing methods, we conducted field studies to bootstrap the collection of such sentences. Through field studies involving rural and low-income community women, we uncover diverse perceptions of gender bias, underscoring the necessity for context-specific approaches. This paper advocates for a community-centric research design, amplifying voices often marginalized in previous studies. Our findings not only contribute to the understanding of gender bias in Hindi but also establish a foundation for further exploration of Indic languages. By exploring the intricacies of this understudied context, we call for thoughtful engagement with gender bias, promoting inclusivity and equity in linguistic and cultural contexts beyond the Global North.

computational linguistic, gender bia, proceedings, (13 more...)

2405.06346

Country:

North America > United States > Washington > King County > Seattle (0.14)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
South America > Brazil > Rio de Janeiro > Rio de Janeiro (0.05)

Genre: Research Report > New Finding (1.00)

Industry:

Media > News (1.00)
Social Sector (0.66)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Communications > Social Media > Crowdsourcing (0.68)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)

The Ghanaian NLP Landscape: A First Look

Issaka, Sheriff, Zhang, Zhaoyi, Heda, Mihir, Wang, Keyi, Ajibola, Yinka, DeMar, Ryan, Du, Xuefeng

arXiv.org Artificial Intelligence, May-10-2024

Despite comprising one-third of global languages, African languages are critically underrepresented in Artificial Intelligence (AI), threatening linguistic diversity and cultural heritage. Ghanaian languages, in particular, face an alarming decline, with documented extinction and several at risk. This study pioneers a comprehensive survey of Natural Language Processing (NLP) research focused on Ghanaian languages, identifying methodologies, datasets, and techniques employed. Additionally, we create a detailed roadmap outlining challenges, best practices, and future directions, aiming to improve accessibility for researchers. This work serves as a foundational resource for Ghanaian NLP research and underscores the critical need for integrating global linguistic diversity into AI development.

arxiv, ghanaian language, machine translation, (10 more...)

2405.06818

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.14)
Africa > Ghana (0.07)
North America > United States > Wisconsin > Dane County > Madison (0.04)

Genre:

Overview (1.00)
Research Report (0.82)

Industry:

Government (0.46)
Law (0.46)
Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

SaudiBERT: A Large Language Model Pretrained on Saudi Dialect Corpora

Qarah, Faisal

arXiv.org Artificial Intelligence, May-10-2024

In this paper, we introduce SaudiBERT, a monodialect Arabic language model pretrained exclusively on Saudi dialectal text. To demonstrate the model's effectiveness, we compared SaudiBERT with six different multidialect Arabic language models across 11 evaluation datasets, which are divided into two groups: sentiment analysis and text classification. SaudiBERT achieved average F1-scores of 86.15% and 87.86% in these groups respectively, significantly outperforming all other comparative models. Additionally, we present two novel Saudi dialectal corpora: the Saudi Tweets Mega Corpus (STMC), which contains over 141 million tweets in Saudi dialect, and the Saudi Forums Corpus (SFC), which includes 15.2 GB of text collected from five Saudi online forums. Both corpora are used in pretraining the proposed model, and they are the largest Saudi dialectal corpora ever reported in the literature. The results confirm the effectiveness of SaudiBERT in understanding and analyzing Arabic text expressed in Saudi dialect, achieving state-of-the-art results in most tasks and surpassing other language models included in the study. The SaudiBERT model is publicly available at https://huggingface.co/faisalq/SaudiBERT.

dialect, language model, saudibert, (15 more...)

2405.06239

Country:

Indian Ocean > Arabian Gulf (0.04)
Asia > Middle East > Saudi Arabia > Arabian Gulf (0.04)
Africa > Sudan (0.04)

Genre:

Overview (1.00)
Research Report > New Finding (0.66)

Industry:

Information Technology > Services (1.00)
Banking & Finance (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Does Whisper understand Swiss German? An automatic, qualitative, and human evaluation

Dolev, Eyal Liron, Lutz, Clemens Fidel, Aepli, Noëmi

arXiv.org Artificial Intelligence, May-9-2024

Whisper is a state-of-the-art automatic speech recognition (ASR) model (Radford et al., 2022). Although Swiss German dialects are allegedly not part of Whisper's training data, preliminary experiments showed that Whisper can transcribe Swiss German quite well, with the output being a speech translation into Standard German. To gain a better understanding of Whisper's performance on Swiss German, we systematically evaluate it using automatic, qualitative, and human evaluation. We test its performance on three existing test sets: SwissDial (Dogan-Schönberger et al., 2021), STT4SG-350 (Plüss et al., 2023), and Swiss Parliaments Corpus (Plüss et al., 2021). In addition, we create a new test set for this work, based on short mock clinical interviews. For automatic evaluation, we used word error rate (WER) and BLEU. In the qualitative analysis, we discuss Whisper's strengths and weaknesses and analyze some output examples. For the human evaluation, we conducted a survey with 28 participants who were asked to evaluate Whisper's performance. All of our evaluations suggest that Whisper is a viable ASR system for Swiss German, so long as the Standard German output is desired.
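
For readers unfamiliar with the word error rate (WER) named in the abstract above, the following is a minimal sketch of the standard definition: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. It is an illustration only and not the authors' evaluation code. The function name `word_error_rate` and the plain whitespace tokenization are assumptions; practical evaluations typically normalize casing and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()   # assumption: whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One of four reference words is dropped in the hypothesis.
    print(word_error_rate("das ist ein test", "das ist test"))
```

Because the denominator is the reference length, WER can exceed 1.0 when the hypothesis contains many insertions.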

corpus, evaluation, translation, (16 more...)

2404.19310

Country:

Europe > Switzerland > Zürich > Zürich (0.05)
Europe > Switzerland > Basel-City > Basel (0.04)
Europe > Austria > Vienna (0.04)

Genre: Research Report (0.84)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.89)

Natural Language Processing RELIES on Linguistics

Opitz, Juri, Wein, Shira, Schneider, Nathan

arXiv.org Artificial Intelligence, May-9-2024

Large Language Models (LLMs) have become capable of generating highly fluent text in certain languages, without modules specially designed to capture grammar or semantic coherence. What does this mean for the future of linguistic expertise in NLP? We highlight several aspects in which NLP (still) relies on linguistics, or where linguistic thinking can illuminate new directions. We argue our case around the acronym RELIES that encapsulates six major facets where linguistics contributes to NLP: Resources, Evaluation, Low-resource settings, Interpretability, Explanation, and the Study of language. This list is not exhaustive, nor is linguistics the main point of reference for every effort under these themes; but at a macro level, these facets highlight the enduring importance of studying machine systems vis-a-vis systems of human language.

computational linguistic, linguistic, linguistics, (17 more...)

2405.05966

Country:

North America > United States > Washington > King County > Seattle (0.14)
Europe > Ireland > Leinster > County Dublin > Dublin (0.05)
North America > Canada > Ontario > Toronto (0.05)

Genre: Overview (0.93)

Industry: Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Using Machine Translation to Augment Multilingual Classification

King, Adam

arXiv.org Artificial Intelligence, May-8-2024

An all-too-present bottleneck for text classification model development is the need to annotate training data and this need is multiplied for multilingual classifiers. Fortunately, contemporary machine translation models are both easily accessible and have dependable translation quality, making it possible to translate labeled training data from one language into another. Here, we explore the effects of using machine translation to fine-tune a multilingual model for a classification task across multiple languages. We also investigate the benefits of using a novel technique, originally proposed in the field of image captioning, to account for potential negative effects of tuning models on translated data. We show that translated data are of sufficient quality to tune multilingual classifiers and that this novel loss technique is able to offer some improvement over models tuned without it.
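
The idea of translating labeled training data into other languages, as described in the abstract above, can be sketched roughly as follows. This is a hedged illustration rather than the paper's pipeline: the helper `augment_with_translations`, the `Example` tuple layout, and the `toy_translate` stand-in are all hypothetical, and a real setup would pass in an actual machine translation system. The loss technique the abstract mentions is not shown, since the abstract does not describe it.

```python
from typing import Callable, List, Sequence, Tuple

# (text, label, language code); this layout is an assumption for illustration.
Example = Tuple[str, str, str]
Translator = Callable[[str, str, str], str]  # (text, source_lang, target_lang)


def augment_with_translations(
    examples: Sequence[Example],
    translate: Translator,
    target_langs: Sequence[str],
) -> List[Example]:
    """Return the originals plus one translated copy per other target language.

    Labels are carried over unchanged, which assumes translation
    preserves the property being classified.
    """
    augmented: List[Example] = list(examples)
    for text, label, source_lang in examples:
        for target_lang in target_langs:
            if target_lang == source_lang:
                continue  # nothing to translate
            translated = translate(text, source_lang, target_lang)
            augmented.append((translated, label, target_lang))
    return augmented


def toy_translate(text: str, source_lang: str, target_lang: str) -> str:
    """Hypothetical placeholder; it only tags the text with the target language."""
    return f"[{target_lang}] {text}"


if __name__ == "__main__":
    data = [("good product", "pos", "en"), ("mal servicio", "neg", "es")]
    for row in augment_with_translations(data, toy_translate, ["en", "es", "de"]):
        print(row)
```

Passing the translator in as a function keeps the sketch independent of any particular translation service.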

dataset, otc loss, translation, (13 more...)

2405.05478

Country:

Europe > Portugal > Lisbon > Lisbon (0.04)
Europe > Finland > Uusimaa > Helsinki (0.04)
Asia > China > Guangdong Province > Shenzhen (0.04)

Genre: Research Report (0.84)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Revisiting character-level adversarial attacks

Rocamora, Elias Abad, Wu, Yongtao, Liu, Fanghui, Chrysos, Grigorios G., Cevher, Volkan

arXiv.org Machine Learning, May-7-2024

Adversarial attacks in Natural Language Processing apply perturbations at the character or token level. Token-level attacks, gaining prominence for their use of gradient-based methods, are susceptible to altering sentence semantics, leading to invalid adversarial examples. While character-level attacks easily maintain semantics, they have received less attention as they cannot easily adopt popular gradient-based methods, and are thought to be easy to defend. Challenging these beliefs, we introduce Charmer, an efficient query-based adversarial attack capable of achieving high attack success rate (ASR) while generating highly similar adversarial examples. Our method successfully targets both small (BERT) and large (Llama 2) models. Specifically, on BERT with SST-2, Charmer improves the ASR by 4.84 percentage points and the USE similarity by 8 percentage points with respect to the previous art. Our implementation is available at https://github.com/LIONS-EPFL/Charmer.

asr, charmer, dataset, (13 more...)

2405.04346

Country:

Europe > Austria > Vienna (0.14)
Europe > France > Grand Est > Bas-Rhin > Strasbourg (0.04)
Europe > Germany > Rheinland-Pfalz > Mainz (0.04)

Genre: Research Report (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Government > Military (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
