AITopics | Machine Translation


Machine Translation

"Machine translation (MT) is the application of computers to the task of translating texts from one natural language to another. One of the very earliest pursuits in computer science, MT has proved to be an elusive goal, but today a number of systems are available which produce output which, if not perfect, is of sufficient quality to be useful in a number of specific domains."
– Definition from the European Association for Machine Translation (EAMT).

You can translate text of your choice by using free translators such as: CAPITA, Google Translate, SDL International, SYSTRAN.


FASSILA: A Corpus for Algerian Dialect Fake News Detection and Sentiment Analysis

Abdedaiem, Amin, Dahou, Abdelhalim Hafedh, Cheragui, Mohamed Amine, Mathiak, Brigitte

arXiv.org Artificial Intelligence, Nov-7-2024

Building corpora has become an important topic in natural language processing (NLP), especially for low-resource languages such as Algerian dialect (AD), because corpora underpin the development of several tools, such as machine translation Babaali and Salem [2022], part-of-speech tagging Chiche and Yitagesu [2022], and named entity recognition Jarrar et al. [2022]. This is particularly true with the emergence of techniques based on statistics, machine learning, and deep learning, which exploit this mass of information to develop, train, and evaluate models. However, building a corpus is not an easy task Bakari et al. [2016]: it is extremely time-consuming and labor-intensive, because both the volume and the quality of the corpus matter, despite the recent emergence of techniques that consume fewer resources, such as few-shot learning Tunstall et al. [2022]. Over the last few years, many NLP studies have focused on low-resource languages or language variants Mengoni and Santucci [2023]. This change of direction is mainly due to the emergence of social media such as Facebook, Twitter, RenRen, LinkedIn, Google+, and Tuenti as a means of communication where people exchange messages and comments.

algerian dialect, corpus, dialect, (14 more...)

arXiv.org Artificial Intelligence

doi: 10.1016/j.procs.2024.10.214

2411.04604

Country:

Africa > Middle East > Algeria > Adrar Province > Adrar (0.04)
Europe > Germany (0.04)
North America > United States (0.04)
(3 more...)

Genre: Research Report > New Finding (1.00)

Industry: Media > News (0.86)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)


From Word Vectors to Multimodal Embeddings: Techniques, Applications, and Future Directions For Large Language Models

Zhang, Charles, Peng, Benji, Sun, Xintian, Niu, Qian, Liu, Junyu, Chen, Keyu, Li, Ming, Feng, Pohsun, Bi, Ziqian, Liu, Ming, Zhang, Yichao, Fei, Cheng, Yin, Caitlyn Heqi, Yan, Lawrence KQ, Wang, Tianyang

arXiv.org Artificial Intelligence, Nov-6-2024

Word embeddings and language models have transformed natural language processing (NLP) by facilitating the representation of linguistic elements in continuous vector spaces. This review visits foundational concepts such as the distributional hypothesis and contextual similarity, tracing the evolution from sparse representations like one-hot encoding to dense embeddings including Word2Vec, GloVe, and fastText. We examine both static and contextualized embeddings, underscoring advancements in models such as ELMo, BERT, and GPT and their adaptations for cross-lingual and personalized applications. The discussion extends to sentence and document embeddings, covering aggregation methods and generative topic models, along with the application of embeddings in multimodal domains, including vision, robotics, and cognitive science. Advanced topics such as model compression, interpretability, numerical encoding, and bias mitigation are analyzed, addressing both technical challenges and ethical implications. Additionally, we identify future research directions, emphasizing the need for scalable training techniques, enhanced interpretability, and robust grounding in non-textual modalities. By synthesizing current methodologies and emerging trends, this survey offers researchers and practitioners an in-depth resource to push the boundaries of embedding-based language models.
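The contrast between sparse one-hot vectors and dense embeddings mentioned in this abstract can be illustrated with a minimal sketch. The three-word vocabulary and the two-dimensional dense vectors below are made-up values chosen for illustration, not output from Word2Vec, GloVe, or fastText.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vocab = ["king", "queen", "apple"]

# One-hot: each word gets its own axis, so distinct words are always orthogonal.
one_hot = {
    w: [1.0 if i == j else 0.0 for j in range(len(vocab))]
    for i, w in enumerate(vocab)
}

# Dense: hypothetical 2-d vectors in which related words point the same way.
dense = {
    "king": [0.9, 0.8],
    "queen": [0.85, 0.9],
    "apple": [-0.7, 0.1],
}

sim_one_hot = cosine(one_hot["king"], one_hot["queen"])
sim_dense_related = cosine(dense["king"], dense["queen"])
sim_dense_unrelated = cosine(dense["king"], dense["apple"])
```

With one-hot vectors every pair of distinct words has similarity zero, whereas the dense vectors can place related words closer together than unrelated ones.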

information retrieval, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2411.05036

Country:

North America > United States > Wisconsin > Dane County > Madison (0.04)
North America > United States > Texas (0.04)
North America > Canada (0.04)
(5 more...)

Genre:

Research Report (1.00)
Overview (1.00)

Industry:

Health & Medicine > Therapeutic Area > Neurology (1.00)
Information Technology (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)


A Multilingual Sentiment Lexicon for Low-Resource Language Translation using Large Languages Models and Explainable AI

Malinga, Melusi, Lupanda, Isaac, Nkongolo, Mike Wa, van Deventer, Phil

arXiv.org Artificial Intelligence, Nov-6-2024

South Africa and the Democratic Republic of Congo (DRC) present a complex linguistic landscape with languages such as Zulu, Sepedi, Afrikaans, French, English, and Tshiluba (Ciluba), which creates unique challenges for AI-driven translation and sentiment analysis systems due to a lack of accurately labeled data. This study seeks to address these challenges by developing a multilingual lexicon designed for French and Tshiluba, now expanded to include translations in English, Afrikaans, Sepedi, and Zulu. The lexicon enhances cultural relevance in sentiment classification by integrating language-specific sentiment scores. A comprehensive testing corpus is created to support translation and sentiment analysis tasks, with machine learning models such as Random Forest, Support Vector Machine (SVM), Decision Trees, and Gaussian Naive Bayes (GNB) trained to predict sentiment across low resource languages (LRLs). Among them, the Random Forest model performed particularly well, capturing sentiment polarity and handling language-specific nuances effectively. Furthermore, Bidirectional Encoder Representations from Transformers (BERT), a Large Language Model (LLM), is applied to predict context-based sentiment with high accuracy, achieving 99% accuracy and 98% precision, outperforming other models. The BERT predictions were clarified using Explainable AI (XAI), improving transparency and fostering confidence in sentiment classification. Overall, findings demonstrate that the proposed lexicon and machine learning models significantly enhance translation and sentiment analysis for LRLs in South Africa and the DRC, laying a foundation for future AI models that support underrepresented languages, with applications across education, governance, and business in multilingual contexts.

sentiment, sentiment analysis, sentiment score, (16 more...)

arXiv.org Artificial Intelligence

2411.04316

Country:

Africa > Democratic Republic of the Congo (0.54)
Africa > South Africa > Gauteng > Pretoria (0.04)
Europe > Switzerland (0.04)
Asia > Indonesia > Bali (0.04)

Genre: Research Report > New Finding (0.65)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Extraction (1.00)
(5 more...)


Self-Compositional Data Augmentation for Scientific Keyphrase Generation

Houbre, Mael, Boudin, Florian, Daille, Beatrice, Aizawa, Akiko

arXiv.org Artificial Intelligence, Nov-6-2024

State-of-the-art models for keyphrase generation require large amounts of training data to achieve good performance. However, obtaining keyphrase-labeled documents can be challenging and costly. To address this issue, we present a self-compositional data augmentation method. More specifically, we measure the relatedness of training documents based on their shared keyphrases, and combine similar documents to generate synthetic samples. The advantage of our method lies in its ability to create additional training samples that keep domain coherence, without relying on external data or resources. Our results on multiple datasets spanning three different domains demonstrate that our method consistently improves keyphrase generation. A qualitative analysis of the generated keyphrases for the Computer Science domain confirms this improvement with respect to their representativeness.
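The idea of pairing training documents by shared keyphrases can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say which relatedness measure or combination rule the authors use, so Jaccard overlap, the `threshold` value, and plain text concatenation are all hypothetical choices.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard overlap between two keyphrase collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def augment(docs, threshold=0.3):
    """Build synthetic samples from pairs of related documents.

    docs: list of (text, keyphrases) tuples.
    Returns a list of (combined_text, merged_keyphrases) tuples.
    """
    synthetic = []
    for (text_a, keys_a), (text_b, keys_b) in combinations(docs, 2):
        # Assumption: two documents are "related" when keyphrase overlap is high enough.
        if jaccard(keys_a, keys_b) >= threshold:
            merged_keys = sorted(set(keys_a) | set(keys_b))
            synthetic.append((text_a + " " + text_b, merged_keys))
    return synthetic

docs = [
    ("Paper on neural keyphrase generation.", ["keyphrase generation", "neural networks"]),
    ("Paper on keyphrase extraction.", ["keyphrase generation", "information retrieval"]),
    ("Paper on protein folding.", ["protein structure"]),
]
samples = augment(docs)
```

Only the first two toy documents share a keyphrase, so they are the only pair combined. No external data is used, which matches the property the abstract emphasizes.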

computational linguistic, keyphrase, proceedings, (13 more...)

arXiv.org Artificial Intelligence

doi: 10.1145/3677389.3702504

2411.03039

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
North America > United States > Washington > King County > Seattle (0.14)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
(26 more...)

Genre: Research Report > New Finding (0.88)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)


Multilingual hierarchical classification of job advertisements for job vacancy statistics

Beręsewicz, Maciej, Wydmuch, Marek, Cherniaiev, Herman, Pater, Robert

arXiv.org Machine Learning, Nov-6-2024

The goal of this paper is to develop a multilingual classifier and conditional probability estimator of occupation codes for online job advertisements in accordance with the International Standard Classification of Occupations (ISCO) extended with the Polish Classification of Occupations and Specializations (KZiS), which is analogous to the European Classification of Occupations. In this paper, we utilise a range of data sources, including a novel one, namely the Central Job Offers Database, which is a register of all vacancies submitted to Public Employment Offices. Their staff members code the vacancies according to the ISCO and KZiS. A hierarchical multi-class classifier has been developed based on the transformer architecture. The classifier begins by encoding the jobs found in advertisements to the widest 1-digit occupational group, and then narrows the assignment to a 6-digit occupation code. We show that incorporation of the hierarchical structure of occupations improves prediction accuracy by 1-2 percentage points, particularly for the hand-coded online job advertisements. Finally, a bilingual (Polish and English) and multilingual (24 languages) model is developed based on data translated using closed and open-source software. The open-source software is provided for the benefit of the official statistics community, with a particular focus on international comparability.
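The top-down assignment described in this abstract (a 1-digit group first, then a 6-digit code) can be sketched as a chain of conditional probabilities. This is a simplified illustration, not the authors' transformer model: it collapses the hierarchy to two levels, and the probabilities and occupation codes are made-up placeholders.

```python
def predict_code(p_group, p_code_given_group):
    """Greedy top-down prediction over a two-level occupation hierarchy.

    p_group: dict mapping 1-digit group -> P(group | advertisement)
    p_code_given_group: dict mapping group -> {6-digit code -> P(code | group, advertisement)}
    Returns the chosen 6-digit code and its joint probability.
    """
    # Step 1: pick the most probable 1-digit occupational group.
    group = max(p_group, key=p_group.get)
    # Step 2: within that group, pick the most probable 6-digit code.
    codes = p_code_given_group[group]
    code = max(codes, key=codes.get)
    # Chain rule: P(code) = P(group) * P(code | group).
    return code, p_group[group] * codes[code]

# Hypothetical model outputs for one advertisement (placeholder codes).
p_group = {"2": 0.7, "5": 0.3}
p_code_given_group = {
    "2": {"251401": 0.6, "251201": 0.4},
    "5": {"522301": 1.0},
}
code, prob = predict_code(p_group, p_code_given_group)
```

Restricting the second step to codes inside the chosen group is what lets the hierarchy constrain the prediction.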

advertisement, classification, dataset, (16 more...)

arXiv.org Machine Learning

2411.03779

Country:

Europe > United Kingdom (0.28)
Europe > Poland > Greater Poland Province > Poznań (0.04)
Europe > Poland > Masovia Province > Warsaw (0.04)
(7 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.92)

Industry:

Marketing (1.00)
Education (0.92)
Government > Regional Government > Europe Government (0.46)

Technology:

Information Technology > Software (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(4 more...)


Grounding Natural Language to SQL Translation with Data-Based Self-Explanations

Fan, Yuankai, Ren, Tonghui, Huang, Can, He, Zhenying, Wang, X. Sean

arXiv.org Artificial Intelligence, Nov-5-2024

Natural Language Interfaces for Databases empower non-technical users to interact with data using natural language (NL). Advanced approaches, utilizing either neural sequence-to-sequence or more recent sophisticated large-scale language models, typically implement NL to SQL (NL2SQL) translation in an end-to-end fashion. However, like humans, these end-to-end translation models may not always generate the best SQL output on their first try. In this paper, we propose CycleSQL, an iterative framework designed for end-to-end translation models to autonomously generate the best output through self-evaluation. The main idea of CycleSQL is to introduce data-grounded NL explanations of query results as self-provided feedback, and use the feedback to validate the correctness of the translation iteratively, hence improving the overall translation accuracy. Extensive experiments, including quantitative and qualitative evaluations, are conducted to study CycleSQL by applying it to seven existing translation models on five widely used benchmarks. The results show that 1) the feedback loop introduced in CycleSQL can consistently improve the performance of existing models, and in particular, by applying CycleSQL to RESDSQL, obtains a translation accuracy of 82.0% (+2.6%) on the validation set, and 81.6% (+3.2%) on the test set of Spider benchmark; 2) the generated NL explanations can also provide insightful information for users, aiding in the comprehension of translation results and consequently enhancing the interpretability of NL2SQL translation.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2411.02948

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.05)
North America > Aruba (0.04)
North America > Anguilla (0.04)
(13 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Transportation > Passenger (1.00)
Transportation > Air (1.00)
Aerospace & Defense > Aircraft (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)


Mitigating Metric Bias in Minimum Bayes Risk Decoding

Kovacs, Geza, Deutsch, Daniel, Freitag, Markus

arXiv.org Artificial Intelligence, Nov-5-2024

While Minimum Bayes Risk (MBR) decoding using metrics such as COMET or MetricX has outperformed traditional decoding methods such as greedy or beam search, it introduces a challenge we refer to as metric bias. As MBR decoding aims to produce translations that score highly according to a specific utility metric, this very process makes it impossible to use the same metric for both decoding and evaluation, as improvements might simply be due to reward hacking rather than reflecting real quality improvements. In this work we find that compared to human ratings, neural metrics not only overestimate the quality of MBR decoding when the same metric is used as the utility metric, but they also overestimate the quality of MBR/QE decoding with other neural utility metrics as well. We also show that the metric bias issue can be mitigated by using an ensemble of utility metrics during MBR decoding: human evaluations show that MBR decoding using an ensemble of utility metrics outperforms a single utility metric.
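MBR decoding with an ensemble of utility metrics can be sketched as below. This is a minimal illustration under stated assumptions: the two lexical overlap functions are simple stand-ins for neural metrics such as COMET or MetricX, the ensemble is a plain average, and the other candidates serve as pseudo-references. At least two candidates are required.

```python
def unigram_f1(hyp, ref):
    """Clipped unigram F1 between two whitespace-tokenized strings."""
    h, r = hyp.split(), ref.split()
    common = sum(min(h.count(t), r.count(t)) for t in set(h))
    if common == 0:
        return 0.0
    precision, recall = common / len(h), common / len(r)
    return 2 * precision * recall / (precision + recall)

def jaccard_overlap(hyp, ref):
    """Jaccard overlap between the token sets of two strings."""
    h, r = set(hyp.split()), set(ref.split())
    return len(h & r) / len(h | r) if h | r else 0.0

def mbr_decode(candidates, utilities):
    """Return the candidate with the highest expected ensemble utility."""
    def expected_utility(i):
        # Every other candidate acts as a pseudo-reference.
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(
            sum(u(candidates[i], ref) for u in utilities) / len(utilities)
            for ref in others
        ) / len(others)

    best = max(range(len(candidates)), key=expected_utility)
    return candidates[best]

candidates = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "a dog ran away",
]
best = mbr_decode(candidates, [unigram_f1, jaccard_overlap])
```

Averaging several utilities means no single metric fully determines the selection, which is the intuition behind using an ensemble to reduce metric bias.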

ensemble, mbr qe, rankavg, (13 more...)

arXiv.org Artificial Intelligence

2411.03524

Country:

Asia > Singapore (0.04)
Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
North America > Mexico > Mexico City > Mexico City (0.04)
(10 more...)

Genre: Research Report > New Finding (0.47)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.93)


Language Models and Cycle Consistency for Self-Reflective Machine Translation

Wangni, Jianqiao

arXiv.org Machine Learning, Nov-4-2024

This paper introduces a novel framework that leverages large language models (LLMs) for machine translation (MT). We start with one conjecture: an ideal translation should contain complete and accurate information for a strong enough LLM to recover the original sentence. We generate multiple translation candidates from a source language A to a target language B, and subsequently translate these candidates back to the original language A. By evaluating the cycle consistency between the original and back-translated sentences using metrics such as token-level precision and accuracy, we implicitly estimate the translation quality in language B, without knowing its ground truth. This also helps to evaluate the LLM translation capability using only monolingual corpora. For each source sentence, we identify the translation candidate with optimal cycle consistency with the original sentence as the final answer. Our experiments demonstrate that larger LLMs, or the same LLM with more forward passes during inference, exhibit increased cycle consistency, aligning with the LLM model size scaling law [Kaplan et al. (2020)] and test-time computation scaling law [Snell et al. (2024)]. This work provides methods 1) to implicitly evaluate the translation quality of a sentence in the target language, 2) to evaluate the capability of an LLM for any-to-any-language translation, and 3) to generate a better translation for a specific LLM.
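Candidate selection by cycle consistency can be sketched as follows. This is an illustration under stated assumptions: the `BACK` lookup table is a toy stand-in for an LLM back-translation call, and `token_precision` is one plausible reading of token-level precision, not necessarily the paper's exact definition.

```python
def token_precision(original, back_translated):
    """Fraction of back-translated tokens that also occur in the original (clipped counts)."""
    remaining = original.lower().split()
    back_tokens = back_translated.lower().split()
    if not back_tokens:
        return 0.0
    hits = 0
    for token in back_tokens:
        if token in remaining:
            remaining.remove(token)  # each original token can be matched only once
            hits += 1
    return hits / len(back_tokens)

def select_by_cycle_consistency(source, candidates, back_translate):
    """Pick the candidate whose back-translation best matches the source sentence."""
    scored = [(token_precision(source, back_translate(c)), c) for c in candidates]
    return max(scored, key=lambda pair: pair[0])[1]

# Hypothetical back-translations (language B -> language A) for two candidates.
BACK = {
    "le chat dort": "the cat sleeps",
    "le chat est endormi": "the cat is asleep",
}
source = "the cat sleeps"
best = select_by_cycle_consistency(source, list(BACK), BACK.get)
```

No reference translation in the target language is needed: the score depends only on the source sentence and its round trip.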

consistency, cycle consistency, translation, (12 more...)

arXiv.org Machine Learning

2411.02791

Country: Asia > China > Beijing > Beijing (0.04)

Genre: Research Report (0.82)

Industry: Leisure & Entertainment (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)


MoCE: Adaptive Mixture of Contextualization Experts for Byte-based Neural Machine Translation

Huang, Langlin, Bu, Mengyu, Feng, Yang

arXiv.org Artificial Intelligence, Nov-3-2024

Neural Machine Translation (NMT) is a consistently hot research topic, and recent years have seen the growing significance of multilingual language modeling (Zhang et al., 2023). The selection of tokenization and vocabulary is critical to multilingual language models, which plays an important role in vectorization of texts and discretization of predicted hidden states. While some models (Costa-jussà et al., 2022; Dubey et al., 2024) use large vocabularies to ensure word coverage, others (Touvron et al., 2023; Jiang et al., 2023) opt for byte fallback strategy. This approach allows ...

MSC (Huang and Feng, 2024) argues that a byte should contribute to multiple neighboring contexts, necessitating a multi-scale contextualization approach. To this end, MSC groups hidden state dimensions and assigns CNNs with different kernel sizes to each group. Although MSC provides an effective framework for modeling multi-scale contextualization and achieved state-of-the-art performance, it suffers from a significant limitation: the scales are manually predefined. This reduces the model's ability to generalize to multilingual scenarios, particularly in massively multilingual machine translation, which may involve over 50 languages.

artificial intelligence, computational linguistic, natural language, (16 more...)

arXiv.org Artificial Intelligence

2411.01474

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Thailand > Bangkok > Bangkok (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
(12 more...)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)


MetaMetrics-MT: Tuning Meta-Metrics for Machine Translation via Human Preference Calibration

Anugraha, David, Kuwanto, Garry, Susanto, Lucky, Wijaya, Derry Tanti, Winata, Genta Indra

arXiv.org Artificial Intelligence, Nov-1-2024

We present MetaMetrics-MT, an innovative metric designed to evaluate machine translation (MT) tasks by aligning closely with human preferences through Bayesian optimization with Gaussian Processes. MetaMetrics-MT enhances existing MT metrics by optimizing their correlation with human judgments. Our experiments on the WMT24 metric shared task dataset demonstrate that MetaMetrics-MT outperforms all existing baselines, setting a new benchmark for state-of-the-art performance in the reference-based setting. Furthermore, it achieves comparable results to leading metrics in the reference-free setting, offering greater efficiency.

artificial intelligence, etric -mt, natural language, (16 more...)

arXiv.org Artificial Intelligence

2411.0039

Country:

North America > Canada > Ontario > Toronto (0.14)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Asia > Indonesia (0.04)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
