Machine Translation
FASSILA: A Corpus for Algerian Dialect Fake News Detection and Sentiment Analysis
Abdedaiem, Amin, Dahou, Abdelhalim Hafedh, Cheragui, Mohamed Amine, Mathiak, Brigitte
Building a corpus become an important topic in natural language processing (NLP) and especially for low resource languages (ex: AD), due to the importance that the corpus plays in the development of several tools, such as: Machine Translation Babaali and Salem [2022], Part of speech tagging Chiche and Yitagesu [2022], Named entities recognition Jarrar et al. [2022], etc. in particular with the emergence of techniques based on statistics, machine learning and deep learning. Who exploits this mass of information to develop, train and evaluate models. However, building a corpus is not an easy task Bakari et al. [2016]; it is extremely time-consuming and requires a lot of work, for the good reason that the volume and quality of the corpus are two important parameters. Despite the recent emergence of techniques that consume fewer resources, such as few-shot learning Tunstall et al. [2022]. Over the last few years, a lot of studies in NLP have focused on languages or variants of languages called low resources Mengoni and Santucci [2023]. This change of direction is mainly due to the emergence of social media such as Facebook, Twitter, RenRen, LinkedIn, Google+, and Tuenti, as a means of communication where people exchange messages and comments.
From Word Vectors to Multimodal Embeddings: Techniques, Applications, and Future Directions For Large Language Models
Zhang, Charles, Peng, Benji, Sun, Xintian, Niu, Qian, Liu, Junyu, Chen, Keyu, Li, Ming, Feng, Pohsun, Bi, Ziqian, Liu, Ming, Zhang, Yichao, Fei, Cheng, Yin, Caitlyn Heqi, Yan, Lawrence KQ, Wang, Tianyang
Word embeddings and language models have transformed natural language processing (NLP) by facilitating the representation of linguistic elements in continuous vector spaces. This review visits foundational concepts such as the distributional hypothesis and contextual similarity, tracing the evolution from sparse representations like one-hot encoding to dense embeddings including Word2Vec, GloVe, and fastText. We examine both static and contextualized embeddings, underscoring advancements in models such as ELMo, BERT, and GPT and their adaptations for cross-lingual and personalized applications. The discussion extends to sentence and document embeddings, covering aggregation methods and generative topic models, along with the application of embeddings in multimodal domains, including vision, robotics, and cognitive science. Advanced topics such as model compression, interpretability, numerical encoding, and bias mitigation are analyzed, addressing both technical challenges and ethical implications. Additionally, we identify future research directions, emphasizing the need for scalable training techniques, enhanced interpretability, and robust grounding in non-textual modalities. By synthesizing current methodologies and emerging trends, this survey offers researchers and practitioners an in-depth resource to push the boundaries of embedding-based language models.
A Multilingual Sentiment Lexicon for Low-Resource Language Translation using Large Languages Models and Explainable AI
Malinga, Melusi, Lupanda, Isaac, Nkongolo, Mike Wa, van Deventer, Phil
South Africa and the Democratic Republic of Congo (DRC) present a complex linguistic landscape with languages such as Zulu, Sepedi, Afrikaans, French, English, and Tshiluba (Ciluba), which creates unique challenges for AI-driven translation and sentiment analysis systems due to a lack of accurately labeled data. This study seeks to address these challenges by developing a multilingual lexicon designed for French and Tshiluba, now expanded to include translations in English, Afrikaans, Sepedi, and Zulu. The lexicon enhances cultural relevance in sentiment classification by integrating language-specific sentiment scores. A comprehensive testing corpus is created to support translation and sentiment analysis tasks, with machine learning models such as Random Forest, Support Vector Machine (SVM), Decision Trees, and Gaussian Naive Bayes (GNB) trained to predict sentiment across low resource languages (LRLs). Among them, the Random Forest model performed particularly well, capturing sentiment polarity and handling language-specific nuances effectively. Furthermore, Bidirectional Encoder Representations from Transformers (BERT), a Large Language Model (LLM), is applied to predict context-based sentiment with high accuracy, achieving 99% accuracy and 98% precision, outperforming other models. The BERT predictions were clarified using Explainable AI (XAI), improving transparency and fostering confidence in sentiment classification. Overall, findings demonstrate that the proposed lexicon and machine learning models significantly enhance translation and sentiment analysis for LRLs in South Africa and the DRC, laying a foundation for future AI models that support underrepresented languages, with applications across education, governance, and business in multilingual contexts.
Self-Compositional Data Augmentation for Scientific Keyphrase Generation
Houbre, Mael, Boudin, Florian, Daille, Beatrice, Aizawa, Akiko
State-of-the-art models for keyphrase generation require large amounts of training data to achieve good performance. However, obtaining keyphrase-labeled documents can be challenging and costly. To address this issue, we present a self-compositional data augmentation method. More specifically, we measure the relatedness of training documents based on their shared keyphrases, and combine similar documents to generate synthetic samples. The advantage of our method lies in its ability to create additional training samples that keep domain coherence, without relying on external data or resources. Our results on multiple datasets spanning three different domains, demonstrate that our method consistently improves keyphrase generation. A qualitative analysis of the generated keyphrases for the Computer Science domain confirms this improvement towards their representativity property.
Multilingual hierarchical classification of job advertisements for job vacancy statistics
Beręsewicz, Maciej, Wydmuch, Marek, Cherniaiev, Herman, Pater, Robert
The goal of this paper is to develop a multilingual classifier and conditional probability estimator of occupation codes for online job advertisements according in accordance with the International Standard Classification of Occupations (ISCO) extended with the Polish Classification of Occupations and Specializations (KZiS), which is analogous to the European Classification of Occupations. In this paper, we utilise a range of data sources, including a novel one, namely the Central Job Offers Database, which is a register of all vacancies submitted to Public Employment Offices. Their staff members code the vacancies according to the ISCO and KZiS. A hierarchical multi-class classifier has been developed based on the transformer architecture. The classifier begins by encoding the jobs found in advertisements to the widest 1-digit occupational group, and then narrows the assignment to a 6-digit occupation code. We show that incorporation of the hierarchical structure of occupations improves prediction accuracy by 1-2 percentage points, particularly for the hand-coded online job advertisements. Finally, a bilingual (Polish and English) and multilingual (24 languages) model is developed based on data translated using closed and open-source software. The open-source software is provided for the benefit of the official statistics community, with a particular focus on international comparability.
Grounding Natural Language to SQL Translation with Data-Based Self-Explanations
Fan, Yuankai, Ren, Tonghui, Huang, Can, He, Zhenying, Wang, X. Sean
Natural Language Interfaces for Databases empower non-technical users to interact with data using natural language (NL). Advanced approaches, utilizing either neural sequence-to-sequence or more recent sophisticated large-scale language models, typically implement NL to SQL (NL2SQL) translation in an end-to-end fashion. However, like humans, these end-to-end translation models may not always generate the best SQL output on their first try. In this paper, we propose CycleSQL, an iterative framework designed for end-to-end translation models to autonomously generate the best output through self-evaluation. The main idea of CycleSQL is to introduce data-grounded NL explanations of query results as self-provided feedback, and use the feedback to validate the correctness of the translation iteratively, hence improving the overall translation accuracy. Extensive experiments, including quantitative and qualitative evaluations, are conducted to study CycleSQL by applying it to seven existing translation models on five widely used benchmarks. The results show that 1) the feedback loop introduced in CycleSQL can consistently improve the performance of existing models, and in particular, by applying CycleSQL to RESDSQL, obtains a translation accuracy of 82.0% (+2.6%) on the validation set, and 81.6% (+3.2%) on the test set of Spider benchmark; 2) the generated NL explanations can also provide insightful information for users, aiding in the comprehension of translation results and consequently enhancing the interpretability of NL2SQL translation.
Mitigating Metric Bias in Minimum Bayes Risk Decoding
Kovacs, Geza, Deutsch, Daniel, Freitag, Markus
While Minimum Bayes Risk (MBR) decoding using metrics such as COMET or MetricX has outperformed traditional decoding methods such as greedy or beam search, it introduces a challenge we refer to as metric bias. As MBR decoding aims to produce translations that score highly according to a specific utility metric, this very process makes it impossible to use the same metric for both decoding and evaluation, as improvements might simply be due to reward hacking rather than reflecting real quality improvements. In this work we find that compared to human ratings, neural metrics not only overestimate the quality of MBR decoding when the same metric is used as the utility metric, but they also overestimate the quality of MBR/QE decoding with other neural utility metrics as well. We also show that the metric bias issue can be mitigated by using an ensemble of utility metrics during MBR decoding: human evaluations show that MBR decoding using an ensemble of utility metrics outperforms a single utility metric.
Language Models and Cycle Consistency for Self-Reflective Machine Translation
This paper introduces a novel framework that leverages large language models (LLMs) for machine translation (MT). We start with one conjecture: an ideal translation should contain complete and accurate information for a strong enough LLM to recover the original sentence. We generate multiple translation candidates from a source language A to a target language B, and subsequently translate these candidates back to the original language A. By evaluating the cycle consistency between the original and back-translated sentences using metrics such as tokenlevel precision and accuracy, we implicitly estimate the translation quality in language B, without knowing its ground-truth. This also helps to evaluate the LLM translation capability, only with monolingual corpora. For each source sentence, we identify the translation candidate with optimal cycle consistency with the original sentence as the final answer. Our experiments demonstrate that larger LLMs, or the same LLM with more forward passes during inference, exhibit increased cycle consistency, aligning with the LLM model size scaling law [Kaplan et al. (2020)] and test-time computation scaling law [Snell et al. (2024)]. This work provide methods for, 1) to implicitly evaluate translation quality of a sentence in the target language, 2), to evaluate capability of LLM for any-to-any-language translation, and 3), how to generate a better translation for a specific LLM.
MoCE: Adaptive Mixture of Contextualization Experts for Byte-based Neural Machine Translation
Huang, Langlin, Bu, Mengyu, Feng, Yang
MSC (Huang and Feng, 2024) argues that a byte should contribute to multiple neighboring Neural Machine Translation (NMT) is a consistently contexts, necessitating a multi-scale contextualization hot research topic, and recent years have approach. To this end, MSC groups hidden seen the growing significance of multilingual language state dimensions and assigns CNNs with different modeling (Zhang et al., 2023). The selection kernel sizes to each group. of tokenization and vocabulary is critical to Although MSC provides an effective framework multilingual language models, which plays an important for modeling multi-scale contextualization and role in vectorization of texts and discretization achieved state-of-the-art performance, it suffers of predicted hidden states. While some models from a significant limitation: the scales are manually (Costa-jussà et al., 2022; Dubey et al., 2024) predefined. This reduces the model's ability use large vocabularies to ensure word coverage, to generalize to multilingual scenarios, particularly others (Touvron et al., 2023; Jiang et al., 2023) opt in massively multilingual machine translation, for byte fallback strategy. This approach allows which may involve over 50 languages.
MetaMetrics-MT: Tuning Meta-Metrics for Machine Translation via Human Preference Calibration
Anugraha, David, Kuwanto, Garry, Susanto, Lucky, Wijaya, Derry Tanti, Winata, Genta Indra
We present MetaMetrics-MT, an innovative metric designed to evaluate machine translation (MT) tasks by aligning closely with human preferences through Bayesian optimization with Gaussian Processes. MetaMetrics-MT enhances existing MT metrics by optimizing their correlation with human judgments. Our experiments on the WMT24 metric shared task dataset demonstrate that MetaMetrics-MT outperforms all existing baselines, setting a new benchmark for state-of-the-art performance in the reference-based setting. Furthermore, it achieves comparable results to leading metrics in the reference-free setting, offering greater efficiency.