Goto

Collaborating Authors

 Coheur, Luisa


xCOMET: Transparent Machine Translation Evaluation through Fine-grained Error Detection

arXiv.org Artificial Intelligence

Widely used learned metrics for machine translation evaluation, such as COMET and BLEURT, estimate the quality of a translation hypothesis by providing a single sentence-level score. As such, they offer little insight into translation errors (e.g., what are the errors and what is their severity). On the other hand, generative large language models (LLMs) are amplifying the adoption of more granular strategies to evaluation, attempting to detail and categorize translation errors. In this work, we introduce xCOMET, an open-source learned metric designed to bridge the gap between these approaches. xCOMET integrates both sentence-level evaluation and error span detection capabilities, exhibiting state-of-the-art performance across all types of evaluation (sentence-level, system-level, and error span detection). Moreover, it does so while highlighting and categorizing error spans, thus enriching the quality assessment. We also provide a robustness analysis with stress tests, and show that xCOMET is largely capable of identifying localized critical errors and hallucinations.


Scaling up COMETKIWI: Unbabel-IST 2023 Submission for the Quality Estimation Shared Task

arXiv.org Artificial Intelligence

We present the joint contribution of Unbabel and Instituto Superior T\'ecnico to the WMT 2023 Shared Task on Quality Estimation (QE). Our team participated on all tasks: sentence- and word-level quality prediction (task 1) and fine-grained error span detection (task 2). For all tasks, we build on the COMETKIWI-22 model (Rei et al., 2022b). Our multilingual approaches are ranked first for all tasks, reaching state-of-the-art performance for quality estimation at word-, span- and sentence-level granularity. Compared to the previous state-of-the-art COMETKIWI-22, we show large improvements in correlation with human judgements (up to 10 Spearman points). Moreover, we surpass the second-best multilingual submission to the shared-task with up to 3.8 absolute points.


Fuzzy Fingerprinting Transformer Language-Models for Emotion Recognition in Conversations

arXiv.org Artificial Intelligence

Fuzzy Fingerprints have been successfully used as an interpretable text classification technique, but, like most other techniques, have been largely surpassed in performance by Large Pre-trained Language Models, such as BERT or RoBERTa. These models deliver state-of-the-art results in several Natural Language Processing tasks, namely Emotion Recognition in Conversations (ERC), but suffer from the lack of interpretability and explainability. In this paper, we propose to combine the two approaches to perform ERC, as a means to obtain simpler and more interpretable Large Language Models-based classifiers. We propose to feed the utterances and their previous conversational turns to a pre-trained RoBERTa, obtaining contextual embedding utterance representations, that are then supplied to an adapted Fuzzy Fingerprint classification module. We validate our approach on the widely used DailyDialog ERC benchmark dataset, in which we obtain state-of-the-art level results using a much lighter model.


Towards a Fully Unsupervised Framework for Intent Induction in Customer Support Dialogues

arXiv.org Artificial Intelligence

The evolution of technology has allowed the automation of several processes across diversified engineering industry fields, such as customer support services, which have drastically evolved with the advances in Natural Language Processing and Machine Learning. One of the major challenges of these systems is to identify users intentions, a complex Natural Language Understanding task, that vary across domains. With the evolution of Deep Learning architectures, recent works focused on modelling intentions and creating a taxonomy of intents, so they can be fed to powerful supervised clustering algorithms (Haponchyk et al., 2020; Chatterjee and Sengupta, 2021). However, these systems have the bottleneck of requiring the existence of labelled data to be trained and deployed, and, thus, they can not be easily transferred to real world customer support services, where the available data for a commercial chatbot usually consists in no more than a dataset of interactions between clients and operators. As labeling hundreds of utterances with intent labels can be time-consuming, laborious, expensive and, sometimes, even requires someone with expertise, it is not straightforward to apply current state of the art supervised models to new domains (Chatterjee and Sengupta, 2020).


Enhancing Portuguese Sign Language Animation with Dynamic Timing and Mouthing

arXiv.org Artificial Intelligence

Current signing avatars are often described as unnatural as they cannot accurately reproduce all the subtleties of synchronized body behaviors of a human signer. In this paper, we propose a new dynamic approach for transitions between signs, focusing on mouthing animations for Portuguese Sign Language. Although native signers preferred animations with dynamic transitions, we did not find significant differences in comprehension and perceived naturalness scores. On the other hand, we show that including mouthing behaviors improved comprehension and perceived naturalness for novice sign language learners. Results have implications in computational linguistics, human-computer interaction, and synthetic animation of signing avatars.


The Inside Story: Towards Better Understanding of Machine Translation Neural Evaluation Metrics

arXiv.org Artificial Intelligence

Neural metrics for machine translation evaluation, such as COMET, exhibit significant improvements in their correlation with human judgments, as compared to traditional metrics based on lexical overlap, such as BLEU. Yet, neural metrics are, to a great extent, "black boxes" returning a single sentence-level score without transparency about the decision-making process. In this work, we develop and compare several neural explainability methods and demonstrate their effectiveness for interpreting state-of-the-art fine-tuned neural metrics. Our study reveals that these metrics leverage token-level information that can be directly attributed to translation errors, as assessed through comparison of token-level neural saliency maps with Multidimensional Quality Metrics (MQM) annotations and with synthetically-generated critical translation errors. To ease future research, we release our code at: https://github.com/Unbabel/COMET/tree/explainable-metrics.


Using Implicit Feedback to Improve Question Generation

arXiv.org Artificial Intelligence

Question Generation (QG) is a task of Natural Language Processing (NLP) that aims at automatically generating questions from text. Many applications can benefit from automatically generated questions, but often it is necessary to curate those questions, either by selecting or editing them. This task is informative on its own, but it is typically done post-generation, and, thus, the effort is wasted. In addition, most existing systems cannot incorporate this feedback back into them easily. In this work, we present a system, GEN, that learns from such (implicit) feedback. Following a pattern-based approach, it takes as input a small set of sentence/question pairs and creates patterns which are then applied to new unseen sentences. Each generated question, after being corrected by the user, is used as a new seed in the next iteration, so more patterns are created each time. We also take advantage of the corrections made by the user to score the patterns and therefore rank the generated questions. Results show that GEN is able to improve by learning from both levels of implicit feedback when compared to the version with no learning, considering the top 5, 10, and 20 questions. Improvements go up from 10%, depending on the metric and strategy used.