Bojar, Ondřej
How "Real" is Your Real-Time Simultaneous Speech-to-Text Translation System?
Papi, Sara, Polak, Peter, Bojar, Ondřej, Macháček, Dominik
Simultaneous speech-to-text translation (SimulST) translates source-language speech into target-language text concurrently with the speaker's speech, ensuring low latency for better user comprehension. Despite its intended application to unbounded speech, most research has focused on human pre-segmented speech, simplifying the task and overlooking significant challenges. This narrow focus, coupled with widespread terminological inconsistencies, is limiting the applicability of research outcomes to real-world applications, ultimately hindering progress in the field. Our extensive literature review of 110 papers not only reveals these critical issues in current research but also serves as the foundation for our key contributions. We 1) define the steps and core components of a SimulST system, proposing a standardized terminology and taxonomy; 2) conduct a thorough analysis of community trends; and 3) offer concrete recommendations and future directions to bridge the gaps in existing literature, from evaluation frameworks to system architectures, for advancing the field towards more realistic and effective SimulST solutions.
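The read/write behaviour that such a taxonomy formalizes can be made concrete with a few lines of code. Below is a minimal sketch of a generic wait-k style policy, given only as an illustration of the simultaneous regime, not as the paper's proposed terminology or architecture; `translate_prefix` is a hypothetical incremental decoder that extends the target given the source seen so far.

```python
# Minimal sketch of a wait-k read/write policy for simultaneous
# translation. `translate_prefix` is a hypothetical incremental
# decoder: given the source prefix read so far and the target prefix
# emitted so far, it returns the next target token.

def wait_k_policy(source_stream, translate_prefix, k=3, eos="</s>"):
    source_prefix, target = [], []
    for token in source_stream:          # READ: consume one source token
        source_prefix.append(token)
        if len(source_prefix) < k:       # lag k tokens behind the speaker
            continue
        next_tok = translate_prefix(source_prefix, target)  # WRITE
        if next_tok != eos:
            target.append(next_tok)
    # source exhausted: flush the remaining target tokens
    while True:
        next_tok = translate_prefix(source_prefix, target)
        if next_tok == eos or len(target) > 2 * len(source_prefix):
            break
        target.append(next_tok)
    return target
```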
Continuous Rating as Reliable Human Evaluation of Simultaneous Speech Translation
Javorský, Dávid, Macháček, Dominik, Bojar, Ondřej
Simultaneous speech translation (SST) can be evaluated on simulated online events where human evaluators watch subtitled videos and continuously express their satisfaction by pressing buttons (so-called Continuous Rating). Continuous Rating is easy to collect, but little is known about its reliability or its relation to SST users' comprehension of a foreign-language document. In this paper, we contrast Continuous Rating with factual questionnaires, using judges with different levels of source language knowledge. Our results show that Continuous Rating is an easy and reliable SST quality assessment if the judges have at least limited knowledge of the source language. Our study indicates users' preferences on subtitle layout and presentation style and, most importantly, provides significant evidence that users with advanced source language knowledge prefer low latency over fewer re-translations.
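As an illustration of how such ratings might be turned into a single score, the sketch below computes a time-weighted mean over button presses; the (timestamp, rating) event format is an assumption made for the example, not the paper's actual data schema.

```python
# Sketch: aggregating Continuous Rating button presses into one
# document-level score. Assumed input: (timestamp_seconds, rating)
# pairs, where a rating pressed at time t holds until the next press
# (a step function). This schema is illustrative only.

def aggregate_continuous_rating(presses, doc_duration):
    """Time-weighted mean of the step function defined by button presses."""
    presses = sorted(presses)
    total, weighted = 0.0, 0.0
    for i, (t, rating) in enumerate(presses):
        t_next = presses[i + 1][0] if i + 1 < len(presses) else doc_duration
        span = max(0.0, min(t_next, doc_duration) - t)
        weighted += rating * span
        total += span
    return weighted / total if total else None

# Example: ratings on a 1-4 scale, pressed at 0s, 30s and 90s of a 120s video.
print(aggregate_continuous_rating([(0, 4), (30, 2), (90, 3)], 120.0))
# -> (4*30 + 2*60 + 3*30) / 120 = 2.75
```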
Findings of the IWSLT 2024 Evaluation Campaign
Ahmad, Ibrahim Said, Anastasopoulos, Antonios, Bojar, Ondřej, Borg, Claudia, Carpuat, Marine, Cattoni, Roldano, Cettolo, Mauro, Chen, William, Dong, Qianqian, Federico, Marcello, Haddow, Barry, Javorský, Dávid, Krubiński, Mateusz, Lam, Tsz Kin, Ma, Xutai, Mathur, Prashant, Matusov, Evgeny, Maurya, Chandresh, McCrae, John, Murray, Kenton, Nakamura, Satoshi, Negri, Matteo, Niehues, Jan, Niu, Xing, Ojha, Atul Kr., Ortega, John, Papi, Sara, Polák, Peter, Pospíšil, Adam, Pecina, Pavel, Salesky, Elizabeth, Sethiya, Nivedita, Sarkar, Balaram, Shi, Jiatong, Sikasote, Claytone, Sperber, Matthias, Stüker, Sebastian, Sudoh, Katsuhito, Thompson, Brian, Turchi, Marco, Waibel, Alex, Watanabe, Shinji, Wilken, Patrick, Zemánek, Petr, Zevallos, Rodolfo
This paper reports on the shared tasks organized by the 21st IWSLT Conference. The shared tasks address 7 scientific challenges in spoken language translation: simultaneous and offline translation, automatic subtitling and dubbing, speech-to-speech translation, dialect and low-resource speech translation, and Indic languages. The shared tasks attracted 18 teams whose submissions are documented in 26 system papers. The growing interest in spoken language translation is also witnessed by the constantly increasing number of shared task organizers and contributors to the overview paper, almost evenly distributed across industry and academia.
Adversarial Testing as a Tool for Interpretability: Length-based Overfitting of Elementary Functions in Transformers
Zavoral, Patrik, Variš, Dušan, Bojar, Ondřej
The Transformer model has a tendency to overfit various aspects of the training data, such as the overall sequence length. We study elementary string edit functions using a defined set of error indicators to interpret the behaviour of the sequence-to-sequence Transformer. We show that generalization to shorter sequences is often possible, but confirm that longer sequences are highly problematic, although partially correct answers are often obtained. Additionally, we find that other structural characteristics of the sequences, such as subsegment length, may be equally important.
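The length-based out-of-distribution condition can be reproduced with a few lines of data generation. The sketch below builds disjoint length ranges for the string reversal task; the concrete range boundaries are illustrative, not the paper's exact configuration.

```python
import random
import string

# Sketch: a length-based out-of-distribution split for string reversal.
# Training sequences come from one length range, evaluation sequences
# from disjoint (shorter or longer) ranges, so the only novel condition
# at test time is sequence length.

def reversal_examples(n, min_len, max_len, alphabet=string.ascii_lowercase):
    examples = []
    for _ in range(n):
        length = random.randint(min_len, max_len)
        src = "".join(random.choices(alphabet, k=length))
        examples.append((src, src[::-1]))  # input and its reversal
    return examples

train = reversal_examples(10_000, min_len=5, max_len=20)        # in-distribution
test_longer = reversal_examples(1_000, min_len=21, max_len=40)  # OOD: longer
test_shorter = reversal_examples(1_000, min_len=1, max_len=4)   # OOD: shorter
```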
Evaluating the IWSLT2023 Speech Translation Tasks: Human Annotations, Automatic Metrics, and Segmentation
Sperber, Matthias, Bojar, Ondřej, Haddow, Barry, Javorský, Dávid, Ma, Xutai, Negri, Matteo, Niehues, Jan, Polák, Peter, Salesky, Elizabeth, Sudoh, Katsuhito, Turchi, Marco
Human evaluation is a critical component in machine translation system development and has received much attention in text translation research. However, little prior work exists on the topic of human evaluation for speech translation, which poses additional challenges such as noisy data and segmentation mismatches. We take first steps to fill this gap by conducting a comprehensive human evaluation of the results of several shared tasks from the last International Workshop on Spoken Language Translation (IWSLT 2023). We propose an effective evaluation strategy based on automatic resegmentation and direct assessment with segment context. Our analysis revealed that: 1) the proposed evaluation strategy is robust and its scores correlate well with other types of human judgements; 2) automatic metrics are usually, but not always, well-correlated with direct assessment scores; and 3) COMET is a slightly stronger automatic metric than chrF, despite the segmentation noise introduced by the resegmentation step. We release the collected human-annotated data in order to encourage further investigation.
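The metric meta-evaluation step can be sketched in a few lines: compute segment-level metric scores and correlate them with direct assessment (DA) scores. Below, the parallel lists are hypothetical placeholders; chrF is computed with sacrebleu, and COMET scores (computed with its own toolkit) could be correlated the same way.

```python
from scipy.stats import pearsonr
from sacrebleu.metrics import CHRF

# Sketch: segment-level correlation of an automatic metric with human
# direct assessment scores. `hyps`, `refs` and `da_scores` are assumed
# parallel lists of system outputs, references and human DA scores.

def chrf_da_correlation(hyps, refs, da_scores):
    chrf = CHRF()
    seg_scores = [chrf.sentence_score(h, [r]).score for h, r in zip(hyps, refs)]
    r, p_value = pearsonr(seg_scores, da_scores)
    return r, p_value
```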
Understanding the role of FFNs in driving multilingual behaviour in LLMs
Bhattacharya, Sunit, Bojar, Ondřej
Multilingualism in Large Language Models (LLMs) is a yet under-explored area. In this paper, we conduct an in-depth analysis of the multilingual capabilities of a family of Large Language Models, examining their architecture, activation patterns, and processing mechanisms across languages. We introduce novel metrics to probe the model's multilingual behaviour at different layers and shed light on the impact of architectural choices on multilingual processing. Our findings reveal different patterns of multilingual processing in the sublayers of the Feed-Forward Networks of the models. Furthermore, we uncover the phenomenon of "over-layerization" in certain model configurations, where increasing layer depth without corresponding adjustments to other parameters may degrade model performance. Through comparisons within and across languages, we demonstrate the interplay between model architecture, layer depth, and multilingual processing capabilities of LLMs trained on multiple languages.
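A common way to inspect per-layer FFN activity across languages is via forward hooks. The sketch below records the mean activation norm of each layer's FFN block for sentences in different languages; the model name is a placeholder and the `.transformer.h[...].mlp` attribute path assumes a GPT-2-style layout, not necessarily the model family analysed in the paper.

```python
import torch
from collections import defaultdict
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch: per-layer FFN activation statistics for inputs in different
# languages, collected with forward hooks. Model name and module paths
# are placeholders; adapt them to the architecture under study.

model_name = "gpt2"  # placeholder; any causal LM with per-layer .mlp blocks
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

ffn_norms = defaultdict(list)

def make_hook(layer_idx):
    def hook(module, inputs, output):
        # mean L2 norm of the FFN output over all token positions
        ffn_norms[layer_idx].append(output.norm(dim=-1).mean().item())
    return hook

handles = [block.mlp.register_forward_hook(make_hook(i))
           for i, block in enumerate(model.transformer.h)]

sentences = {"en": "The cat sat on the mat.",
             "de": "Die Katze saß auf der Matte."}
with torch.no_grad():
    for sent in sentences.values():
        model(**tok(sent, return_tensors="pt"))

for h in handles:
    h.remove()
print({layer: sum(v) / len(v) for layer, v in ffn_norms.items()})
```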
On Difficulties of Attention Factorization through Shared Memory
Yorsh, Uladzislau, Holeňa, Martin, Bojar, Ondřej, Herel, David
Transformers have revolutionized deep learning in numerous fields, including natural language processing, computer vision, and audio processing. Their strength lies in their attention mechanism, which allows for the discovery of complex input relationships. However, this mechanism's quadratic time and memory complexity poses challenges for larger inputs. Researchers are now investigating models like Linear Unified Nested Attention (Luna) or Memory Augmented Transformer, which leverage external learnable memory to either reduce the attention computation complexity to linear, or to propagate information between chunks in chunk-wise processing. Our findings challenge the conventional thinking on these models, revealing that interfacing with the memory directly through an attention operation is suboptimal, and that the performance may be considerably improved by filtering the input signal before communicating with memory.
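The memory-interfacing pattern these models share can be sketched in a few lines: a small set of m learnable memory slots attends to the n input tokens, and the input then attends back to the updated memory, reducing cost from O(n^2) to O(n*m). The module below is a schematic illustration of this family, not any specific published architecture.

```python
import torch
import torch.nn as nn

# Schematic Luna-style memory attention: the input never attends to
# itself; learnable memory slots read the input ("pack"), then the
# input reads the updated memory ("unpack").

class MemoryAttention(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_memory=16):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(n_memory, d_model) * 0.02)
        self.pack = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.unpack = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):                               # x: (batch, n, d_model)
        mem = self.memory.expand(x.size(0), -1, -1)     # (batch, m, d_model)
        mem, _ = self.pack(mem, x, x)                   # memory reads the input
        out, _ = self.unpack(x, mem, mem)               # input reads the memory
        return out

x = torch.randn(2, 1000, 256)
print(MemoryAttention()(x).shape)  # torch.Size([2, 1000, 256])
```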
Quality and Quantity of Machine Translation References for Automated Metrics
Zouhar, Vilém, Bojar, Ondřej
Automatic machine translation metrics often use human translations to determine the quality of system translations. Common wisdom in the field dictates that the human references should be of very high quality. However, there are no cost-benefit analyses that could be used to guide practitioners who plan to collect references for machine translation evaluation. We find that higher-quality references lead to better metric correlations with humans at the segment-level. Having up to 7 references per segment and taking their average helps all metrics. Interestingly, references from vendors of different qualities can be mixed together and improve metric success. Higher-quality references, however, cost more to create, and we frame this as an optimization problem: given a specific budget, which references should be collected to maximize metric success? These findings can be used by evaluators of shared tasks when references need to be created under a certain budget.
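The optimization view can be illustrated with a toy search over reference "vendors". All costs and utility numbers below are hypothetical, and in practice the utility of a reference set would be estimated from held-out metric-human correlation experiments.

```python
from itertools import combinations

# Toy sketch of budget-constrained reference selection: each candidate
# reference source has a per-segment cost and an assumed contribution
# to metric-human correlation. All numbers are hypothetical.

vendors = {  # name: (cost per segment, assumed utility)
    "crowd": (1.0, 0.60),
    "pro_single": (4.0, 0.75),
    "pro_reviewed": (7.0, 0.80),
}

def best_reference_set(budget):
    best, best_utility = None, -1.0
    names = list(vendors)
    for r in range(1, len(names) + 1):
        for combo in combinations(names, r):
            cost = sum(vendors[v][0] for v in combo)
            # crude utility model: best member plus a small mixing bonus
            utility = max(vendors[v][1] for v in combo) + 0.02 * (len(combo) - 1)
            if cost <= budget and utility > best_utility:
                best, best_utility = combo, utility
    return best, best_utility

print(best_reference_set(budget=8.0))  # -> (('crowd', 'pro_reviewed'), 0.82)
```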
Evaluating Optimal Reference Translations
Zouhar, Vilém, Kloudová, Věra, Popel, Martin, Bojar, Ondřej
Machine translation (MT) is routinely evaluated using various segment-level similarity metrics against one or more reference translations. At the same time, reference translations acquired in the standard way are often criticized for their flaws of various types. For several high-resourced language pairs, MT quality reaches levels comparable to the quality of the reference translation (Freitag et al. 2022; Hassan et al. 2018) and sometimes MT even significantly surpasses humans in a particular evaluation setting (Popel et al. 2020). Given this, one could conclude that state-of-the-art MT has reached the point where reference-based evaluation is no longer reliable and we have to resort to other methods (such as targeted expert evaluation of particular outputs), even if they are costly, subjective and possibly impossible to automate. The narrow goal of the presented work is to allow for an "extension of the expiry date" for reference-based evaluation methods. In a broader perspective, we want to formulate a methodology for creating reference translations which avoid the often-observed deficiencies of "standard" or "professional" reference translations, be they multiple interfering phenomena, inappropriate expressions, ignorance of topic-focus articulation (information structure) or other abundant shortcomings in the translation, indicating their authors' insensitivity to the topic itself, but above all to the source and target language. To this end, we introduce so-called optimal reference translations (ORT), which are intended to represent optimal (ideal or excellent) human translations (should they be the subject of a translation quality evaluation).
Boosting Unsupervised Machine Translation with Pseudo-Parallel Data
Kvapilíková, Ivana, Bojar, Ondřej
After the great advancements in machine translation (MT) quality brought by neural MT (NMT; Bahdanau et al., 2015; Vaswani et al., 2017) trained on millions of pre-translated sentence pairs, there came a realization that parallel data is expensive and surely not available for most language pairs in the world. Researchers started focusing their attention on methods leveraging monolingual data for machine translation (Sennrich et al., 2016b) and even explored the extreme scenario of training a translation system in a completely unsupervised way with no parallel data at all (Artetxe et al., 2018b; Lample et al., 2018a). The recent impressive progress in language modeling did not leave the area of machine translation intact. However, the translation capabilities of large language models such as the latest GPT models (Brown et al., 2020) are weak for underrepresented languages (Hendy et al., 2023) and unsupervised MT aimed at low-resource languages still deserves special attention.

There are two ways to approach machine translation trained exclusively on monolingual data. In the absence of parallel texts, the monolingual training sentences can either be coupled with their synthetic counterparts which are automatically generated through back-translation (Artetxe et al., 2018b; Lample et al., 2018a), or with authentic counterparts which are automatically selected from existing monolingual texts to be as close translations as possible (Ruiter et al., 2019). Researchers have successfully explored both of these avenues with the conclusion that it is indeed possible to train a functional MT system on monolingual texts only. However, little attention has been paid to combining the two approaches together.

In this paper, we work with the standard framework for training unsupervised MT but we incorporate an additional training step where sentence pairs mined from monolingual corpora are used to train the model with a standard supervised MT objective.
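The selection of authentic counterparts can be sketched as margin-based nearest-neighbour mining over cross-lingual sentence embeddings. In the sketch below, `src_vecs` and `tgt_vecs` are assumed to be L2-normalized outputs of some multilingual sentence encoder, and the ratio-margin criterion is one common mining heuristic, not necessarily the paper's exact method.

```python
import numpy as np

# Sketch: mining pseudo-parallel sentence pairs from two monolingual
# corpora by cosine similarity with a ratio-margin filter. Rows of
# src_vecs (n, d) and tgt_vecs (m, d) are assumed L2-normalized
# embeddings from a hypothetical multilingual encoder.

def mine_pseudo_parallel(src_vecs, tgt_vecs, k=4, threshold=1.05):
    sims = src_vecs @ tgt_vecs.T                 # cosine similarity matrix
    pairs = []
    for i in range(sims.shape[0]):
        j = int(np.argmax(sims[i]))              # nearest target sentence
        topk = np.sort(sims[i])[-k:]             # k most similar targets
        # margin: best score relative to the mean of the k nearest scores
        margin = sims[i, j] / (topk.mean() + 1e-9)
        if margin >= threshold:                  # keep only confident pairs
            pairs.append((i, j, float(sims[i, j])))
    return pairs
```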