
Collaborating Authors

 Ugan, Enes Yavuz


PIER: A Novel Metric for Evaluating What Matters in Code-Switching

arXiv.org Artificial Intelligence

Code-switching, the alternation of languages within a single discourse, presents a significant challenge for Automatic Speech Recognition. Despite the unique nature of the task, performance is commonly measured with established metrics such as Word-Error-Rate (WER). However, in this paper, we question whether these general metrics accurately assess performance on code-switching. Specifically, using both Connectionist-Temporal-Classification and Encoder-Decoder models, we show that fine-tuning on non-code-switched data from both the matrix and the embedded language improves classical metrics on code-switching test sets, even though performance on the actual code-switched words degrades (as expected). Therefore, we propose the Point-of-Interest Error Rate (PIER), a variant of WER that focuses only on specific words of interest. We instantiate PIER on code-switched utterances and show that it describes code-switching performance more accurately, revealing substantial room for improvement in future work. This focused evaluation allows for a more precise assessment of model performance, particularly in challenging aspects such as inter-word and intra-word code-switching.
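To make the idea concrete, here is a minimal sketch of a PIER-style computation: a standard word-level Levenshtein alignment is backtraced, but errors are counted only on reference words tagged as points of interest (here, the code-switched words). The alignment details and the choice to ignore insertions are simplifications of ours, not the paper's reference implementation.

```python
# A minimal sketch of a Point-of-Interest Error Rate (PIER) style metric,
# assuming PIER restricts word-level error counting to a tagged subset of
# reference words. This is an illustrative formulation, not the authors' code.
def pier(ref, hyp, interesting):
    """ref/hyp: lists of words; interesting: set of reference-word indices
    marking the points of interest (e.g., code-switched words)."""
    R, H = len(ref), len(hyp)
    # Standard word-level Levenshtein DP table.
    d = [[0] * (H + 1) for _ in range(R + 1)]
    for i in range(R + 1):
        d[i][0] = i
    for j in range(H + 1):
        d[0][j] = j
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match / substitution
    # Backtrace, counting errors only on reference words of interest.
    i, j, errors = R, H, 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            if ref[i - 1] != hyp[j - 1] and (i - 1) in interesting:
                errors += 1        # substitution on a word of interest
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            if (i - 1) in interesting:
                errors += 1        # deletion of a word of interest
            i -= 1
        else:
            j -= 1                 # insertion: ignored in this simplification
    return errors / max(1, len(interesting))
```

For example, with the reference "ich habe das meeting gecancelt" (code-switched word at index 4) and a hypothesis that gets only that word wrong, WER is 0.2 while this PIER-style score is 1.0, which is exactly the gap the abstract describes.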


Titanic Calling: Low Bandwidth Video Conference from the Titanic Wreck

arXiv.org Artificial Intelligence

For several years, video conferencing tools have found applications across different domains and have been utilized for a variety of purposes. The pandemic in 2020 resulted in a substantial increase in their usage, particularly in the realms of business and education, as the employees have been working from home and students have been participating in the lectures online. Yet the application scope of the video communication systems could be beyond these scenarios. Such systems prove invaluable in facilitating natural communication under challenging conditions where conventional communication is restricted, such as deep-sea expeditions or lacking a stable broadband internet connection. By enabling the generation of audio and video, users can engage in seamless communication.

In this paper, we investigate the aforementioned scenario by developing a comprehensive system comprising speaker filtering and segmentation, ASR, text segmentation, multi-speaker TTS, and audio-driven talking face generation modules. The use-case scenario of this system is as follows: assuming the existence of multiple speakers and their pre-recorded videos, the system, upon the initiation of speakers' speech, distinguishes between speakers and their respective utterances. Following this phase, the ASR transcribes the text, and each segmented text derived from a text segmentation component undergoes processing by the TTS module to generate synthesized speech. As transmitting text proves to be the most straightforward and cost-effective [...]
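The module chain described above can be summarized in code. The sketch below is purely schematic: every function is a hypothetical placeholder for one of the named components (speaker filtering and segmentation, ASR, text segmentation, multi-speaker TTS, talking-face generation), since the abstract does not prescribe concrete interfaces.

```python
# A schematic sketch of the described pipeline. All names and signatures are
# hypothetical stand-ins for the paper's modules, not a real API.
from typing import Iterator, Tuple

def segment_by_speaker(audio: bytes) -> Iterator[Tuple[bytes, str]]:
    """Placeholder: speaker filtering/segmentation -> (utterance, speaker_id)."""
    raise NotImplementedError

def asr_transcribe(utterance: bytes) -> str:
    """Placeholder: automatic speech recognition."""
    raise NotImplementedError

def segment_text(text: str) -> list:
    """Placeholder: split a transcript into TTS-friendly chunks."""
    raise NotImplementedError

def tts_synthesize(chunk: str, voice: str) -> bytes:
    """Placeholder: multi-speaker text-to-speech."""
    raise NotImplementedError

def talking_face(enrollment_video: bytes, wav: bytes) -> bytes:
    """Placeholder: audio-driven talking-face generation."""
    raise NotImplementedError

def sender_side(audio: bytes) -> Iterator[Tuple[str, str]]:
    # Only (text_chunk, speaker_id) pairs cross the link: text is by far the
    # cheapest payload, which is the premise of the low-bandwidth scenario.
    for utterance, speaker in segment_by_speaker(audio):
        for chunk in segment_text(asr_transcribe(utterance)):
            yield chunk, speaker

def receiver_side(chunk: str, speaker: str, enrolled_videos: dict) -> bytes:
    wav = tts_synthesize(chunk, voice=speaker)          # re-synthesize the voice
    return talking_face(enrolled_videos[speaker], wav)  # lip-synced speaker video
```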


End-to-End Evaluation for Low-Latency Simultaneous Speech Translation

arXiv.org Artificial Intelligence

The challenge of low-latency speech translation has recently drawn significant interest in the research community, as shown by several publications and shared tasks. It is therefore essential to evaluate these different approaches in realistic scenarios. However, currently only specific aspects of the systems are evaluated, and it is often not possible to compare different approaches. In this work, we propose the first framework to perform and evaluate the various aspects of low-latency speech translation under realistic conditions. The evaluation is carried out in an end-to-end fashion, covering the segmentation of the audio as well as the run-time of the different components. Second, we compare different approaches to low-latency speech translation using this framework. We evaluate models with the option to revise their output as well as methods with fixed output. Furthermore, we directly compare state-of-the-art cascaded and end-to-end systems. Finally, the framework automatically evaluates both translation quality and latency, and also provides a web interface to show the low-latency model outputs to the user.
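To illustrate what latency measurement with revisable outputs involves, the sketch below timestamps every hypothesis update and records when each word of the final output last changed; this "last-write" notion of per-word latency is our simplification for illustration, not the framework's exact definition.

```python
# A minimal sketch of end-to-end latency bookkeeping for streaming output,
# assuming each (partial) hypothesis update arrives with a timestamp.
def word_latencies(updates):
    """updates: list of (timestamp_sec, hypothesis_as_word_list), in order.
    Returns the time at which each word of the final hypothesis last changed,
    so a model that revises its output pays a latency penalty on revised words."""
    latencies, prev = [], []
    for t, hyp in updates:
        while len(latencies) < len(hyp):
            latencies.append(None)
        for i, w in enumerate(hyp):
            if i >= len(prev) or prev[i] != w:
                latencies[i] = t        # word (re)written at time t
        prev = hyp
    return latencies[:len(prev)]        # one latency per word of the final output

# Example: the second word is revised at 1.8 s, so its latency is 1.8, not 1.2.
print(word_latencies([(0.5, ["hello"]),
                      (1.2, ["hello", "world"]),
                      (1.8, ["hello", "there"])]))   # -> [0.5, 1.8]
```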


KIT's Multilingual Speech Translation System for IWSLT 2023

arXiv.org Artificial Intelligence

Many existing speech translation benchmarks focus on native-English speech in high-quality recording conditions, which often does not match the conditions of real-life use cases. In this paper, we describe our speech translation system for the multilingual track of IWSLT 2023, which evaluates translation quality on scientific conference talks. The test condition features accented input speech and terminology-dense content. The task requires translation into 10 languages with varying amounts of resources. In the absence of training data from the target domain, we use a retrieval-based approach (kNN-MT) for effective adaptation (+0.8 BLEU for speech translation). We also use adapters to easily integrate incremental training data from data augmentation and show that this matches the performance of re-training. We observe that cascaded systems are more easily adaptable towards specific target domains due to their separate modules. Our cascaded speech system substantially outperforms its end-to-end counterpart on scientific talk translation, although their performance remains similar on TED talks.
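The retrieval-based adaptation follows the standard kNN-MT recipe (Khandelwal et al.): at each decoding step, the model's next-token distribution is interpolated with a distribution built from nearest neighbours in a datastore of (decoder state, target token) pairs. The sketch below shows that interpolation; the hyperparameters (k, temperature, lambda) and the flat datastore layout are illustrative assumptions, not the values used for the submission.

```python
# A minimal sketch of the kNN-MT interpolation used for retrieval-based
# domain adaptation. Hyperparameters and datastore handling are assumptions.
import numpy as np

def knn_mt_probs(p_model, query, keys, values, vocab_size,
                 k=8, temperature=10.0, lam=0.5):
    """p_model: (V,) model softmax; query: (d,) current decoder state;
    keys: (N, d) stored decoder states; values: (N,) stored target-token ids."""
    dists = np.sum((keys - query) ** 2, axis=1)   # L2 distance to every entry
    nn = np.argsort(dists)[:k]                    # k nearest neighbours
    weights = np.exp(-dists[nn] / temperature)    # closer -> higher weight
    weights /= weights.sum()
    p_knn = np.zeros(vocab_size)
    np.add.at(p_knn, values[nn], weights)         # aggregate weight per token id
    return lam * p_knn + (1.0 - lam) * p_model    # interpolated distribution
```

Because the datastore is built from in-domain text only, this shifts probability mass toward domain terminology without retraining the underlying model, which is why it suits the no-training-data condition described above.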


Language-agnostic Code-Switching in Sequence-To-Sequence Speech Recognition

arXiv.org Artificial Intelligence

Code-Switching (CS) refers to the phenomenon of alternately using words and phrases from different languages. While today's neural end-to-end (E2E) models deliver state-of-the-art performance on the task of automatic speech recognition (ASR), these systems are known to be very data-intensive, and only little transcribed and aligned CS speech is available. To overcome this problem and train multilingual systems which can transcribe CS speech, we propose a simple yet effective data augmentation in which audio and corresponding labels from different source languages are concatenated. Trained on this data, our E2E model improves at transcribing CS speech and also surpasses monolingual models on monolingual tests. The results show that this augmentation technique can even improve the model's performance on inter-sentential language switches not seen during training by 5.03% WER.
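The augmentation itself is simple enough to state in a few lines: sample one utterance per source language, then concatenate both the waveforms and the transcripts to form an artificial inter-sentential code-switching example. The sketch below assumes in-memory waveform lists and an optional silence gap between the segments; both details are our assumptions rather than specifics from the paper.

```python
# A minimal sketch of concatenation-based code-switching augmentation,
# assuming each corpus holds (waveform, transcript) pairs in one language.
import random

def make_cs_example(corpus_a, corpus_b, pause_samples=1600):
    """corpus_*: lists of (waveform: list[float], transcript: str).
    Returns one synthetic code-switched training pair."""
    wav_a, text_a = random.choice(corpus_a)
    wav_b, text_b = random.choice(corpus_b)
    silence = [0.0] * pause_samples      # short gap between segments (assumption)
    audio = wav_a + silence + wav_b      # concatenated audio
    label = f"{text_a} {text_b}"         # concatenated, aligned transcript
    return audio, label
```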


Code-Switching without Switching: Language Agnostic End-to-End Speech Translation

arXiv.org Artificial Intelligence

We propose a) a Language Agnostic end-to-end Speech Translation model (LAST), and b) a data augmentation strategy to increase code-switching (CS) performance. With increasing globalization, multiple languages are used interchangeably during fluent speech. Such CS complicates traditional speech recognition and translation: we must first recognize which language was spoken, then apply a language-dependent recognizer and a subsequent translation component to generate the desired target-language output. Such a pipeline introduces latency and errors. In this paper, we eliminate this need by treating speech recognition and translation as one unified end-to-end speech translation problem. By training LAST on both input languages, we decode speech into one target language regardless of the input language. LAST delivers comparable recognition and speech translation accuracy in monolingual usage, while considerably reducing latency and error rate when CS is observed.
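The contrast with the cascaded pipeline can be made explicit in code. In the schematic sketch below, all callables are hypothetical stand-ins passed in as arguments; the point is only that LAST removes the language-identification step and the per-language components from the path.

```python
# A schematic contrast between the cascade LAST replaces and the unified model
# described above. All callables are hypothetical placeholders, not real APIs.
def cascade_translate(audio, identify_language, asr_models, mt_models):
    lang = identify_language(audio)   # language ID: extra latency, extra errors
    text = asr_models[lang](audio)    # language-dependent recognizer
    return mt_models[lang](text)      # language-dependent translation component

def last_translate(audio, last_model):
    # One model trained on both input languages; it always emits the target
    # language, so no language ID is needed and code-switched input is handled
    # without ever committing to a single source language.
    return last_model(audio)
```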