
Collaborating Authors

 Ugan, Enes Yavuz


PIER: A Novel Metric for Evaluating What Matters in Code-Switching

arXiv.org Artificial Intelligence

Code-switching, the alternation of languages within a single discourse, presents a significant challenge for Automatic Speech Recognition. Despite the unique nature of the task, performance is commonly measured with established metrics such as Word-Error-Rate (WER). However, in this paper, we question whether these general metrics accurately assess performance on code-switching. Specifically, using both Connectionist-Temporal-Classification and Encoder-Decoder models, we show that fine-tuning on non-code-switched data from both the matrix and the embedded language improves classical metrics on code-switching test sets, even though performance on the actual code-switched words degrades (as expected). Therefore, we propose the Point-of-Interest Error Rate (PIER), a variant of WER that focuses only on specific words of interest. We instantiate PIER on code-switched utterances and show that it describes code-switching performance more accurately, revealing substantial room for improvement in future work. This focused evaluation allows for a more precise assessment of model performance, particularly in challenging aspects such as inter-word and intra-word code-switching.
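To make the idea concrete, here is a minimal sketch of a PIER-style computation: a standard word-level Levenshtein alignment is backtraced, but errors are counted only on reference words tagged as points of interest (here, the code-switched words). The alignment details and the choice to ignore insertions are simplifications of ours, not the paper's reference implementation.

```python
# A minimal sketch of a Point-of-Interest Error Rate (PIER) style metric,
# assuming PIER restricts word-level error counting to a tagged subset of
# reference words. This is an illustrative formulation, not the authors' code.
def pier(ref, hyp, interesting):
    """ref/hyp: lists of words; interesting: set of reference-word indices
    marking the points of interest (e.g., code-switched words)."""
    R, H = len(ref), len(hyp)
    # Standard word-level Levenshtein DP table.
    d = [[0] * (H + 1) for _ in range(R + 1)]
    for i in range(R + 1):
        d[i][0] = i
    for j in range(H + 1):
        d[0][j] = j
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match / substitution
    # Backtrace, counting errors only on reference words of interest.
    i, j, errors = R, H, 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            if ref[i - 1] != hyp[j - 1] and (i - 1) in interesting:
                errors += 1        # substitution on a word of interest
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            if (i - 1) in interesting:
                errors += 1        # deletion of a word of interest
            i -= 1
        else:
            j -= 1                 # insertion: ignored in this simplification
    return errors / max(1, len(interesting))
```

For example, with the reference "ich habe das meeting gecancelt" (code-switched word at index 4) and a hypothesis that gets only that word wrong, WER is 0.2 while this PIER-style score is 1.0, which is exactly the gap the abstract describes.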


Titanic Calling: Low Bandwidth Video Conference from the Titanic Wreck

arXiv.org Artificial Intelligence

For several years, video conferencing tools have found applications across different domains and have been utilized for a variety of purposes. The pandemic in 2020 resulted in a substantial increase in their usage, particularly in the realms of business and education, as the employees have been working from home and students have been participating in the lectures online. Yet the application scope of the video communication systems could be beyond these scenarios. Such systems prove invaluable in facilitating natural communication under challenging conditions where conventional communication is restricted, such as deep-sea expeditions or lacking a stable broadband internet connection. By enabling the generation of audio and video, users can engage in seamless communication.

In this paper, we investigate the aforementioned scenario by developing a comprehensive system comprising speaker filtering and segmentation, ASR, text segmentation, multi-speaker TTS, and audio-driven talking face generation modules. The use-case scenario of this system is as follows: assuming the existence of multiple speakers and their pre-recorded videos, the system, upon the initiation of speakers' speech, distinguishes between speakers and their respective utterances. Following this phase, the ASR transcribes the text, and each segmented text derived from a text segmentation component undergoes processing by the TTS module to generate synthesized speech. As transmitting text proves to be the most straightforward and cost-effective [...]
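The module chain described above can be summarized in code. The sketch below is purely schematic: every function is a hypothetical placeholder for one of the named components (speaker filtering and segmentation, ASR, text segmentation, multi-speaker TTS, talking-face generation), since the abstract does not prescribe concrete interfaces.

```python
# A schematic sketch of the described pipeline. All names and signatures are
# hypothetical stand-ins for the paper's modules, not a real API.
from typing import Iterator, Tuple

def segment_by_speaker(audio: bytes) -> Iterator[Tuple[bytes, str]]:
    """Placeholder: speaker filtering/segmentation -> (utterance, speaker_id)."""
    raise NotImplementedError

def asr_transcribe(utterance: bytes) -> str:
    """Placeholder: automatic speech recognition."""
    raise NotImplementedError

def segment_text(text: str) -> list:
    """Placeholder: split a transcript into TTS-friendly chunks."""
    raise NotImplementedError

def tts_synthesize(chunk: str, voice: str) -> bytes:
    """Placeholder: multi-speaker text-to-speech."""
    raise NotImplementedError

def talking_face(enrollment_video: bytes, wav: bytes) -> bytes:
    """Placeholder: audio-driven talking-face generation."""
    raise NotImplementedError

def sender_side(audio: bytes) -> Iterator[Tuple[str, str]]:
    # Only (text_chunk, speaker_id) pairs cross the link: text is by far the
    # cheapest payload, which is the premise of the low-bandwidth scenario.
    for utterance, speaker in segment_by_speaker(audio):
        for chunk in segment_text(asr_transcribe(utterance)):
            yield chunk, speaker

def receiver_side(chunk: str, speaker: str, enrolled_videos: dict) -> bytes:
    wav = tts_synthesize(chunk, voice=speaker)          # re-synthesize the voice
    return talking_face(enrolled_videos[speaker], wav)  # lip-synced speaker video
```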


End-to-End Evaluation for Low-Latency Simultaneous Speech Translation

arXiv.org Artificial Intelligence

The challenge of low-latency speech translation has recently drawn significant interest in the research community, as shown by several publications and shared tasks. It is therefore essential to evaluate these different approaches in realistic scenarios. However, currently only specific aspects of the systems are evaluated, and it is often not possible to compare different approaches. In this work, we propose the first framework to perform and evaluate the various aspects of low-latency speech translation under realistic conditions. The evaluation is carried out in an end-to-end fashion, covering the segmentation of the audio as well as the run-time of the different components. Second, we compare different approaches to low-latency speech translation using this framework. We evaluate models with the option to revise their output as well as methods with fixed output. Furthermore, we directly compare state-of-the-art cascaded and end-to-end systems. Finally, the framework automatically evaluates both translation quality and latency, and also provides a web interface to show the low-latency model outputs to the user.
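To illustrate what latency measurement with revisable outputs involves, the sketch below timestamps every hypothesis update and records when each word of the final output last changed; this "last-write" notion of per-word latency is our simplification for illustration, not the framework's exact definition.

```python
# A minimal sketch of end-to-end latency bookkeeping for streaming output,
# assuming each (partial) hypothesis update arrives with a timestamp.
def word_latencies(updates):
    """updates: list of (timestamp_sec, hypothesis_as_word_list), in order.
    Returns the time at which each word of the final hypothesis last changed,
    so a model that revises its output pays a latency penalty on revised words."""
    latencies, prev = [], []
    for t, hyp in updates:
        while len(latencies) < len(hyp):
            latencies.append(None)
        for i, w in enumerate(hyp):
            if i >= len(prev) or prev[i] != w:
                latencies[i] = t        # word (re)written at time t
        prev = hyp
    return latencies[:len(prev)]        # one latency per word of the final output

# Example: the second word is revised at 1.8 s, so its latency is 1.8, not 1.2.
print(word_latencies([(0.5, ["hello"]),
                      (1.2, ["hello", "world"]),
                      (1.8, ["hello", "there"])]))   # -> [0.5, 1.8]
```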


KIT's Multilingual Speech Translation System for IWSLT 2023

arXiv.org Artificial Intelligence

Many existing speech translation benchmarks focus on native-English speech in high-quality recording conditions, which often does not match the conditions of real-life use cases. In this paper, we describe our speech translation system for the multilingual track of IWSLT 2023, which evaluates translation quality on scientific conference talks. The test condition features accented input speech and terminology-dense content. The task requires translation into 10 languages with varying amounts of resources. In the absence of training data from the target domain, we use a retrieval-based approach (kNN-MT) for effective adaptation (+0.8 BLEU for speech translation). We also use adapters to easily integrate incremental training data from data augmentation and show that this matches the performance of re-training. We observe that cascaded systems are more easily adaptable towards specific target domains due to their separate modules. Our cascaded speech system substantially outperforms its end-to-end counterpart on scientific talk translation, although their performance remains similar on TED talks.
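The retrieval-based adaptation follows the standard kNN-MT recipe (Khandelwal et al.): at each decoding step, the model's next-token distribution is interpolated with a distribution built from nearest neighbours in a datastore of (decoder state, target token) pairs. The sketch below shows that interpolation; the hyperparameters (k, temperature, lambda) and the flat datastore layout are illustrative assumptions, not the values used for the submission.

```python
# A minimal sketch of the kNN-MT interpolation used for retrieval-based
# domain adaptation. Hyperparameters and datastore handling are assumptions.
import numpy as np

def knn_mt_probs(p_model, query, keys, values, vocab_size,
                 k=8, temperature=10.0, lam=0.5):
    """p_model: (V,) model softmax; query: (d,) current decoder state;
    keys: (N, d) stored decoder states; values: (N,) stored target-token ids."""
    dists = np.sum((keys - query) ** 2, axis=1)   # L2 distance to every entry
    nn = np.argsort(dists)[:k]                    # k nearest neighbours
    weights = np.exp(-dists[nn] / temperature)    # closer -> higher weight
    weights /= weights.sum()
    p_knn = np.zeros(vocab_size)
    np.add.at(p_knn, values[nn], weights)         # aggregate weight per token id
    return lam * p_knn + (1.0 - lam) * p_model    # interpolated distribution
```

Because the datastore is built from in-domain text only, this shifts probability mass toward domain terminology without retraining the underlying model, which is why it suits the no-training-data condition described above.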


Language-agnostic Code-Switching in Sequence-To-Sequence Speech Recognition

arXiv.org Artificial Intelligence

Code-Switching (CS) refers to the phenomenon of alternately using words and phrases from different languages. While today's neural end-to-end (E2E) models deliver state-of-the-art performance on the task of automatic speech recognition (ASR), these systems are known to be very data-intensive, and only little transcribed and aligned CS speech is available. To overcome this problem and train multilingual systems which can transcribe CS speech, we propose a simple yet effective data augmentation in which audio and corresponding labels from different source languages are concatenated. Trained on this data, our E2E model improves at transcribing CS speech and also surpasses monolingual models on monolingual tests. The results show that this augmentation technique can even improve the model's performance on inter-sentential language switches not seen during training by 5.03% WER.
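The augmentation itself is simple enough to state in a few lines: sample one utterance per source language, then concatenate both the waveforms and the transcripts to form an artificial inter-sentential code-switching example. The sketch below assumes in-memory waveform lists and an optional silence gap between the segments; both details are our assumptions rather than specifics from the paper.

```python
# A minimal sketch of concatenation-based code-switching augmentation,
# assuming each corpus holds (waveform, transcript) pairs in one language.
import random

def make_cs_example(corpus_a, corpus_b, pause_samples=1600):
    """corpus_*: lists of (waveform: list[float], transcript: str).
    Returns one synthetic code-switched training pair."""
    wav_a, text_a = random.choice(corpus_a)
    wav_b, text_b = random.choice(corpus_b)
    silence = [0.0] * pause_samples      # short gap between segments (assumption)
    audio = wav_a + silence + wav_b      # concatenated audio
    label = f"{text_a} {text_b}"         # concatenated, aligned transcript
    return audio, label
```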


Code-Switching without Switching: Language Agnostic End-to-End Speech Translation

arXiv.org Artificial Intelligence

We propose a) a Language Agnostic end-to-end Speech Translation model (LAST), and b) a data augmentation strategy to increase code-switching (CS) performance. With increasing globalization, multiple languages are used interchangeably during fluent speech. Such CS complicates traditional speech recognition and translation: we must first recognize which language was spoken, then apply a language-dependent recognizer and a subsequent translation component to generate the desired target-language output. Such a pipeline introduces latency and errors. In this paper, we eliminate this need by treating speech recognition and translation as one unified end-to-end speech translation problem. By training LAST on both input languages, we decode speech into one target language regardless of the input language. LAST delivers comparable recognition and speech translation accuracy in monolingual usage, while considerably reducing latency and error rate when CS is observed.
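The contrast with the cascaded pipeline can be made explicit in code. In the schematic sketch below, all callables are hypothetical stand-ins passed in as arguments; the point is only that LAST removes the language-identification step and the per-language components from the path.

```python
# A schematic contrast between the cascade LAST replaces and the unified model
# described above. All callables are hypothetical placeholders, not real APIs.
def cascade_translate(audio, identify_language, asr_models, mt_models):
    lang = identify_language(audio)   # language ID: extra latency, extra errors
    text = asr_models[lang](audio)    # language-dependent recognizer
    return mt_models[lang](text)      # language-dependent translation component

def last_translate(audio, last_model):
    # One model trained on both input languages; it always emits the target
    # language, so no language ID is needed and code-switched input is handled
    # without ever committing to a single source language.
    return last_model(audio)
```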