AITopics | Schlüter, Ralf

Collaborating Authors

Schlüter, Ralf

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Efficient Supernet Training with Orthogonal Softmax for Scalable ASR Model Compression

Xu, Jingjing, Beck, Eugen, Yang, Zijian, Schlüter, Ralf

arXiv.org Artificial IntelligenceFeb-4-2025

ASR systems are deployed across diverse environments, each with specific hardware constraints. We use supernet training to jointly train multiple encoders of varying sizes, enabling dynamic model size adjustment to fit hardware constraints without redundant training. Moreover, we introduce a novel method called OrthoSoftmax, which applies multiple orthogonal softmax functions to efficiently identify optimal subnets within the supernet, avoiding resource-intensive search. This approach also enables more flexible and precise subnet selection by allowing selection based on various criteria and levels of granularity. Our results with CTC on Librispeech and TED-LIUM-v2 show that FLOPs-aware component-wise selection achieves the best overall performance. With the same number of training updates from one single job, WERs for all model sizes are comparable to or slightly better than those of individually trained models. Furthermore, we analyze patterns in the selected components and reveal interesting insights.

efficient supernet training, orthogonal softmax, scalable asr model compression

arXiv.org Artificial Intelligence

2501.18895

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.60)

Add feedback

Classification Error Bound for Low Bayes Error Conditions in Machine Learning

Yang, Zijian, Eminyan, Vahe, Schlüter, Ralf, Ney, Hermann

arXiv.org Machine LearningJan-27-2025

In statistical classification and machine learning, classification error is an important performance measure, which is minimized by the Bayes decision rule. In practice, the unknown true distribution is usually replaced with a model distribution estimated from the training data in the Bayes decision rule. This substitution introduces a mismatch between the Bayes error and the model-based classification error. In this work, we apply classification error bounds to study the relationship between the error mismatch and the Kullback-Leibler divergence in machine learning. Motivated by recent observations of low model-based classification errors in many machine learning tasks, bounding the Bayes error to be lower, we propose a linear approximation of the classification error bound for low Bayes error conditions. Then, the bound for class priors are discussed. Moreover, we extend the classification error bound for sequences. Using automatic speech recognition as a representative example of machine learning applications, this work analytically discusses the correlations among different performance measures with extended bounds, including cross-entropy loss, language model perplexity, and word error rate.

artificial intelligence, bayesian inference, machine learning, (16 more...)

arXiv.org Machine Learning

2501.15977

Country: Europe > Spain (0.14)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.57)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.56)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.56)

Add feedback

The Conformer Encoder May Reverse the Time Dimension

Schmitt, Robin, Zeyer, Albert, Zeineldeen, Mohammad, Schlüter, Ralf, Ney, Hermann

arXiv.org Machine LearningOct-1-2024

We sometimes observe monotonically decreasing cross-attention weights in our Conformer-based global attention-based encoder-decoder (AED) models. Further investigation shows that the Conformer encoder internally reverses the sequence in the time dimension. We analyze the initial behavior of the decoder cross-attention mechanism and find that it encourages the Conformer encoder self-attention to build a connection between the initial frames and all other informative frames. Furthermore, we show that, at some point in training, the self-attention module of the Conformer starts dominating the output over the preceding feed-forward module, which then only allows the reversed information to pass through. We propose several methods and ideas of how this flipping can be avoided. Additionally, we investigate a novel method to obtain label-frame-position alignments by using the gradients of the label log probabilities w.r.t. the encoder input frames.

available, machine learning, natural language, (18 more...)

arXiv.org Machine Learning

2410.0068

Country:

Europe (1.00)
North America > United States > Hawaii (0.14)
North America > United States > California (0.14)
Asia > Middle East > Qatar (0.14)

Genre: Research Report (0.85)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Speech (0.97)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Add feedback

On the Relevance of Phoneme Duration Variability of Synthesized Training Data for Automatic Speech Recognition

Rossenbach, Nick, Hilmes, Benedikt, Schlüter, Ralf

arXiv.org Artificial IntelligenceOct-12-2023

Synthetic data generated by text-to-speech (TTS) systems can be used to improve automatic speech recognition (ASR) systems in low-resource or domain mismatch tasks. It has been shown that TTS-generated outputs still do not have the same qualities as real data. In this work we focus on the temporal structure of synthetic data and its relation to ASR training. By using a novel oracle setup we show how much the degradation of synthetic data quality is influenced by duration modeling in non-autoregressive (NAR) TTS. To get reference phoneme durations we use two common alignment methods, a hidden Markov Gaussian-mixture model (HMM-GMM) aligner and a neural connectionist temporal classification (CTC) aligner. Using a simple algorithm based on random walks we shift phoneme duration distributions of the TTS system closer to real durations, resulting in an improvement of an ASR system using synthetic data in a semi-supervised setting.

artificial intelligence, automatic speech recognition, phoneme duration variability, (2 more...)

arXiv.org Artificial Intelligence

2310.08132

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)

Add feedback

Investigating the Effect of Language Models in Sequence Discriminative Training for Neural Transducers

Yang, Zijian, Zhou, Wei, Schlüter, Ralf, Ney, Hermann

arXiv.org Artificial IntelligenceOct-11-2023

In this work, we investigate the effect of language models (LMs) with different context lengths and label units (phoneme vs. word) used in sequence discriminative training for phoneme-based neural transducers. Both lattice-free and N-best-list approaches are examined. For lattice-free methods with phoneme-level LMs, we propose a method to approximate the context history to employ LMs with full-context dependency. This approximation can be extended to arbitrary context length and enables the usage of word-level LMs in lattice-free methods. Moreover, a systematic comparison is conducted across lattice-free and N-best-list-based methods. Experimental results on Librispeech show that using the word-level LM in training outperforms the phoneme-level LM. Besides, we find that the context size of the LM used for probability computation has a limited effect on performance. Moreover, our results reveal the pivotal importance of the hypothesis space quality in sequence discriminative training.

language model, neural transducer, sequence discriminative training, (1 more...)

arXiv.org Artificial Intelligence

2310.07345

Genre: Research Report (0.69)

Technology: Information Technology > Artificial Intelligence > Natural Language (0.60)

Add feedback

End-to-End Training of a Neural HMM with Label and Transition Probabilities

Mann, Daniel, Raissi, Tina, Michel, Wilfried, Schlüter, Ralf, Ney, Hermann

arXiv.org Artificial IntelligenceOct-9-2023

We investigate a novel modeling approach for end-to-end neural network training using hidden Markov models (HMM) where the transition probabilities between hidden states are modeled and learned explicitly. Most contemporary sequence-to-sequence models allow for from-scratch training by summing over all possible label segmentations in a given topology. In our approach there are explicit, learnable probabilities for transitions between segments as opposed to a blank label that implicitly encodes duration statistics. We implement a GPU-based forward-backward algorithm that enables the simultaneous training of label and transition probabilities. We investigate recognition results and additionally Viterbi alignments of our models. We find that while the transition model training does not improve recognition performance, it has a positive impact on the alignment quality. The generated alignments are shown to be viable targets in state-of-the-art Viterbi trainings.

alignment, artificial intelligence, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2310.02724

Country: Europe > Germany (0.14)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Add feedback

On the Relation between Internal Language Model and Sequence Discriminative Training for Neural Transducers

Yang, Zijian, Zhou, Wei, Schlüter, Ralf, Ney, Hermann

arXiv.org Artificial IntelligenceSep-25-2023

Internal language model (ILM) subtraction has been widely applied to improve the performance of the RNN-Transducer with external language model (LM) fusion for speech recognition. In this work, we show that sequence discriminative training has a strong correlation with ILM subtraction from both theoretical and empirical points of view. Theoretically, we derive that the global optimum of maximum mutual information (MMI) training shares a similar formula as ILM subtraction. Empirically, we show that ILM subtraction and sequence discriminative training achieve similar performance across a wide range of experiments on Librispeech, including both MMI and minimum Bayes risk (MBR) criteria, as well as neural transducers and LMs of both full and limited context. The benefit of ILM subtraction also becomes much smaller after sequence discriminative training. We also provide an in-depth study to show that sequence discriminative training has a minimal effect on the commonly used zero-encoder ILM estimation, but a joint effect on both encoder and prediction + joint network for posterior probability reshaping including both ILM and blank suppression.

artificial intelligence, machine learning, natural language, (13 more...)

arXiv.org Artificial Intelligence

2309.1413

Country: Europe > Germany (0.14)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.39)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.36)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.30)

Add feedback

Development of Hybrid ASR Systems for Low Resource Medical Domain Conversational Telephone Speech

Lüscher, Christoph, Zeineldeen, Mohammad, Yang, Zijian, Raissi, Tina, Vieting, Peter, Le-Duc, Khai, Wang, Weiyue, Schlüter, Ralf, Ney, Hermann

arXiv.org Artificial IntelligenceSep-22-2023

Language barriers present a great challenge in our increasingly connected and global world. Especially within the medical domain, e.g. hospital or emergency room, communication difficulties and delays may lead to malpractice and non-optimal patient care. In the HYKIST project, we consider patient-physician communication, more specifically between a German-speaking physician and an Arabic- or Vietnamese-speaking patient. Currently, a doctor can call the Triaphon service to get assistance from an interpreter in order to help facilitate communication. The HYKIST goal is to support the usually non-professional bilingual interpreter with an automatic speech translation system to improve patient care and help overcome language barriers. In this work, we present our ASR system development efforts for this conversational telephone speech translation task in the medical domain for two languages pairs, data collection, various acoustic model architectures and dialect-induced difficulties.

interpreter, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2210.13397

Country:

Europe > Germany (0.29)
North America > United States (0.29)

Genre: Research Report (0.40)

Industry: Health & Medicine > Health Care Providers & Services (0.55)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.89)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)

Add feedback

Analyzing And Improving Neural Speaker Embeddings for ASR

Lüscher, Christoph, Xu, Jingjing, Zeineldeen, Mohammad, Schlüter, Ralf, Ney, Hermann

arXiv.org Machine LearningSep-20-2023

Neural speaker embeddings encode the speaker's speech characteristics through a DNN model and are prevalent for speaker verification tasks. However, few studies have investigated the usage of neural speaker embeddings for an ASR system. In this work, we present our efforts w.r.t integrating neural speaker embeddings into a conformer based hybrid HMM ASR system. For ASR, our improved embedding extraction pipeline in combination with the Weighted-Simple-Add integration method results in x-vector and c-vector reaching on par performance with i-vectors. We further compare and analyze different speaker embeddings. We present our acoustic model improvements obtained by switching from newbob learning rate schedule to one cycle learning schedule resulting in a ~3% relative WER reduction on Switchboard, additionally reducing the overall training time by 17%. By further adding neural speaker embeddings, we gain additional ~3% relative WER improvement on Hub5'00. Our best Conformer-based hybrid ASR system with speaker embeddings achieves 9.0% WER on Hub5'00 and Hub5'01 with training on SWB 300h.

artificial intelligence, machine learning, natural language, (16 more...)

arXiv.org Machine Learning

2301.04571

Country:

Europe (1.00)
Asia (0.93)
North America > Canada > Alberta (0.14)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.73)

Add feedback

Mixture Encoder Supporting Continuous Speech Separation for Meeting Recognition

Vieting, Peter, Berger, Simon, von Neumann, Thilo, Boeddeker, Christoph, Schlüter, Ralf, Haeb-Umbach, Reinhold

arXiv.org Artificial IntelligenceSep-15-2023

Many real-life applications of automatic speech recognition (ASR) require processing of overlapped speech. A commonmethod involves first separating the speech into overlap-free streams and then performing ASR on the resulting signals. Recently, the inclusion of a mixture encoder in the ASR model has been proposed. This mixture encoder leverages the original overlapped speech to mitigate the effect of artifacts introduced by the speech separation. Previously, however, the method only addressed two-speaker scenarios. In this work, we extend this approach to more natural meeting contexts featuring an arbitrary number of speakers and dynamic overlaps. We evaluate the performance using different speech separators, including the powerful TF-GridNet model. Our experiments show state-of-the-art performance on the LibriCSS dataset and highlight the advantages of the mixture encoder. Furthermore, they demonstrate the strong separation of TF-GridNet which largely closes the gap between previous methods and oracle separation.

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2309.08454

Country:

Europe (1.00)
Asia (0.94)
Oceania > Australia > Queensland (0.14)

Genre: Research Report (0.50)

Industry: Government (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language (0.94)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.31)

Add feedback