Alumäe, Tanel
Optimizing Estonian TV Subtitles with Semi-supervised Learning and LLMs
Fedorchenko, Artem, Alumäe, Tanel
For instance, recent studies (Mykhalevych and Preply, 2024; Kim et al., 2023) have revealed that 50% of Americans and 85% of Netflix users overall frequently watch TV and streaming video content with subtitles. Studies show that subtitles can enhance understanding and memory retention, and many viewers choose to enjoy their content quietly.

Both iterative pseudo-labeling and LLM-based post-editing have been active areas of research in the context of verbatim automatic speech recognition (ASR). Pseudo-labeling based semi-supervised learning in ASR has been studied since at least Zavaliagkos et al. (1998) and has later been investigated in several works, e.g. by Veselý et al. (2013) and Xu et al. (2020).
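One round of the kind of pseudo-labeling cited above can be sketched briefly: transcribe unlabeled audio with a seed ASR model, filter the hypotheses, and add the survivors to the training pool. The Whisper checkpoint, the length-based filter, and the fine-tuning step mentioned in the comments are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of one pseudo-labeling round for ASR (illustrative only).
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

def pseudo_label(unlabeled_wavs, min_chars=5):
    """Transcribe unlabeled audio and keep only plausible hypotheses."""
    pseudo = []
    for wav in unlabeled_wavs:
        hyp = asr(wav)["text"].strip()
        # Simple length heuristic; real systems filter on confidence or LM scores.
        if len(hyp) >= min_chars:
            pseudo.append({"audio": wav, "text": hyp})
    return pseudo

# Iterative variant: fine-tune the ASR model on manual + pseudo-labeled data,
# reload it into the pipeline, re-label the unlabeled pool, and repeat.
```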
Finetuning End-to-End Models for Estonian Conversational Spoken Language Translation
Sildam, Tiia, Velve, Andra, Alumäe, Tanel
This paper investigates the fine-tuning of end-to-end models for bidirectional Estonian-English and Estonian-Russian conversational speech-to-text translation. Due to the limited availability of speech translation data for Estonian, we created additional training data by web scraping and by synthesizing data from speech recognition datasets using machine translation. We evaluated three publicly available end-to-end models: Whisper, OWSM 3.1, and SeamlessM4T. Our results indicate that fine-tuning with synthetic data improves translation accuracy by a large margin, with SeamlessM4T matching or surpassing cascaded speech translation systems that use state-of-the-art speech recognition and machine translation models.
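The synthetic-data step described in the abstract can be illustrated with a minimal sketch: machine-translate the transcripts of an ASR corpus and pair each translation with the original audio to obtain speech-translation training examples. The MT checkpoint (Helsinki-NLP/opus-mt-et-en) and the dataset field names below are assumptions for illustration, not necessarily what the paper used.

```python
# Hedged sketch: turning an ASR corpus (audio + Estonian transcript) into
# synthetic Estonian-English speech-translation data via machine translation.
from transformers import pipeline

mt_et_en = pipeline("translation", model="Helsinki-NLP/opus-mt-et-en")

def synthesize_st_examples(asr_examples):
    """asr_examples: iterable of dicts with 'audio' and Estonian 'text' fields."""
    st_examples = []
    for ex in asr_examples:
        translation = mt_et_en(ex["text"])[0]["translation_text"]
        st_examples.append({
            "audio": ex["audio"],          # speech is reused unchanged
            "source_text": ex["text"],     # Estonian transcript
            "target_text": translation,    # synthetic English reference
        })
    return st_examples
```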
Dialect Adaptation and Data Augmentation for Low-Resource ASR: TalTech Systems for the MADASR 2023 Challenge
Alumäe, Tanel, Kong, Jiaming, Robnikov, Daniil
This paper describes the Tallinn University of Technology (TalTech) systems developed for the ASRU MADASR 2023 Challenge. The challenge focuses on automatic speech recognition of dialect-rich Indian languages with limited training audio and text data. TalTech participated in two tracks of the challenge: Track 1, which allowed using only the provided training data, and Track 3, which allowed using additional audio data. In both tracks, we relied on wav2vec2.0 models. Our methodology diverges from the traditional procedure of fine-tuning pretrained wav2vec2.0 models in two key respects: first, we apply aligned data augmentation to enhance the linguistic diversity of the training data, and second, we use deep prefix tuning for dialect adaptation of wav2vec2.0 models. In both tracks, our approach yielded significant improvements over the provided baselines, achieving the lowest word error rates among all participating teams.
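A heavily hedged sketch of one possible form of aligned data augmentation follows: given word-level alignments (word mapped to an audio snippet), snippets are spliced together to cover new sentences drawn from a text corpus, increasing linguistic diversity. The helper functions and data layout are hypothetical illustrations of the general idea; the paper's actual procedure may differ.

```python
# Illustrative aligned data augmentation: splice word-aligned audio snippets
# to synthesize audio for new sentences from a text corpus.
import random
import numpy as np

def build_word_bank(aligned_utterances):
    """aligned_utterances: list of (waveform, [(word, start_sample, end_sample), ...])."""
    bank = {}
    for wav, alignment in aligned_utterances:
        for word, start, end in alignment:
            bank.setdefault(word, []).append(wav[start:end])
    return bank

def synthesize_utterance(sentence, bank):
    """Concatenate randomly chosen aligned snippets for each word, if all are available."""
    pieces = []
    for word in sentence.split():
        if word not in bank:
            return None  # skip sentences the word bank cannot cover
        pieces.append(random.choice(bank[word]))
    return np.concatenate(pieces), sentence
```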
Robust Training of Vector Quantized Bottleneck Models
Łańcucki, Adrian, Chorowski, Jan, Sanchez, Guillaume, Marxer, Ricard, Chen, Nanxin, Dolfing, Hans J. G. A., Khurana, Sameer, Alumäe, Tanel, Laurent, Antoine
In this paper we demonstrate methods for reliable and efficient training of discrete representations using Vector-Quantized Variational Auto-Encoder (VQ-VAE) models. Discrete latent variable models have been shown to learn nontrivial representations of speech, applicable to unsupervised voice conversion and reaching state-of-the-art performance on unit discovery tasks. For unsupervised representation learning, they have become viable alternatives to continuous latent variable models such as the Variational Auto-Encoder (VAE). However, training deep discrete variable models is challenging due to the inherent non-differentiability of the discretization operation. In this paper we focus on VQ-VAE, a state-of-the-art discrete bottleneck model shown to perform on par with its continuous counterparts. It quantizes encoder outputs with on-line $k$-means clustering. We show that codebook learning can suffer from poor initialization and non-stationarity of the clustered encoder outputs. We demonstrate that these issues can be successfully overcome by increasing the learning rate for the codebook and by periodic data-dependent codeword re-initialization. As a result, we achieve more robust training across different tasks and significantly increase the usage of latent codewords even for large codebooks. This has practical benefits, for instance in unsupervised representation learning, where large codebooks may lead to disentanglement of latent representations.
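The two remedies named in the abstract lend themselves to a short sketch: give the codebook its own, larger learning rate and periodically re-initialize rarely used codewords from recent encoder outputs. The PyTorch module below is an illustrative sketch under those assumptions, not the authors' implementation.

```python
# Illustrative VQ bottleneck with usage tracking and data-dependent
# re-initialization of dead codewords (not the paper's exact code).
import torch
import torch.nn as nn

class VQBottleneck(nn.Module):
    def __init__(self, num_codes=512, dim=64):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(num_codes, dim))
        self.register_buffer("usage", torch.zeros(num_codes))

    def forward(self, z_e):                      # z_e: (batch, dim)
        dists = torch.cdist(z_e, self.codebook)  # distance to every codeword
        idx = dists.argmin(dim=1)
        self.usage += torch.bincount(idx, minlength=self.codebook.size(0))
        z_q = self.codebook[idx]
        # Straight-through estimator: gradients bypass the discrete choice.
        return z_e + (z_q - z_e).detach(), idx

    @torch.no_grad()
    def reinit_dead_codes(self, recent_encoder_outputs, min_usage=1.0):
        """Data-dependent reset: move unused codewords onto recent encoder outputs."""
        dead = (self.usage < min_usage).nonzero(as_tuple=True)[0]
        if len(dead) > 0:
            picks = torch.randint(0, recent_encoder_outputs.size(0), (len(dead),))
            self.codebook.data[dead] = recent_encoder_outputs[picks]
        self.usage.zero_()

# A separate, larger learning rate for the codebook could be set up as, e.g.:
# optim = torch.optim.Adam([
#     {"params": encoder.parameters(), "lr": 1e-4},
#     {"params": vq.codebook, "lr": 1e-3},   # higher rate for the codebook
# ])
```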