Collaborating Authors

 Kouzelis, Theodoros


EQ-VAE: Equivariance Regularized Latent Space for Improved Generative Image Modeling

arXiv.org Artificial Intelligence

Latent generative models have emerged as a leading approach for high-quality image synthesis. These models rely on an autoencoder to compress images into a latent space, followed by a generative model to learn the latent distribution. We identify that existing autoencoders lack equivariance to semantic-preserving transformations like scaling and rotation, resulting in complex latent spaces that hinder generative performance. To address this, we propose EQ-VAE, a simple regularization approach that enforces equivariance in the latent space, reducing its complexity without degrading reconstruction quality. By finetuning pre-trained autoencoders with EQ-VAE, we enhance the performance of several state-of-the-art generative models, including DiT, SiT, REPA and MaskGIT, achieving a 7× speedup on DiT-XL/2 with only five epochs of SD-VAE fine-tuning. EQ-VAE is compatible with both continuous and discrete autoencoders, thus offering a versatile enhancement for a wide range of latent generative models. Project page and code: https://eq-vae.github.io/.
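The equivariance property the abstract describes can be illustrated with a toy sketch: a latent space is equivariant to a transform when encoding a transformed image gives the same result as transforming the encoding. Here a 2×2 average pooling stands in for the VAE encoder, and a horizontal flip for the semantic-preserving transform; this is an illustrative sketch of the property being regularized, not the paper's actual loss.

```python
def encode(x):
    # Toy "encoder": 2x2 average pooling standing in for a VAE encoder.
    return [[(x[i][j] + x[i][j + 1] + x[i + 1][j] + x[i + 1][j + 1]) / 4
             for j in range(0, len(x[0]), 2)]
            for i in range(0, len(x), 2)]

def hflip(x):
    # A semantic-preserving transform: horizontal flip.
    return [row[::-1] for row in x]

def equivariance_penalty(x, transform):
    # EQ-VAE-style penalty (sketch): mismatch between encoding the
    # transformed image and transforming the encoded latent.
    a = encode(transform(x))
    b = transform(encode(x))
    return sum((u - v) ** 2 for ra, rb in zip(a, b) for u, v in zip(ra, rb))

x = [[float(4 * i + j) for j in range(4)] for i in range(4)]
print(equivariance_penalty(x, hflip))  # average pooling commutes with flips -> 0.0
```

A real VAE encoder does not commute with such transforms out of the box, which is exactly the gap an equivariance regularizer penalizes during finetuning.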


Weakly-supervised Automated Audio Captioning via text only training

arXiv.org Artificial Intelligence

In recent years, datasets of paired audio and captions have enabled remarkable success in automatically generating descriptions for audio clips, namely Automated Audio Captioning (AAC). However, it is labor-intensive and time-consuming to collect a sufficient number of paired audio clips and captions. Motivated by recent advances in Contrastive Language-Audio Pretraining (CLAP), we propose a weakly-supervised approach to train an AAC model assuming only text data and a pre-trained CLAP model, alleviating the need for paired target data. Our approach leverages the similarity between audio and text embeddings in CLAP.

Despite considerable effort, the data scarcity issue in audio captioning persists. The common AAC datasets, AudioCaps and Clotho, together contain 50k captions for training, whereas COCO Captions [8] provides 400k captions for image captioning. Kim et al. [9] observe that, due to the limited data, prior work designs decoders with shallow layers that fail to learn generalized language expressivity and overfit the small-scale target dataset, so their performance degrades sharply on out-of-domain data. Motivated by these limitations, we present an approach to AAC that only requires a pre-trained CLAP model.
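The core idea of the weakly-supervised setup rests on CLAP placing matching audio and text in a shared embedding space. A minimal sketch, with made-up embedding vectors standing in for real CLAP outputs and a nearest-caption lookup standing in for the trained decoder: the captioner only ever sees text embeddings during training, and the aligned audio embedding is swapped in at inference.

```python
from math import sqrt

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Toy stand-ins for CLAP embeddings (illustrative numbers, not real CLAP
# outputs): CLAP trains its audio and text encoders so matching pairs
# land close together in a shared space.
caption_embeddings = {
    "a dog barking":       [0.9, 0.1, 0.0],
    "rain on a window":    [0.0, 0.8, 0.2],
    "a car engine idling": [0.1, 0.1, 0.9],
}
audio_embedding = [0.85, 0.15, 0.05]  # assumed embedding of a barking clip

# Text-only training idea: train on text embeddings, then condition on
# the aligned audio embedding at test time.
best = max(caption_embeddings,
           key=lambda c: cosine(caption_embeddings[c], audio_embedding))
print(best)  # -> "a dog barking"
```

Because the two modalities share one space, the text-to-caption mapping learned in training transfers to audio inputs without any paired audio-caption data.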


Investigating Personalization Methods in Text to Music Generation

arXiv.org Artificial Intelligence

In this work, we investigate the personalization of text-to-music diffusion models in a few-shot setting. Motivated by recent advances in the computer vision domain, we are the first to explore the combination of pre-trained text-to-audio diffusers with two established personalization methods. We experiment with the effect of audio-specific data augmentation on the overall system performance and assess different training strategies. For evaluation, we construct a novel dataset with prompts and music clips. We consider both embedding-based and music-specific metrics for quantitative evaluation, as well as a user study for qualitative evaluation. Our analysis shows that similarity metrics are in accordance with user preferences and that current personalization approaches tend to learn rhythmic music constructs more easily than melody. The code, dataset, and example material of this study are open to the research community.


Weakly-supervised forced alignment of disfluent speech using phoneme-level modeling

arXiv.org Artificial Intelligence

The study of speech disorders can benefit greatly from time-aligned data. However, audio-text mismatches in disfluent speech cause rapid performance degradation for modern speech aligners, hindering the use of automatic approaches. In this work, we propose a simple and effective modification of the alignment graph construction of CTC-based models using Weighted Finite State Transducers. The proposed weakly-supervised approach alleviates the need for verbatim transcription of speech disfluencies for forced alignment. During the graph construction, we allow the modeling of common speech disfluencies, i.e., repetitions and omissions. Further, we show that by assessing the degree of audio-text mismatch through the use of Oracle Error Rate, our method can be effectively used in the wild. Our evaluation on a corrupted version of the TIMIT test set and the UCLASS dataset shows significant improvements, particularly for recall, achieving a 23-25% relative improvement over our baselines.
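The graph modification can be sketched in miniature: on top of the usual left-to-right forced-alignment graph, add self-loop arcs that let a phoneme be re-emitted (repetitions) and epsilon arcs that let a phoneme be skipped (omissions). The function below is a hypothetical illustration of that construction; the arc costs are placeholder penalties, not the paper's WFST weights.

```python
def alignment_arcs(phonemes, skip_cost=1.0, repeat_cost=0.5):
    # Sketch of a disfluency-aware alignment graph. Each arc is
    # (src_state, dst_state, emitted_phoneme_or_None, cost); state i
    # means "the first i phonemes have been consumed".
    arcs = []
    for i, ph in enumerate(phonemes):
        arcs.append((i, i + 1, ph, 0.0))           # canonical advance
        arcs.append((i, i, ph, repeat_cost))       # repetition: re-emit phoneme
        arcs.append((i, i + 1, None, skip_cost))   # omission: skip phoneme (epsilon)
    return arcs

arcs = alignment_arcs(["s", "p", "ee", "ch"])
print(len(arcs))  # 3 arcs per phoneme -> 12
```

A decoder searching this graph can then align audio that repeats or drops phonemes without requiring a verbatim transcript of the disfluencies.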


Sample-Efficient Unsupervised Domain Adaptation of Speech Recognition Systems: A Case Study for Modern Greek

arXiv.org Artificial Intelligence

Modern speech recognition systems exhibit rapid performance degradation under domain shift. This issue is especially prevalent in data-scarce settings, such as low-resource languages, where the diversity of training data is limited. In this work we propose M2DS2, a simple and sample-efficient finetuning strategy for large pretrained speech models, based on mixed source and target domain self-supervision. We find that including source domain self-supervision stabilizes training and avoids mode collapse of the latent representations. For evaluation, we collect HParl, a 120-hour speech corpus for Greek, consisting of plenary sessions in the Greek Parliament. We merge HParl with two popular Greek corpora to create GREC-MD, a test-bed for multi-domain evaluation of Greek ASR systems. In our experiments we find that, while other Unsupervised Domain Adaptation baselines fail in this resource-constrained environment, M2DS2 yields significant improvements for cross-domain adaptation, even when only a few hours of in-domain audio are available. When we relax the problem to a weakly supervised setting, we find that independent adaptation of the audio modality using M2DS2 and of the language modality using simple LM augmentation techniques is particularly effective, yielding word error rates comparable to the fully supervised baselines.
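The "mixed source and target domain self-supervision" can be sketched as a combined objective: a supervised loss on labeled source audio plus self-supervised terms on both source and target audio. The decomposition and the alpha/beta weights below are illustrative assumptions, not the paper's exact formulation.

```python
def m2ds2_loss(ctc_source, ssl_source, ssl_target, alpha=0.1, beta=0.1):
    # Sketch of a mixed finetuning objective (weights are assumptions):
    # supervised CTC loss on labeled source audio, plus self-supervised
    # losses on BOTH source and target audio. The source SSL term is
    # what the abstract reports stabilizes training and avoids mode
    # collapse of the latent representations.
    return ctc_source + alpha * ssl_source + beta * ssl_target

print(m2ds2_loss(1.0, 2.0, 3.0))  # 1.0 + 0.1*2.0 + 0.1*3.0 = 1.5
```

Dropping the source self-supervision term (alpha = 0) recovers the target-only adaptation setup that, per the abstract, is prone to collapsed representations.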