Collaborating Authors

 Kouzelis, Theodoros


EQ-VAE: Equivariance Regularized Latent Space for Improved Generative Image Modeling

arXiv.org Artificial Intelligence

Latent generative models have emerged as a leading approach for high-quality image synthesis. These models rely on an autoencoder to compress images into a latent space, followed by a generative model to learn the latent distribution. We identify that existing autoencoders lack equivariance to semantic-preserving transformations like scaling and rotation, resulting in complex latent spaces that hinder generative performance. To address this, we propose EQ-VAE, a simple regularization approach that enforces equivariance in the latent space, reducing its complexity without degrading reconstruction quality. By finetuning pre-trained autoencoders with EQ-VAE, we enhance the performance of several state-of-the-art generative models, including DiT, SiT, REPA and MaskGIT, achieving a 7× speedup on DiT-XL/2 with only five epochs of SD-VAE fine-tuning. EQ-VAE is compatible with both continuous and discrete autoencoders, thus offering a versatile enhancement for a wide range of latent generative models. Project page and code: https://eq-vae.github.io/.
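The equivariance property the abstract describes can be illustrated with a toy sketch: a latent space is equivariant to a transform when encoding a transformed image gives the same result as transforming the encoding. Here a 2×2 average pooling stands in for the VAE encoder, and a horizontal flip for the semantic-preserving transform; this is an illustrative sketch of the property being regularized, not the paper's actual loss.

```python
def encode(x):
    # Toy "encoder": 2x2 average pooling standing in for a VAE encoder.
    return [[(x[i][j] + x[i][j + 1] + x[i + 1][j] + x[i + 1][j + 1]) / 4
             for j in range(0, len(x[0]), 2)]
            for i in range(0, len(x), 2)]

def hflip(x):
    # A semantic-preserving transform: horizontal flip.
    return [row[::-1] for row in x]

def equivariance_penalty(x, transform):
    # EQ-VAE-style penalty (sketch): mismatch between encoding the
    # transformed image and transforming the encoded latent.
    a = encode(transform(x))
    b = transform(encode(x))
    return sum((u - v) ** 2 for ra, rb in zip(a, b) for u, v in zip(ra, rb))

x = [[float(4 * i + j) for j in range(4)] for i in range(4)]
print(equivariance_penalty(x, hflip))  # average pooling commutes with flips -> 0.0
```

A real VAE encoder does not commute with such transforms out of the box, which is exactly the gap an equivariance regularizer penalizes during finetuning.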


Weakly-supervised Automated Audio Captioning via text only training

arXiv.org Artificial Intelligence

In recent years, datasets of paired audio and captions have enabled remarkable success in automatically generating descriptions for audio clips, namely Automated Audio Captioning (AAC). However, it is labor-intensive and time-consuming to collect a sufficient number of paired audio clips and captions. Motivated by recent advances in Contrastive Language-Audio Pretraining (CLAP), we propose a weakly-supervised approach to train an AAC model assuming only text data and a pre-trained CLAP model, alleviating the need for paired target data. Our approach leverages the similarity between audio and text embeddings in CLAP.

Despite considerable effort, the data scarcity issue in audio captioning persists. The common AAC datasets, AudioCaps and Clotho, together contain 50k captions for training, whereas COCO Captions [8] provides 400k captions for image captioning. Kim et al. [9] observe that, due to the limited data, prior work designs decoders with shallow layers that fail to learn generalized language expressivity and overfit the small-scale target dataset, so their performance degrades sharply on out-of-domain data. Motivated by these limitations, we present an approach to AAC that only requires a pre-trained CLAP model.
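The core idea of the weakly-supervised setup rests on CLAP placing matching audio and text in a shared embedding space. A minimal sketch, with made-up embedding vectors standing in for real CLAP outputs and a nearest-caption lookup standing in for the trained decoder: the captioner only ever sees text embeddings during training, and the aligned audio embedding is swapped in at inference.

```python
from math import sqrt

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Toy stand-ins for CLAP embeddings (illustrative numbers, not real CLAP
# outputs): CLAP trains its audio and text encoders so matching pairs
# land close together in a shared space.
caption_embeddings = {
    "a dog barking":       [0.9, 0.1, 0.0],
    "rain on a window":    [0.0, 0.8, 0.2],
    "a car engine idling": [0.1, 0.1, 0.9],
}
audio_embedding = [0.85, 0.15, 0.05]  # assumed embedding of a barking clip

# Text-only training idea: train on text embeddings, then condition on
# the aligned audio embedding at test time.
best = max(caption_embeddings,
           key=lambda c: cosine(caption_embeddings[c], audio_embedding))
print(best)  # -> "a dog barking"
```

Because the two modalities share one space, the text-to-caption mapping learned in training transfers to audio inputs without any paired audio-caption data.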


Investigating Personalization Methods in Text to Music Generation

arXiv.org Artificial Intelligence

In this work, we investigate the personalization of text-to-music diffusion models in a few-shot setting. Motivated by recent advances in the computer vision domain, we are the first to explore the combination of pre-trained text-to-audio diffusers with two established personalization methods. We experiment with the effect of audio-specific data augmentation on the overall system performance and assess different training strategies. For evaluation, we construct a novel dataset with prompts and music clips. We consider both embedding-based and music-specific metrics for quantitative evaluation, as well as a user study for qualitative evaluation. Our analysis shows that similarity metrics are in accordance with user preferences and that current personalization approaches tend to learn rhythmic music constructs more easily than melody. The code, dataset, and example material of this study are open to the research community.


Weakly-supervised forced alignment of disfluent speech using phoneme-level modeling

arXiv.org Artificial Intelligence

The study of speech disorders can benefit greatly from time-aligned data. However, audio-text mismatches in disfluent speech cause rapid performance degradation for modern speech aligners, hindering the use of automatic approaches. In this work, we propose a simple and effective modification of the alignment graph construction of CTC-based models using Weighted Finite State Transducers. The proposed weakly-supervised approach alleviates the need for verbatim transcription of speech disfluencies for forced alignment. During the graph construction, we allow the modeling of common speech disfluencies, i.e., repetitions and omissions. Further, we show that by assessing the degree of audio-text mismatch through the use of Oracle Error Rate, our method can be effectively used in the wild. Our evaluation on a corrupted version of the TIMIT test set and the UCLASS dataset shows significant improvements, particularly for recall, achieving a 23-25% relative improvement over our baselines.
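The graph modification can be sketched in miniature: on top of the usual left-to-right forced-alignment graph, add self-loop arcs that let a phoneme be re-emitted (repetitions) and epsilon arcs that let a phoneme be skipped (omissions). The function below is a hypothetical illustration of that construction; the arc costs are placeholder penalties, not the paper's WFST weights.

```python
def alignment_arcs(phonemes, skip_cost=1.0, repeat_cost=0.5):
    # Sketch of a disfluency-aware alignment graph. Each arc is
    # (src_state, dst_state, emitted_phoneme_or_None, cost); state i
    # means "the first i phonemes have been consumed".
    arcs = []
    for i, ph in enumerate(phonemes):
        arcs.append((i, i + 1, ph, 0.0))           # canonical advance
        arcs.append((i, i, ph, repeat_cost))       # repetition: re-emit phoneme
        arcs.append((i, i + 1, None, skip_cost))   # omission: skip phoneme (epsilon)
    return arcs

arcs = alignment_arcs(["s", "p", "ee", "ch"])
print(len(arcs))  # 3 arcs per phoneme -> 12
```

A decoder searching this graph can then align audio that repeats or drops phonemes without requiring a verbatim transcript of the disfluencies.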


Sample-Efficient Unsupervised Domain Adaptation of Speech Recognition Systems: A Case Study for Modern Greek

arXiv.org Artificial Intelligence

Modern speech recognition systems exhibit rapid performance degradation under domain shift. This issue is especially prevalent in data-scarce settings, such as low-resource languages, where the diversity of training data is limited. In this work we propose M2DS2, a simple and sample-efficient finetuning strategy for large pretrained speech models, based on mixed source and target domain self-supervision. We find that including source domain self-supervision stabilizes training and avoids mode collapse of the latent representations. For evaluation, we collect HParl, a 120-hour speech corpus for Greek, consisting of plenary sessions in the Greek Parliament. We merge HParl with two popular Greek corpora to create GREC-MD, a test-bed for multi-domain evaluation of Greek ASR systems. In our experiments we find that, while other Unsupervised Domain Adaptation baselines fail in this resource-constrained environment, M2DS2 yields significant improvements for cross-domain adaptation, even when only a few hours of in-domain audio are available. When we relax the problem to a weakly supervised setting, we find that independent adaptation of the audio modality using M2DS2 and of the language modality using simple LM augmentation techniques is particularly effective, yielding word error rates comparable to the fully supervised baselines.
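The "mixed source and target domain self-supervision" can be sketched as a combined objective: a supervised loss on labeled source audio plus self-supervised terms on both source and target audio. The decomposition and the alpha/beta weights below are illustrative assumptions, not the paper's exact formulation.

```python
def m2ds2_loss(ctc_source, ssl_source, ssl_target, alpha=0.1, beta=0.1):
    # Sketch of a mixed finetuning objective (weights are assumptions):
    # supervised CTC loss on labeled source audio, plus self-supervised
    # losses on BOTH source and target audio. The source SSL term is
    # what the abstract reports stabilizes training and avoids mode
    # collapse of the latent representations.
    return ctc_source + alpha * ssl_source + beta * ssl_target

print(m2ds2_loss(1.0, 2.0, 3.0))  # 1.0 + 0.1*2.0 + 0.1*3.0 = 1.5
```

Dropping the source self-supervision term (alpha = 0) recovers the target-only adaptation setup that, per the abstract, is prone to collapsed representations.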