AITopics | Mehta, Shivam

Collaborating Authors

Mehta, Shivam

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Make Some Noise: Towards LLM audio reasoning and generation using sound tokens

Mehta, Shivam, Jojic, Nebojsa, Gamper, Hannes

arXiv.org Artificial IntelligenceMar-28-2025

Integrating audio comprehension and generation into large language models (LLMs) remains challenging due to the continuous nature of audio and the resulting high sampling rates. Here, we introduce a novel approach that combines Variational Quantization with Conditional Flow Matching to convert audio into ultra-low bitrate discrete tokens of 0.23kpbs, allowing for seamless integration with text tokens in LLMs. We fine-tuned a pretrained text-based LLM using Low-Rank Adaptation (LoRA) to assess its effectiveness in achieving true multimodal capabilities, i.e., audio comprehension and generation. Our tokenizer outperforms a traditional VQ-VAE across various datasets with diverse acoustic events. Despite the substantial loss of fine-grained details through audio tokenization, our multimodal LLM trained with discrete tokens achieves competitive results in audio comprehension with state-of-the-art methods, though audio generation is poor. Our results highlight the need for larger, more diverse datasets and improved evaluation metrics to advance multimodal LLM performance.

arxiv preprint arxiv, large language model, natural language, (13 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/ICASSP49660.2025.10888809

2503.22275

Country: Europe (0.46)

Genre:

Research Report > Promising Solution (0.68)
Research Report > New Finding (0.48)

Industry:

Media > Music (0.47)
Leisure & Entertainment (0.47)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

Matcha-TTS: A fast TTS architecture with conditional flow matching

Mehta, Shivam, Tu, Ruibo, Beskow, Jonas, Székely, Éva, Henter, Gustav Eje

arXiv.org Artificial IntelligenceJan-9-2024

We introduce Matcha-TTS, a new encoder-decoder architecture for speedy TTS acoustic modelling, trained using optimal-transport conditional flow matching (OT-CFM). This yields an ODE-based decoder capable of high output quality in fewer synthesis steps than models trained using score matching. Careful design choices additionally ensure each synthesis step is fast to run. The method is probabilistic, non-autoregressive, and learns to speak from scratch without external alignments. Compared to strong pre-trained baseline models, the Matcha-TTS system has the smallest memory footprint, rivals the speed of the fastest models on long utterances, and attains the highest mean opinion score in a listening test. Please see https://shivammehta25.github.io/Matcha-TTS/ for audio examples, code, and pre-trained models.

machine learning, natural language, proc, (21 more...)

arXiv.org Artificial Intelligence

2309.03199

Country: Europe > Sweden (0.14)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)

Add feedback

Unified speech and gesture synthesis using flow matching

Mehta, Shivam, Tu, Ruibo, Alexanderson, Simon, Beskow, Jonas, Székely, Éva, Henter, Gustav Eje

arXiv.org Artificial IntelligenceJan-9-2024

As text-to-speech technologies achieve remarkable naturalness in read-aloud tasks, there is growing interest in multimodal synthesis of verbal and non-verbal communicative behaviour, such as spontaneous speech and associated body gestures. This paper presents a novel, unified architecture for jointly synthesising speech acoustics and skeleton-based 3D gesture motion from text, trained using optimal-transport conditional flow matching (OT-CFM). The proposed architecture is simpler than the previous state of the art, has a smaller memory footprint, and can capture the joint distribution of speech and gestures, generating both modalities together in one single process. The new training regime, meanwhile, enables better synthesis quality in much fewer steps (network evaluations) than before. Uni- and multimodal subjective tests demonstrate improved speech naturalness, gesture human-likeness, and cross-modal appropriateness compared to existing benchmarks. Please see https://shivammehta25.github.io/Match-TTSG/ for video examples and code.

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2310.05181

Country: Europe > Sweden (0.14)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Speech (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Diffusion-Based Co-Speech Gesture Generation Using Joint Text and Audio Representation

Deichler, Anna, Mehta, Shivam, Alexanderson, Simon, Beskow, Jonas

arXiv.org Artificial IntelligenceSep-11-2023

This paper describes a system developed for the GENEA (Generation and Evaluation of Non-verbal Behaviour for Embodied Agents) Challenge 2023. Our solution builds on an existing diffusion-based motion synthesis model. We propose a contrastive speech and motion pretraining (CSMP) module, which learns a joint embedding for speech and gesture with the aim to learn a semantic coupling between these modalities. The output of the CSMP module is used as a conditioning signal in the diffusion-based gesture synthesis model in order to achieve semantically-aware co-speech gesture generation. Our entry achieved highest human-likeness and highest speech appropriateness rating among the submitted entries. This indicates that our system is a promising approach to achieve human-like co-speech gestures in agents that carry semantic meaning.

artificial intelligence, machine learning, natural language, (13 more...)

arXiv.org Artificial Intelligence

doi: 10.1145/3577190.3616117

2309.05455

Country:

Europe (0.96)
North America > United States > New York (0.14)
Asia > Middle East > Israel (0.14)

Genre: Research Report > Experimental Study (0.93)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.88)

Add feedback

Diff-TTSG: Denoising probabilistic integrated speech and gesture synthesis

Mehta, Shivam, Wang, Siyang, Alexanderson, Simon, Beskow, Jonas, Székely, Éva, Henter, Gustav Eje

arXiv.org Artificial IntelligenceAug-9-2023

With read-aloud speech synthesis achieving high naturalness scores, there is a growing research interest in synthesising spontaneous speech. However, human spontaneous face-to-face conversation has both spoken and non-verbal aspects (here, co-speech gestures). Only recently has research begun to explore the benefits of jointly synthesising these two modalities in a single system. The previous state of the art used non-probabilistic methods, which fail to capture the variability of human speech and motion, and risk producing oversmoothing artefacts and sub-optimal synthesis quality. We present the first diffusion-based probabilistic model, called Diff-TTSG, that jointly learns to synthesise speech and gestures together. Our method can be trained on small datasets from scratch. Furthermore, we describe a set of careful uni- and multi-modal subjective tests for evaluating integrated speech and gesture synthesis systems, and use them to validate our proposed approach. Please see https://shivammehta25.github.io/Diff-TTSG/ for video examples, data, and code.

artificial intelligence, denoising probabilistic, speech and gesture synthesis, (1 more...)

arXiv.org Artificial Intelligence

doi: 10.21437/SSW.2023-24

2306.09417

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Speech (0.53)

Add feedback

Prosody-controllable spontaneous TTS with neural HMMs

Lameris, Harm, Mehta, Shivam, Henter, Gustav Eje, Gustafson, Joakim, Székely, Éva

arXiv.org Artificial IntelligenceJun-1-2023

Spontaneous speech has many affective and pragmatic functions that are interesting and challenging to model in TTS. However, the presence of reduced articulation, fillers, repetitions, and other disfluencies in spontaneous speech make the text and acoustics less aligned than in read speech, which is problematic for attention-based TTS. We propose a TTS architecture that can rapidly learn to speak from small and irregular datasets, while also reproducing the diversity of expressive phenomena present in spontaneous speech. Specifically, we add utterance-level prosody control to an existing neural HMM-based TTS system which is capable of stable, monotonic alignments for spontaneous speech. We objectively evaluate control accuracy and perform perceptual tests that demonstrate that prosody control does not degrade synthesis quality. To exemplify the power of combining prosody control and ecologically valid data for reproducing intricate spontaneous speech phenomena, we evaluate the system's capability of synthesizing two types of creaky voice. Audio samples are available at https://www.speech.kth.se/tts-demos/prosodic-hmm/

artificial intelligence, speech, speech recognition, (19 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/ICASSP49357.2023.10097200

2211.13533

Country: Europe > Sweden (0.14)

Genre: Research Report (0.65)

Technology: Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)

Add feedback

OverFlow: Putting flows on top of neural transducers for better TTS

Mehta, Shivam, Kirkland, Ambika, Lameris, Harm, Beskow, Jonas, Székely, Éva, Henter, Gustav Eje

arXiv.org Artificial IntelligenceMay-29-2023

Neural HMMs are a type of neural transducer recently proposed for sequence-to-sequence modelling in text-to-speech. They combine the best features of classic statistical speech synthesis and modern neural TTS, requiring less data and fewer training updates, and are less prone to gibberish output caused by neural attention failures. In this paper, we combine neural HMM TTS with normalising flows for describing the highly non-Gaussian distribution of speech acoustics. The result is a powerful, fully probabilistic model of durations and acoustics that can be trained using exact maximum likelihood. Experiments show that a system based on our proposal needs fewer updates than comparable methods to produce accurate pronunciations and a subjective speech quality close to natural speech. Please see https://shivammehta25.github.io/OverFlow/ for audio examples and code.

artificial intelligence, neural transducer, speech synthesis, (2 more...)

arXiv.org Artificial Intelligence

doi: 10.21437/Interspeech.2023-1996

2211.06892

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.53)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.53)

Add feedback