AITopics | Mehrish, Ambuj

Collaborating Authors

Mehrish, Ambuj

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

PROEMO: Prompt-Driven Text-to-Speech Synthesis Based on Emotion and Intensity Control

Zhang, Shaozuo, Mehrish, Ambuj, Li, Yingting, Poria, Soujanya

arXiv.org Artificial IntelligenceJan-10-2025

Speech synthesis has significantly advanced from statistical methods to deep neural network architectures, leading to various text-to-speech (TTS) models that closely mimic human speech patterns. However, capturing nuances such as emotion and style in speech synthesis is challenging. To address this challenge, we introduce an approach centered on prompt-based emotion control. The proposed architecture incorporates emotion and intensity control across multi-speakers. Furthermore, we leverage large language models (LLMs) to manipulate speech prosody while preserving linguistic content. Using embedding emotional cues, regulating intensity levels, and guiding prosodic variations with prompts, our approach infuses synthesized speech with human-like expressiveness and variability. Lastly, we demonstrate the effectiveness of our approach through a systematic exploration of the control mechanisms mentioned above.

artificial intelligence, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2501.06276

Country: Asia > China (0.14)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Synthesis (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.91)

Add feedback

TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization

Hung, Chia-Yu, Majumder, Navonil, Kong, Zhifeng, Mehrish, Ambuj, Valle, Rafael, Catanzaro, Bryan, Poria, Soujanya

arXiv.org Artificial IntelligenceDec-30-2024

A key challenge in aligning TTA models lies in the difficulty of creating preference pairs, as TTA lacks structured mechanisms like verifiable rewards or gold-standard answers available for Large Language Models (LLMs). We demonstrate that the audio preference dataset generated using CRPO outperforms existing alternatives. We open source all code and models to support further research in TTA generation. Audio plays a vital role in daily life and creative industries, from enhancing communication and storytelling to enriching experiences in music, sound effects, and podcasts. Recent advancements in text-to-audio (TTA) generation (Majumder et al., 2024; Ghosal et al., 2023; Liu et al., 2023; 2024b; Xue et al., 2024; Vyas et al., 2023; Huang et al., 2023b;a) and offer a transformative approach, enabling the automatic creation of diverse and expressive audio content directly from textual descriptions. This technology holds immense potential to streamline audio production workflows and unlock new possibilities in multimedia content creation. However, many existing models face challenges with controllability, occasionally struggling to fully capture the details in the input prompts, especially when the prompts are complex. This can sometimes result in generated audio that omits certain events or diverges from the user intent. At times, the generated audio may even contain input-adjacent, but unmentioned and unintended, events, that could be characterized as hallucinations. In contrast, the recent advancements in Large Language Models (LLMs) (Ouyang et al., 2022) have been significantly driven by the alignment stage after pre-training and supervised fine-tuning. This alignment stage, often leveraging reinforcement learning from human feedback (RLHF) or other reward-based optimization methods, endows the generated outputs with human preferences, ethical considerations, and task-specific requirements (Ouyang et al., 2022).

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2412.21037

Country: North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)

Genre: Research Report > New Finding (0.67)

Industry: Leisure & Entertainment (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

DART: Disentanglement of Accent and Speaker Representation in Multispeaker Text-to-Speech

Melechovsky, Jan, Mehrish, Ambuj, Sisman, Berrak, Herremans, Dorien

arXiv.org Artificial IntelligenceOct-17-2024

Recent advancements in Text-to-Speech (TTS) systems have enabled the generation of natural and expressive speech from textual input. Accented TTS aims to enhance user experience by making the synthesized speech more relatable to minority group listeners, and useful across various applications and context. Speech synthesis can further be made more flexible by allowing users to choose any combination of speaker identity and accent, resulting in a wide range of personalized speech outputs. Current models struggle to disentangle speaker and accent representation, making it difficult to accurately imitate different accents while maintaining the same speaker characteristics. We propose a novel approach to disentangle speaker and accent representations using multi-level variational autoencoders (ML-VAE) and vector quantization (VQ) to improve flexibility and enhance personalization in speech synthesis. Our proposed method addresses the challenge of effectively separating speaker and accent characteristics, enabling more fine-grained control over the synthesized speech. Code and speech samples are publicly available.

artificial intelligence, machine learning, speech synthesis, (13 more...)

arXiv.org Artificial Intelligence

2410.13342

Country: North America > United States (0.28)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Synthesis (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Reward Steering with Evolutionary Heuristics for Decoding-time Alignment

Hung, Chia-Yu, Majumder, Navonil, Mehrish, Ambuj, Poria, Soujanya

arXiv.org Artificial IntelligenceJul-8-2024

The widespread applicability and increasing omnipresence of LLMs have instigated a need to align LLM responses to user and stakeholder preferences. Many preference optimization approaches have been proposed that fine-tune LLM parameters to achieve good alignment. However, such parameter tuning is known to interfere with model performance on many tasks. Moreover, keeping up with shifting user preferences is tricky in such a situation. Decoding-time alignment with reward model guidance solves these issues at the cost of increased inference time. However, most of such methods fail to strike the right balance between exploration and exploitation of reward -- often due to the conflated formulation of these two aspects - to give well-aligned responses. To remedy this we decouple these two aspects and implement them in an evolutionary fashion: exploration is enforced by decoding from mutated instructions and exploitation is represented as the periodic replacement of poorly-rewarded generations with well-rewarded ones. Empirical evidences indicate that this strategy outperforms many preference optimization and decode-time alignment approaches on two widely accepted alignment benchmarks AlpacaEval 2 and MT-Bench. Our implementation will be available at: https://darwin-alignment.github.io.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2406.15193

Genre: Research Report > New Finding (1.00)

Industry: Energy > Oil & Gas > Upstream (0.66)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Improving Text-To-Audio Models with Synthetic Captions

Kong, Zhifeng, Lee, Sang-gil, Ghosal, Deepanway, Majumder, Navonil, Mehrish, Ambuj, Valle, Rafael, Poria, Soujanya, Catanzaro, Bryan

arXiv.org Artificial IntelligenceJul-8-2024

It is an open challenge to obtain high quality training data, especially captions, for text-to-audio models. Although prior methods have leveraged \textit{text-only language models} to augment and improve captions, such methods have limitations related to scale and coherence between audio and captions. In this work, we propose an audio captioning pipeline that uses an \textit{audio language model} to synthesize accurate and diverse captions for audio at scale. We leverage this pipeline to produce a dataset of synthetic captions for AudioSet, named \texttt{AF-AudioSet}, and then evaluate the benefit of pre-training text-to-audio models on these synthetic captions. Through systematic evaluations on AudioCaps and MusicCaps, we find leveraging our pipeline and synthetic captions leads to significant improvements on audio generation quality, achieving a new \textit{state-of-the-art}.

caption, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2406.15487

Genre: Research Report (1.00)

Industry:

Leisure & Entertainment (1.00)
Media > Music (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.34)

Add feedback

Accent Conversion in Text-To-Speech Using Multi-Level VAE and Adversarial Training

Melechovsky, Jan, Mehrish, Ambuj, Sisman, Berrak, Herremans, Dorien

arXiv.org Artificial IntelligenceJun-3-2024

With rapid globalization, the need to build inclusive and representative speech technology cannot be overstated. Accent is an important aspect of speech that needs to be taken into consideration while building inclusive speech synthesizers. Inclusive speech technology aims to erase any biases towards specific groups, such as people of certain accent. We note that state-of-the-art Text-to-Speech (TTS) systems may currently not be suitable for all people, regardless of their background, as they are designed to generate high-quality voices without focusing on accent. In this paper, we propose a TTS model that utilizes a Multi-Level Variational Autoencoder with adversarial learning to address accented speech synthesis and conversion in TTS, with a vision for more inclusive systems in the future. We evaluate the performance through both objective metrics and subjective listening tests. The results show an improvement in accent conversion ability compared to the baseline.

accent classifier, artificial intelligence, machine learning, (10 more...)

arXiv.org Artificial Intelligence

2406.01018

Country: North America > United States (0.14)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Synthesis (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

HyperTTS: Parameter Efficient Adaptation in Text to Speech using Hypernetworks

Li, Yingting, Bhardwaj, Rishabh, Mehrish, Ambuj, Cheng, Bo, Poria, Soujanya

arXiv.org Artificial IntelligenceApr-6-2024

Neural speech synthesis, or text-to-speech (TTS), aims to transform a signal from the text domain to the speech domain. While developing TTS architectures that train and test on the same set of speakers has seen significant improvements, out-of-domain speaker performance still faces enormous limitations. Domain adaptation on a new set of speakers can be achieved by fine-tuning the whole model for each new domain, thus making it parameter-inefficient. This problem can be solved by Adapters that provide a parameter-efficient alternative to domain adaptation. Although famous in NLP, speech synthesis has not seen much improvement from Adapters. In this work, we present HyperTTS, which comprises a small learnable network, "hypernetwork", that generates parameters of the Adapter blocks, allowing us to condition Adapters on speaker representations and making them dynamic. Extensive evaluations of two domain adaptation settings demonstrate its effectiveness in achieving state-of-the-art performance in the parameter-efficient regime. We also compare different variants of HyperTTS, comparing them with baselines in different studies. Promising results on the dynamic adaptation of adapter parameters using hypernetworks open up new avenues for domain-generic multi-speaker TTS systems. The audio samples and code are available at https://github.com/declare-lab/HyperTTS.

adaptation, artificial intelligence, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2404.04645

Country:

Asia > China (0.14)
Europe > Czechia (0.14)
Asia > Middle East > UAE (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Synthesis (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

CM-TTS: Enhancing Real Time Text-to-Speech Synthesis Efficiency through Weighted Samplers and Consistency Models

Li, Xiang, Bu, Fan, Mehrish, Ambuj, Li, Yingting, Han, Jiale, Cheng, Bo, Poria, Soujanya

arXiv.org Artificial IntelligenceMar-31-2024

Neural Text-to-Speech (TTS) systems find broad applications in voice assistants, e-learning, and audiobook creation. The pursuit of modern models, like Diffusion Models (DMs), holds promise for achieving high-fidelity, real-time speech synthesis. Yet, the efficiency of multi-step sampling in Diffusion Models presents challenges. Efforts have been made to integrate GANs with DMs, speeding up inference by approximating denoising distributions, but this introduces issues with model convergence due to adversarial training. To overcome this, we introduce CM-TTS, a novel architecture grounded in consistency models (CMs). Drawing inspiration from continuous-time diffusion models, CM-TTS achieves top-quality speech synthesis in fewer steps without adversarial training or pre-trained model dependencies. We further design weighted samplers to incorporate different sampling positions into model training with dynamic probabilities, ensuring unbiased learning throughout the entire training process. We present a real-time mel-spectrogram generation consistency model, validated through comprehensive evaluations. Experimental results underscore CM-TTS's superiority over existing single-step speech synthesis systems, representing a significant advancement in the field.

artificial intelligence, cm-tts, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2404.00569

Country: Asia > China (0.14)

Genre: Research Report > New Finding (1.00)

Industry:

Education (0.54)
Energy (0.46)
Media (0.34)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Synthesis (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Text-to-Audio Generation using Instruction-Tuned LLM and Latent Diffusion Model

Ghosal, Deepanway, Majumder, Navonil, Mehrish, Ambuj, Poria, Soujanya

arXiv.org Artificial IntelligenceMay-29-2023

The immense scale of the recent large language models (LLM) allows many interesting properties, such as, instruction- and chain-of-thought-based fine-tuning, that has significantly improved zero- and few-shot performance in many natural language processing (NLP) tasks. Inspired by such successes, we adopt such an instruction-tuned LLM Flan-T5 as the text encoder for text-to-audio (TTA) generation -- a task where the goal is to generate an audio from its textual description. The prior works on TTA either pre-trained a joint text-audio encoder or used a non-instruction-tuned model, such as, T5. Consequently, our latent diffusion model (LDM)-based approach TANGO outperforms the state-of-the-art AudioLDM on most metrics and stays comparable on the rest on AudioCaps test set, despite training the LDM on a 63 times smaller dataset and keeping the text encoder frozen. This improvement might also be attributed to the adoption of audio pressure level-based sound mixing for training set augmentation, whereas the prior methods take a random mix.

artificial intelligence, machine learning, natural language, (14 more...)

arXiv.org Artificial Intelligence

2304.13731

Country: South America (0.28)

Genre: Research Report > New Finding (0.46)

Industry:

Media > Music (0.46)
Leisure & Entertainment (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)

Add feedback

ADAPTERMIX: Exploring the Efficacy of Mixture of Adapters for Low-Resource TTS Adaptation

Mehrish, Ambuj, Kashyap, Abhinav Ramesh, Yingting, Li, Majumder, Navonil, Poria, Soujanya

arXiv.org Artificial IntelligenceMay-29-2023

There are significant challenges for speaker adaptation in text-to-speech for languages that are not widely spoken or for speakers with accents or dialects that are not well-represented in the training data. To address this issue, we propose the use of the "mixture of adapters" method. This approach involves adding multiple adapters within a backbone-model layer to learn the unique characteristics of different speakers. Our approach outperforms the baseline, with a noticeable improvement of 5% observed in speaker preference tests when using only one minute of data for each new speaker. Moreover, following the adapter paradigm, we fine-tune only the adapter parameters (11% of the total model parameters). This is a significant achievement in parameter-efficient speaker adaptation, and one of the first models of its kind. Overall, our proposed approach offers a promising solution to the speech synthesis techniques, particularly for adapting to speakers from diverse backgrounds.

artificial intelligence, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2305.18028

Country: Asia > China (0.14)

Genre: Research Report > Promising Solution (0.48)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback