AITopics | Kumar, Anurag

Plotting

Kumar, Anurag

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

SEAL: Speaker Error Correction using Acoustic-conditioned Large Language Models

Kumar, Anurag, Paturi, Rohit, Afshan, Amber, Srinivasan, Sundararajan

arXiv.org Artificial IntelligenceJan-14-2025

Speaker Diarization (SD) is a crucial component of modern end-to-end ASR pipelines. Traditional SD systems, which are typically audio-based and operate independently of ASR, often introduce speaker errors, particularly during speaker transitions and overlapping speech. Recently, language models including fine-tuned large language models (LLMs) have shown to be effective as a second-pass speaker error corrector by leveraging lexical context in the transcribed output. In this work, we introduce a novel acoustic conditioning approach to provide more fine-grained information from the acoustic diarizer to the LLM. We also show that a simpler constrained decoding strategy reduces LLM hallucinations, while avoiding complicated post-processing. Our approach significantly reduces the speaker error rates by 24-43% across Fisher, Callhome, and RT03-CTS datasets, compared to the first-pass Acoustic SD.

artificial intelligence, large language model, natural language, (12 more...)

arXiv.org Artificial Intelligence

2501.08421

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

Bridging Context Gaps: Enhancing Comprehension in Long-Form Social Conversations Through Contextualized Excerpts

Mohanty, Shrestha, Xuan, Sarah, Jobraeel, Jacob, Kumar, Anurag, Roy, Deb, Kabbara, Jad

arXiv.org Artificial IntelligenceDec-27-2024

We focus on enhancing comprehension in small-group recorded conversations, which serve as a medium to bring people together and provide a space for sharing personal stories and experiences on crucial social matters. One way to parse and convey information from these conversations is by sharing highlighted excerpts in subsequent conversations. This can help promote a collective understanding of relevant issues, by highlighting perspectives and experiences to other groups of people who might otherwise be unfamiliar with and thus unable to relate to these experiences. The primary challenge that arises then is that excerpts taken from one conversation and shared in another setting might be missing crucial context or key elements that were previously introduced in the original conversation. This problem is exacerbated when conversations become lengthier and richer in themes and shared experiences. To address this, we explore how Large Language Models (LLMs) can enrich these excerpts by providing socially relevant context. We present approaches for effective contextualization to improve comprehension, readability, and empathy. We show significant improvements in understanding, as assessed through subjective and objective evaluations. While LLMs can offer valuable context, they struggle with capturing key social aspects. We release the Human-annotated Salient Excerpts (HSE) dataset to support future work. Additionally, we show how context-enriched excerpts can provide more focused and comprehensive conversation summaries.

excerpt, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2412.19966

Country:

North America > United States > Maine (0.14)
North America > United States > Massachusetts (0.14)

Genre: Research Report (0.82)

Industry: Education > Educational Setting > K-12 Education (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Add feedback

Scaling Concept With Text-Guided Diffusion Models

Huang, Chao, Liang, Susan, Tang, Yunlong, Tian, Yapeng, Kumar, Anurag, Xu, Chenliang

arXiv.org Artificial IntelligenceOct-31-2024

Text-guided diffusion models have revolutionized generative tasks by producing high-fidelity content from text descriptions. They have also enabled an editing paradigm where concepts can be replaced through text conditioning (e.g., a dog to a tiger). In this work, we explore a novel approach: instead of replacing a concept, can we enhance or suppress the concept itself? Through an empirical study, we identify a trend where concepts can be decomposed in text-guided diffusion models. Leveraging this insight, we introduce ScalingConcept, a simple yet effective method to scale decomposed concepts up or down in real input without introducing new elements. To systematically evaluate our approach, we present the WeakConcept-10 dataset, where concepts are imperfect and need to be enhanced. More importantly, ScalingConcept enables a variety of novel zero-shot applications across image and audio domains, including tasks such as canonical pose generation and generative sound highlighting or removal.

diffusion model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2410.24151

Genre: Research Report (0.84)

Industry: Media (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

AV-CrossNet: an Audiovisual Complex Spectral Mapping Network for Speech Separation By Leveraging Narrow- and Cross-Band Modeling

Kalkhorani, Vahid Ahmadi, Yu, Cheng, Kumar, Anurag, Tan, Ke, Xu, Buye, Wang, DeLiang

arXiv.org Artificial IntelligenceJun-17-2024

Adding visual cues to audio-based speech separation can improve separation performance. This paper introduces AV-CrossNet, an audiovisual (AV) system for speech enhancement, target speaker extraction, and multi-talker speaker separation. AV-CrossNet is extended from the CrossNet architecture, which is a recently proposed network that performs complex spectral mapping for speech separation by leveraging global attention and positional encoding. To effectively utilize visual cues, the proposed system incorporates pre-extracted visual embeddings and employs a visual encoder comprising temporal convolutional layers. Audio and visual features are fused in an early fusion layer before feeding to AV-CrossNet blocks. We evaluate AV-CrossNet on multiple datasets, including LRS, VoxCeleb, and COG-MHEAR challenge. Evaluation results demonstrate that AV-CrossNet advances the state-of-the-art performance in all audiovisual tasks, even on untrained and mismatched datasets.

artificial intelligence, deep learning, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2406.11619

Country: North America > United States > Ohio (0.14)

Genre: Research Report > New Finding (0.34)

Industry: Health & Medicine (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Add feedback

Few Shot Class Incremental Learning using Vision-Language models

Kumar, Anurag, Bharti, Chinmay, Dutta, Saikat, Karanam, Srikrishna, Banerjee, Biplab

arXiv.org Artificial IntelligenceMay-2-2024

Recent advancements in deep learning have demonstrated remarkable performance comparable to human capabilities across various supervised computer vision tasks. However, the prevalent assumption of having an extensive pool of training data encompassing all classes prior to model training often diverges from real-world scenarios, where limited data availability for novel classes is the norm. The challenge emerges in seamlessly integrating new classes with few samples into the training data, demanding the model to adeptly accommodate these additions without compromising its performance on base classes. To address this exigency, the research community has introduced several solutions under the realm of few-shot class incremental learning (FSCIL). In this study, we introduce an innovative FSCIL framework that utilizes language regularizer and subspace regularizer. During base training, the language regularizer helps incorporate semantic information extracted from a Vision-Language model. The subspace regularizer helps in facilitating the model's acquisition of nuanced connections between image and text semantics inherent to base classes during incremental training. Our proposed framework not only empowers the model to embrace novel classes with limited data, but also ensures the preservation of performance on base classes. To substantiate the efficacy of our approach, we conduct comprehensive experiments on three distinct FSCIL benchmarks, where our framework attains state-of-the-art performance.

machine learning, natural language, pattern recognition, (20 more...)

arXiv.org Artificial Intelligence

2405.0104

Country:

Asia > India (0.47)
Asia > Middle East > Qatar (0.14)

Genre: Research Report (0.84)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)

Add feedback

A Closer Look at Wav2Vec2 Embeddings for On-Device Single-Channel Speech Enhancement

Shankar, Ravi, Tan, Ke, Xu, Buye, Kumar, Anurag

arXiv.org Artificial IntelligenceMar-2-2024

Self-supervised learned models have been found to be very effective for certain speech tasks such as automatic speech recognition, speaker identification, keyword spotting and others. While the features are undeniably useful in speech recognition and associated tasks, their utility in speech enhancement systems is yet to be firmly established, and perhaps not properly understood. In this paper, we investigate the uses of SSL representations for single-channel speech enhancement in challenging conditions and find that they add very little value for the enhancement task. Our constraints are designed around on-device real-time speech enhancement -- model is causal, the compute footprint is small. Additionally, we focus on low SNR conditions where such models struggle to provide good enhancement. In order to systematically examine how SSL representations impact performance of such enhancement models, we propose a variety of techniques to utilize these embeddings which include different forms of knowledge-distillation and pre-training.

artificial intelligence, enhancement, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2403.01369

Country: North America > United States (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.30)

Add feedback

A Low-Delay MAC for IoT Applications: Decentralized Optimal Scheduling of Queues without Explicit State Information Sharing

Mohan, Avinash, Chattopadhyay, Arpan, Vatsa, Shivam Vinayak, Kumar, Anurag

arXiv.org Artificial IntelligenceJun-20-2023

We consider a system of several collocated nodes sharing a time slotted wireless channel, and seek a MAC (medium access control) that (i) provides low mean delay, (ii) has distributed control (i.e., there is no central scheduler), and (iii) does not require explicit exchange of state information or control signals. The design of such MAC protocols must keep in mind the need for contention access at light traffic, and scheduled access in heavy traffic, leading to the long-standing interest in hybrid, adaptive MACs. Working in the discrete time setting, for the distributed MAC design, we consider a practical information structure where each node has local information and some common information obtained from overhearing. In this setting, "ZMAC" is an existing protocol that is hybrid and adaptive. We approach the problem via two steps (1) We show that it is sufficient for the policy to be "greedy" and "exhaustive". Limiting the policy to this class reduces the problem to obtaining a queue switching policy at queue emptiness instants. (2) Formulating the delay optimal scheduling as a POMDP (partially observed Markov decision process), we show that the optimal switching rule is Stochastic Largest Queue (SLQ). Using this theory as the basis, we then develop a practical distributed scheduler, QZMAC, which is also tunable. We implement QZMAC on standard off-the-shelf TelosB motes and also use simulations to compare QZMAC with the full-knowledge centralized scheduler, and with ZMAC. We use our implementation to study the impact of false detection while overhearing the common information, and the efficiency of QZMAC. Our simulation results show that the mean delay with QZMAC is close that of the full-knowledge centralized scheduler.

artificial intelligence, machine learning, queue, (18 more...)

arXiv.org Artificial Intelligence

2105.11213

Country:

Asia (0.68)
North America > United States > Massachusetts (0.28)

Genre: Research Report > New Finding (0.65)

Industry:

Information Technology (0.66)
Telecommunications (0.46)

Technology:

Information Technology > Internet of Things (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Communications > Networks > Sensor Networks (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.86)

Add feedback

LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders

Mira, Rodrigo, Xu, Buye, Donley, Jacob, Kumar, Anurag, Petridis, Stavros, Ithapu, Vamsi Krishna, Pantic, Maja

arXiv.org Artificial IntelligenceMar-13-2023

Audio-visual speech enhancement aims to extract clean speech from a noisy environment by leveraging not only the audio itself but also the target speaker's lip movements. This approach has been shown to yield improvements over audio-only speech enhancement, particularly for the removal of interfering speech. Despite recent advances in speech synthesis, most audio-visual approaches continue to use spectral mapping/masking to reproduce the clean audio, often resulting in visual backbones added to existing speech enhancement architectures. In this work, we propose LA-VocE, a new two-stage approach that predicts mel-spectrograms from noisy audio-visual speech via a transformer-based architecture, and then converts them into waveform audio using a neural vocoder (HiFi-GAN). We train and evaluate our framework on thousands of speakers and 11+ different languages, and study our model's ability to adapt to different levels of background noise and speech interference. Our experiments show that LA-VocE outperforms existing methods according to multiple metrics, particularly under very noisy scenarios.

artificial intelligence, enhancement, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2211.10999

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

Add feedback

Leveraging Heteroscedastic Uncertainty in Learning Complex Spectral Mapping for Single-channel Speech Enhancement

Chen, Kuan-Lin, Wong, Daniel D. E., Tan, Ke, Xu, Buye, Kumar, Anurag, Ithapu, Vamsi Krishna

arXiv.org Artificial IntelligenceMar-8-2023

Most speech enhancement (SE) models learn a point estimate and do not make use of uncertainty estimation in the learning process. In this paper, we show that modeling heteroscedastic uncertainty by minimizing a multivariate Gaussian negative log-likelihood (NLL) improves SE performance at no extra cost. During training, our approach augments a model learning complex spectral mapping with a temporary submodel to predict the covariance of the enhancement error at each time-frequency bin. Due to unrestricted heteroscedastic uncertainty, the covariance introduces an undersampling effect, detrimental to SE performance. To mitigate undersampling, our approach inflates the uncertainty lower bound and weights each loss component with their uncertainty, effectively compensating severely undersampled components with more penalties. Our multivariate setting reveals common covariance assumptions such as scalar and diagonal matrices. By weakening these assumptions, we show that the NLL achieves superior performance compared to popular loss functions including the mean squared error (MSE), mean absolute error (MAE), and scale-invariant signal-to-distortion ratio (SI-SDR).

artificial intelligence, gaussian nll, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2211.08624

Country: North America > United States > California (0.14)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Add feedback

PAAPLoss: A Phonetic-Aligned Acoustic Parameter Loss for Speech Enhancement

Yang, Muqiao, Konan, Joseph, Bick, David, Zeng, Yunyang, Han, Shuo, Kumar, Anurag, Watanabe, Shinji, Raj, Bhiksha

arXiv.org Artificial IntelligenceFeb-16-2023

Despite rapid advancement in recent years, current speech enhancement models often produce speech that differs in perceptual quality from real clean speech. We propose a learning objective that formalizes differences in perceptual quality, by using domain knowledge of acoustic-phonetics. We identify temporal acoustic parameters -- such as spectral tilt, spectral flux, shimmer, etc. -- that are non-differentiable, and we develop a neural network estimator that can accurately predict their time-series values across an utterance. We also model phoneme-specific weights for each feature, as the acoustic parameters are known to show different behavior in different phonemes. We can add this criterion as an auxiliary loss to any model that produces speech, to optimize speech outputs to match the values of clean speech in these features. Experimentally we show that it improves speech enhancement workflows in both time-domain and time-frequency domain, as measured by standard evaluation metrics. We also provide an analysis of phoneme-dependent improvement on acoustic parameters, demonstrating the additional interpretability that our method provides. This analysis can suggest which features are currently the bottleneck for improvement.

acoustic parameter, artificial intelligence, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2302.08095

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Add feedback