Collaborating Authors

 Manocha, Dinesh


Human Trajectory Forecasting with Explainable Behavioral Uncertainty

arXiv.org Artificial Intelligence

Human trajectory forecasting provides many applications, e.g., social robots, self-driving vehicles, etc. (Bennewitz, Burgard, & Thrun, 2002; Thrun, Burgard, & Fox, 2005), and has therefore been studied in areas from computer science, physics, and mathematics to robotics and transportation (Bendali-Braham, Weber, Forestier, Idoumghar, & Muller, 2021). Existing research largely falls into model-free and model-based methods. Model-free methods enjoy the strong data-fitting capacity of data-driven models such as statistical machine learning models (Wang, Ondřej, & O'Sullivan, 2016b; Wang & O'Sullivan, ...). While they provide excellent prediction accuracy, their black-box nature makes it difficult for humans to interpret the learned underlying function. Comparatively, model-based methods are based on explicit systems parameterized as ordinary/partial/stochastic differentiable equations (O/P/SDEs) (Dietrich et al., 2021) or rule-based systems (Helbing & Molnár, 1995). These models are explainable but less accurate in prediction (Yue, Manocha, & Wang, 2022), as they do not benefit from training on data (or only on small amounts of data), and are therefore a better fit in the small-data regime.
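The social force model cited above (Helbing & Molnár, 1995) is the canonical example of the rule-based, explainable family. The sketch below implements a minimal version of one update step; the driving/repulsion constants and the Euler time step are illustrative assumptions, not values from any of the cited papers.

import numpy as np

def social_force_step(pos, vel, goals, dt=0.1, tau=0.5, v0=1.3, A=2.0, B=0.3):
    """One Euler step of a minimal social force model.

    pos, vel, goals: (N, 2) arrays of positions, velocities, goal points.
    tau, v0:         relaxation time and preferred speed (illustrative).
    A, B:            repulsion magnitude and range (illustrative).
    """
    # Driving force: relax toward the preferred velocity aimed at the goal.
    to_goal = goals - pos
    desired = v0 * to_goal / (np.linalg.norm(to_goal, axis=1, keepdims=True) + 1e-8)
    force = (desired - vel) / tau

    # Pairwise repulsion decaying exponentially with inter-agent distance.
    diff = pos[:, None, :] - pos[None, :, :]        # (N, N, 2) offsets
    dist = np.linalg.norm(diff, axis=-1)            # (N, N) distances
    np.fill_diagonal(dist, np.inf)                  # no self-repulsion
    rep = (A * np.exp(-dist / B) / dist)[..., None] * diff
    force += rep.sum(axis=1)

    vel = vel + dt * force
    pos = pos + dt * vel
    return pos, vel

# Example: two pedestrians walking toward swapped goals.
p = np.array([[0.0, 0.0], [4.0, 0.0]]); v = np.zeros((2, 2))
g = np.array([[4.0, 0.0], [0.0, 0.0]])
p, v = social_force_step(p, v, g)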


MMER: Multimodal Multi-task Learning for Speech Emotion Recognition

arXiv.org Artificial Intelligence

In this paper, we propose MMER, a novel Multimodal Multi-task learning approach for Speech Emotion Recognition. MMER leverages a novel multimodal network based on early-fusion and cross-modal self-attention between text and acoustic modalities and solves three novel auxiliary tasks for learning emotion recognition from spoken utterances. In practice, MMER outperforms all our baselines and achieves state-of-the-art performance on the IEMOCAP benchmark. Additionally, we conduct extensive ablation studies and results analysis to prove the effectiveness of our proposed approach.
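A central ingredient above is cross-modal self-attention between the text and acoustic streams. The sketch below shows one such block in PyTorch; the embedding size, head count, and single-block layout are assumptions for illustration, not MMER's exact architecture.

import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    """Text queries attend over acoustic keys/values (the mirrored
    direction would be built the same way)."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text, audio):
        # text: (B, T_t, dim) token features; audio: (B, T_a, dim) frame features
        fused, _ = self.attn(query=text, key=audio, value=audio)
        return self.norm(text + fused)   # residual + norm, transformer-style

text = torch.randn(2, 20, 256)
audio = torch.randn(2, 100, 256)
out = CrossModalBlock()(text, audio)     # (2, 20, 256)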


ACLM: A Selective-Denoising based Generative Data Augmentation Approach for Low-Resource Complex NER

arXiv.org Artificial Intelligence

Complex Named Entity Recognition (NER) is the task of detecting linguistically complex named entities in low-context text. In this paper, we present ACLM (Attention-map aware keyword selection for Conditional Language Model fine-tuning), a novel data augmentation approach based on conditional generation to address the data scarcity problem in low-resource complex NER. ACLM alleviates the context-entity mismatch issue, a problem existing NER data augmentation techniques suffer from, which often makes them generate incoherent augmentations by placing complex named entities in the wrong context. ACLM builds on BART and is optimized on a novel text reconstruction or denoising task - we use selective masking (aided by attention maps) to retain the named entities and certain keywords in the input sentence that provide contextually relevant additional knowledge or hints about the named entities. Compared with other data augmentation strategies, ACLM can generate more diverse and coherent augmentations that preserve the true word sense of complex entities in the sentence. We demonstrate the effectiveness of ACLM both qualitatively and quantitatively on monolingual, cross-lingual, and multilingual complex NER across various low-resource settings. ACLM outperforms all our neural baselines by a significant margin (1%-36%). In addition, we demonstrate the application of ACLM to other domains that suffer from data scarcity (e.g., biomedical). In practice, ACLM generates more effective and factual augmentations for these domains than prior methods. Code: https://github.com/Sreyan88/ACLM
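The selective-masking step can be pictured in a few lines of code: rank tokens by a saliency score taken from attention maps, then keep named entities plus the top-scoring keywords and mask the rest before BART reconstructs the sentence. Everything below (the keep ratio, the flat per-token score) is an illustrative simplification of ACLM's procedure.

def selective_mask(tokens, scores, entity_flags, keep_ratio=0.3, mask_token="<mask>"):
    """Mask everything except named entities and the most salient keywords.

    tokens:       list of word strings
    scores:       per-token saliency, e.g. averaged encoder attention weights
    entity_flags: True where the token belongs to a named entity
    """
    n_keep = max(1, int(keep_ratio * len(tokens)))
    keywords = set(sorted(range(len(tokens)), key=lambda i: scores[i],
                          reverse=True)[:n_keep])
    return [t if entity_flags[i] or i in keywords else mask_token
            for i, t in enumerate(tokens)]

print(selective_mask(
    ["EGFR", "mutations", "were", "observed", "in", "lung", "tumors"],
    [0.9, 0.6, 0.1, 0.2, 0.05, 0.7, 0.4],
    [True, False, False, False, False, False, False]))
# ['EGFR', '<mask>', '<mask>', '<mask>', '<mask>', 'lung', '<mask>']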


SLICER: Learning universal audio representations using low-resource self-supervised pre-training

arXiv.org Artificial Intelligence

We present a new Self-Supervised Learning (SSL) approach to pre-train encoders on unlabeled audio data that reduces the need for large amounts of labeled data for audio and speech classification. Our primary aim is to learn audio representations that can generalize across a large variety of speech and non-speech tasks in a low-resource, unlabeled audio pre-training setting. Inspired by the recent success of clustering and contrastive learning paradigms for SSL-based speech representation learning, we propose SLICER (Symmetrical Learning of Instance and Cluster-level Efficient Representations), which brings together the best of both paradigms. We use a symmetric loss between latent representations from student and teacher encoders and simultaneously solve instance- and cluster-level contrastive learning tasks. We obtain cluster representations online by just projecting the input spectrogram into an output subspace with dimensions equal to the number of clusters. In addition, we propose a novel mel-spectrogram augmentation procedure, k-mix, based on mixup, which does not require labels and aids unsupervised representation learning for audio. Overall, SLICER achieves state-of-the-art results on the LAPE Benchmark, significantly outperforming DeLoRes-M and other prior approaches, which are pre-trained on 10x more unsupervised data. We will make all our codes available on GitHub.
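To make the k-mix idea concrete, the sketch below applies label-free mixup where each spectrogram is blended with randomly drawn batch partners; the parameter names and the repeated-blending scheme are assumptions for illustration, not necessarily the paper's exact formulation.

import numpy as np

def k_mix(batch, k=2, alpha=0.4, seed=0):
    """Label-free mixup: blend each mel-spectrogram with k random batch partners.

    batch: (B, F, T) array of mel-spectrograms.
    """
    rng = np.random.default_rng(seed)
    mixed = batch.copy()
    for _ in range(k):
        lam = rng.beta(alpha, alpha)                  # mixup coefficient
        partners = batch[rng.permutation(len(batch))] # random pairing
        mixed = lam * mixed + (1.0 - lam) * partners
    return mixed

augmented = k_mix(np.random.rand(8, 64, 100))         # same shape as the input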


MAST: Multiscale Audio Spectrogram Transformers

arXiv.org Artificial Intelligence

We present Multiscale Audio Spectrogram Transformer (MAST) for audio classification, which brings the concept of multiscale feature hierarchies to the Audio Spectrogram Transformer (AST) [1]. Given an input audio spectrogram, we first patchify and project it into an initial temporal resolution and embedding dimension, post which the multiple stages in MAST progressively expand the embedding dimension while reducing the temporal resolution of the input. MAST significantly outperforms AST by an average accuracy of 3.4% across 8 speech and non-speech tasks from the LAPE Benchmark [2], achieving state-of-the-art results on keyword spotting in Speech Commands. Additionally, our proposed SS-MAST achieves an absolute average improvement of 2.6% over the previously proposed SSAST [3]. Natural signals such as speech and audio are hierarchically structured across various different timescales, spanning tens (e.g., phonemes) to hundreds (e.g., words) of milliseconds. To confirm our hypothesis on hierarchically structured natural signals, we highlight a key architectural design choice common across the best performing CNN-based architectures for audio classification in the literature. This design choice for pure-CNN models allows them to hierarchically learn simple low-level acoustic features in the lower stages, aided by high temporal and low embedding dimensions, and complex high-level acoustic features in the higher stages, aided by low temporal and high embedding dimensions.
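The stage structure described above (expand the embedding dimension, reduce the temporal resolution) can be sketched as follows; MAST follows the multiscale design of vision transformers, so the strided-convolution downsampling here is a simplifying stand-in for the paper's mechanism, and all dimensions are assumptions.

import torch
import torch.nn as nn

class MultiscaleStage(nn.Module):
    """One multiscale stage: halve the temporal resolution, double the
    embedding dimension, then apply self-attention at the new scale."""
    def __init__(self, dim):
        super().__init__()
        self.pool = nn.Conv1d(dim, 2 * dim, kernel_size=3, stride=2, padding=1)
        self.block = nn.TransformerEncoderLayer(d_model=2 * dim, nhead=4,
                                                batch_first=True)

    def forward(self, x):                             # x: (B, T, dim)
        x = self.pool(x.transpose(1, 2)).transpose(1, 2)  # (B, T/2, 2*dim)
        return self.block(x)

tokens = torch.randn(2, 128, 96)      # patchified spectrogram tokens
out = MultiscaleStage(96)(tokens)     # (2, 64, 192)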


UNFUSED: UNsupervised Finetuning Using SElf supervised Distillation

arXiv.org Artificial Intelligence

In this paper, we introduce UnFuSeD, a novel approach to leverage self-supervised learning and reduce the need for large amounts of labeled data for audio classification. Unlike prior works, which directly fine-tune a self-supervised pre-trained encoder on a target dataset, we use the encoder to generate pseudo-labels for unsupervised fine-tuning before the actual fine-tuning step. We first train an encoder using a novel self-supervised learning (SSL) algorithm on an unlabeled audio dataset. Then, we use that encoder to generate pseudo-labels on our target task dataset by clustering the extracted representations. These pseudo-labels are then used to guide self-distillation on a randomly initialized model, which we call unsupervised fine-tuning. Finally, the resultant encoder is fine-tuned on our target task dataset. Through UnFuSeD, we propose the first system that moves away from the generic SSL paradigm in the literature, which pre-trains and fine-tunes the same encoder, and present a novel self-distillation-based system to leverage SSL pre-training for low-resource audio classification. In practice, UnFuSeD achieves state-of-the-art results on the LAPE Benchmark, significantly outperforming all our baselines. Additionally, UnFuSeD allows us to achieve this with a 40% reduction in the number of parameters over the previous state-of-the-art system. We make all our codes publicly available.
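The pseudo-labeling step can be sketched in a few lines: cluster frozen-encoder features, then train a fresh model against the cluster assignments. The cluster count, optimizer, and the use of precomputed features as student input are illustrative assumptions, not UnFuSeD's actual configuration.

import torch
import torch.nn as nn
from sklearn.cluster import KMeans

torch.manual_seed(0)
features = torch.randn(512, 64)    # stand-in for frozen SSL-encoder outputs

# 1) Pseudo-labels from clustering the extracted representations.
pseudo = KMeans(n_clusters=10, n_init=10).fit_predict(features.numpy())
pseudo = torch.as_tensor(pseudo, dtype=torch.long)

# 2) Unsupervised fine-tuning: train a randomly initialized student
#    against the pseudo-labels (a stand-in for the self-distillation step).
student = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
for _ in range(5):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(student(features), pseudo)
    loss.backward()
    opt.step()
# 3) `student` would then be fine-tuned on the labeled target task.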


BioAug: Conditional Generation based Data Augmentation for Low-Resource Biomedical NER

arXiv.org Artificial Intelligence

Biomedical Named Entity Recognition (BioNER) is the fundamental task of identifying named entities in biomedical text. However, BioNER suffers from severe data scarcity and lacks high-quality labeled data due to the highly specialized expert knowledge required for annotation. Though data augmentation has been shown to be highly effective for low-resource NER in general, existing data augmentation techniques fail to produce factual and diverse augmentations for BioNER. In this paper, we present BioAug, a novel data augmentation framework for low-resource BioNER. BioAug, built on BART, is trained to solve a novel text reconstruction task based on selective masking and knowledge augmentation. After training, we perform conditional generation and produce diverse augmentations by conditioning BioAug on selectively corrupted text, similar to the training stage. We demonstrate the effectiveness of BioAug on 5 benchmark BioNER datasets and show that BioAug outperforms all our baselines by a significant margin (1.5%-21.5% absolute improvement) and is able to generate augmentations that are both more factual and more diverse. Code: https://github.com/Sreyan88/BioAug.
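Since BioAug builds on BART, the generation stage can be sketched with an off-the-shelf checkpoint: feed a selectively corrupted sentence and sample reconstructions as candidate augmentations. The vanilla facebook/bart-base model below is only a stand-in (BioAug is first trained on its selective-masking and knowledge-augmentation reconstruction task), and the sampling settings are assumptions.

from transformers import AutoTokenizer, BartForConditionalGeneration

tok = AutoTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# A selectively corrupted sentence: entities/keywords kept, the rest masked.
corrupted = "Treatment with <mask> erlotinib <mask> EGFR expression <mask>."
ids = tok(corrupted, return_tensors="pt").input_ids
outputs = model.generate(ids, do_sample=True, top_p=0.9, max_length=64,
                         num_return_sequences=3)
for seq in outputs:
    print(tok.decode(seq, skip_special_tokens=True))  # candidate augmentations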


GrASPE: Graph based Multimodal Fusion for Robot Navigation in Unstructured Outdoor Environments

arXiv.org Artificial Intelligence

We present a novel trajectory traversability estimation and planning algorithm for robot navigation in complex outdoor environments. We incorporate multimodal sensory inputs from an RGB camera, 3D LiDAR, and the robot's odometry sensor to train a prediction model that estimates candidate trajectories' success probabilities based on partially reliable multi-modal sensor observations. We encode high-dimensional multi-modal sensory inputs into low-dimensional feature vectors using encoder networks and represent them as a connected graph. The graph is then used to train an attention-based Graph Neural Network (GNN) to predict trajectory success probabilities. We further analyze the number of features in the image (corners) and point cloud data (edges and planes) separately to quantify their reliability, which augments the weights of the feature graph representation used in our GNN. During runtime, our model utilizes multi-sensor inputs to predict the success probabilities of the trajectories generated by a local planner, avoiding potential collisions and failures. Our algorithm demonstrates robust predictions when one or more sensor modalities are unreliable or unavailable in complex outdoor environments. We evaluate our algorithm's navigation performance using a Spot robot in real-world outdoor environments and observe a 10-30% increase in navigation success rate and a 13-15% decrease in false positive estimations compared to state-of-the-art navigation methods.
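One way to picture the reliability-weighted fusion is a small attention module whose scores are biased by per-modality reliability, as sketched below; the graph construction, dimensions, and the log-bias trick are illustrative assumptions rather than GrASPE's actual GNN.

import torch
import torch.nn as nn

class GraphAttentionFusion(nn.Module):
    """Fuse per-modality feature nodes with reliability-weighted attention,
    then predict a trajectory success probability."""
    def __init__(self, dim=64):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))
        self.head = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, nodes, reliability):
        # nodes: (B, N, dim) modality embeddings; reliability: (B, N) in [0, 1]
        scores = self.q(nodes) @ self.k(nodes).transpose(1, 2) / nodes.shape[-1] ** 0.5
        scores = scores + torch.log(reliability + 1e-6).unsqueeze(1)  # downweight unreliable nodes
        fused = torch.softmax(scores, dim=-1) @ self.v(nodes)
        return self.head(fused.mean(dim=1))           # (B, 1) success probability

nodes = torch.randn(4, 3, 64)                # RGB, LiDAR, odometry embeddings
rel = torch.tensor([[1.0, 0.2, 1.0]] * 4)    # e.g. degraded LiDAR
p = GraphAttentionFusion()(nodes, rel)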


Posterior Coreset Construction with Kernelized Stein Discrepancy for Model-Based Reinforcement Learning

arXiv.org Artificial Intelligence

Model-based approaches to reinforcement learning (MBRL) exhibit favorable performance in practice, but their theoretical guarantees in large spaces are mostly restricted to the setting where the transition model is Gaussian or Lipschitz, and demand a posterior estimate whose representational complexity grows unbounded with time. In this work, we develop a novel MBRL method which (i) relaxes the assumptions on the target transition model to belong to a generic family of mixture models; (ii) is applicable to large-scale training by incorporating a compression step such that the posterior estimate consists of a Bayesian coreset of only statistically significant past state-action pairs; and (iii) exhibits sublinear Bayesian regret. To achieve these results, we adopt an approach based upon Stein's method, which, under a smoothness condition on the constructed posterior and target, allows the distributional distance to be evaluated in closed form as the kernelized Stein discrepancy (KSD). The aforementioned compression step is then computed by greedily retaining only those samples which are more than a certain KSD away from the previous model estimate. Experimentally, we observe that this approach is competitive with several state-of-the-art RL methodologies, and can achieve up to a 50% reduction in wall clock time in some continuous control environments.
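The KSD machinery above is concrete enough to sketch: for an RBF base kernel the Stein kernel has a closed form, and compression reduces to greedily keeping samples whose discrepancy from the retained set exceeds a threshold. The threshold, bandwidth, and acceptance rule below are illustrative assumptions; the paper's exact criterion differs.

import numpy as np

def stein_kernel(x, y, sx, sy, h=1.0):
    """Closed-form Stein kernel u_p(x, y) for an RBF base kernel.

    sx, sy: score function grad log p evaluated at x and y.
    """
    diff = x - y
    sq = diff @ diff
    k = np.exp(-sq / (2 * h ** 2))
    gx, gy = -diff / h ** 2 * k, diff / h ** 2 * k     # grad_x k, grad_y k
    trace = k * (x.size / h ** 2 - sq / h ** 4)        # tr(grad_x grad_y k)
    return sx @ sy * k + sx @ gy + sy @ gx + trace

def greedy_coreset(samples, score, eps=0.05, h=1.0):
    """Keep a sample only if its mean Stein-kernel value against the points
    already retained exceeds eps (a sketch of the compression rule)."""
    kept = [samples[0]]
    for s in samples[1:]:
        disc = np.mean([stein_kernel(s, c, score(s), score(c), h) for c in kept])
        if abs(disc) > eps:
            kept.append(s)
    return np.asarray(kept)

# Standard-normal target, whose score is simply -x.
rng = np.random.default_rng(0)
coreset = greedy_coreset(rng.normal(size=(200, 2)), score=lambda x: -x)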


Listen2Scene: Interactive material-aware binaural sound propagation for reconstructed 3D scenes

arXiv.org Artificial Intelligence

We present an end-to-end binaural audio rendering approach (Listen2Scene) for virtual reality (VR) and augmented reality (AR) applications. We propose a novel neural-network-based binaural sound propagation method to generate acoustic effects for 3D models of real environments. Any clean or dry audio can be convolved with the generated acoustic effects to render audio corresponding to the real environment. We propose a graph neural network that uses both the material and the topology information of the 3D scene and generates a scene latent vector. Moreover, we use a conditional generative adversarial network (CGAN) to generate acoustic effects from the scene latent vector. Our network is able to handle holes or other artifacts in the reconstructed 3D mesh model. We present an efficient cost function for the generator network to incorporate spatial audio effects. Given the source and the listener positions, our learning-based binaural sound propagation approach can generate an acoustic effect in 0.1 milliseconds on an NVIDIA GeForce RTX 2080 Ti GPU and can easily handle multiple sources. We have evaluated the accuracy of our approach against binaural acoustic effects generated using an interactive geometric sound propagation algorithm and against captured real acoustic effects. We also performed a perceptual evaluation and observed that the audio rendered by our approach is more plausible than audio rendered using prior learning-based sound propagation algorithms.
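The rendering step described above (convolving clean audio with the generated acoustic effects) reduces to one convolution per ear, sketched below with a toy two-tap impulse response standing in for a model-predicted binaural acoustic effect.

import numpy as np
from scipy.signal import fftconvolve

def render_binaural(dry, brir):
    """Convolve mono dry audio with a 2-channel binaural impulse response.

    dry:  (n,) clean signal; brir: (m, 2) left/right impulse responses,
    standing in for the acoustic effects the model predicts.
    """
    wet = np.stack([fftconvolve(dry, brir[:, 0]),
                    fftconvolve(dry, brir[:, 1])], axis=1)
    return wet / (np.max(np.abs(wet)) + 1e-9)   # normalize to avoid clipping

sr = 16000
dry = np.random.randn(sr)               # 1 s of stand-in dry audio
brir = np.zeros((sr // 2, 2))
brir[0], brir[4000] = 1.0, 0.4          # toy direct path plus one echo
wet = render_binaural(dry, brir)        # (n + m - 1, 2) binaural output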