conditioning vector
CogniPlan: Uncertainty-Guided Path Planning with Conditional Generative Layout Prediction
Wang, Yizhuo, He, Haodong, Liang, Jingsong, Cao, Yuhong, Chakraborty, Ritabrata, Sartoretti, Guillaume
Path planning in unknown environments is a crucial yet inherently challenging capability for mobile robots, which primarily encompasses two coupled tasks: autonomous exploration and point-goal navigation. In both cases, the robot must perceive the environment, update its belief, and accurately estimate potential information gain on-the-fly to guide planning. In this work, we propose CogniPlan, a novel path planning framework that leverages multiple plausible layouts predicted by a COnditional GeNerative Inpainting model, mirroring how humans rely on cognitive maps during navigation. These predictions, based on the partially observed map and a set of layout conditioning vectors, enable our planner to reason effectively under uncertainty. We demonstrate strong synergy between generative image-based layout prediction and graph-attention-based path planning, allowing CogniPlan to combine the scalability of graph representations with the fidelity and predictiveness of occupancy maps, yielding notable performance gains in both exploration and navigation. We extensively evaluate CogniPlan on two datasets (hundreds of maps and realistic floor plans), consistently outperforming state-of-the-art planners. We further deploy it in a high-fidelity simulator and on hardware, showcasing its high-quality path planning and real-world applicability.
Reviews: Robust Imitation of Diverse Behaviors
The paper proposes a deep-learning-based approach to imitation learning which is sample-efficient and is able to imitate many diverse behaviors. The architecture can be seen as conditional generative adversarial imitation learning (GAIL). The conditioning vector is an embedding of a demonstrated trajectory, provided by a variational autoencoder. This results in one-shot imitation learning: at test time, a new demonstration can be embedded and provided as a conditioning vector to the imitation policy. The authors evaluate the method on several simulated motor control tasks.
Learning to Generate Context-Sensitive Backchannel Smiles for Embodied AI Agents with Applications in Mental Health Dialogues
Bilalpur, Maneesh, Inan, Mert, Zeinali, Dorsa, Cohn, Jeffrey F., Alikhani, Malihe
Addressing the critical shortage of mental health resources for effective screening, diagnosis, and treatment remains a significant challenge. This scarcity underscores the need for innovative solutions, particularly in enhancing the accessibility and efficacy of therapeutic support. Embodied agents with advanced interactive capabilities emerge as a promising and cost-effective supplement to traditional caregiving methods. Crucial to these agents' effectiveness is their ability to simulate non-verbal behaviors, like backchannels, that are pivotal in establishing rapport and understanding in therapeutic contexts but remain under-explored. To improve the rapport-building capabilities of embodied agents we annotated backchannel smiles in videos of intimate face-to-face conversations over topics such as mental health, illness, and relationships. We hypothesized that both speaker and listener behaviors affect the duration and intensity of backchannel smiles. Using cues from speech prosody and language along with the demographics of the speaker and listener, we found them to contain significant predictors of the intensity of backchannel smiles. Based on our findings, we introduce backchannel smile production in embodied agents as a generation problem. Our attention-based generative model suggests that listener information offers performance improvements over the baseline speaker-centric generation approach. Conditioned generation using the significant predictors of smile intensity provides statistically significant improvements in empirical measures of generation quality. Our user study by transferring generated smiles to an embodied agent suggests that agent with backchannel smiles is perceived to be more human-like and is an attractive alternative for non-personal conversations over agent without backchannel smiles.
Image retrieval outperforms diffusion models on data augmentation
Burg, Max F., Wenzel, Florian, Zietlow, Dominik, Horn, Max, Makansi, Osama, Locatello, Francesco, Russell, Chris
Many approaches have been proposed to use diffusion models to augment training datasets for downstream tasks, such as classification. However, diffusion models are themselves trained on large datasets, often with noisy annotations, and it remains an open question to which extent these models contribute to downstream classification performance. In particular, it remains unclear if they generalize enough to improve over directly using the additional data of their pre-training process for augmentation. We systematically evaluate a range of existing methods to generate images from diffusion models and study new extensions to assess their benefit for data augmentation. Personalizing diffusion models towards the target data outperforms simpler prompting strategies. However, using the pre-training data of the diffusion model alone, via a simple nearest-neighbor retrieval procedure, leads to even stronger downstream performance. Our study explores the potential of diffusion models in generating new training data, and surprisingly finds that these sophisticated models are not yet able to beat a simple and strong image retrieval baseline on simple downstream vision tasks.
Adversarial synthesis based data-augmentation for code-switched spoken language identification
Shastri, Parth, Patil, Chirag, Wanere, Poorval, Mahajan, Shrinivas, Bhatt, Abhishek, Sailor, Hardik
Spoken Language Identification (LID) is an important sub-task of Automatic Speech Recognition(ASR) that is used to classify the language(s) in an audio segment. Automatic LID plays an useful role in multilingual countries. In various countries, identifying a language becomes hard, due to the multilingual scenario where two or more than two languages are mixed together during conversation. Such phenomenon of speech is called as code-mixing or code-switching. This nature is followed not only in India but also in many Asian countries. Such code-mixed data is hard to find, which further reduces the capabilities of the spoken LID. Hence, this work primarily addresses this problem using data augmentation as a solution on the on the data scarcity of the code-switched class. This study focuses on Indic language code-mixed with English. Spoken LID is performed on Hindi, code-mixed with English. This research proposes Generative Adversarial Network (GAN) based data augmentation technique performed using Mel spectrograms for audio data. GANs have already been proven to be accurate in representing the real data distribution in the image domain. Proposed research exploits these capabilities of GANs in speech domains such as speech classification, automatic speech recognition, etc. GANs are trained to generate Mel spectrograms of the minority code-mixed class which are then used to augment data for the classifier. Utilizing GANs give an overall improvement on Unweighted Average Recall by an amount of 3.5% as compared to a Convolutional Recurrent Neural Network (CRNN) classifier used as the baseline reference.
Generative Modeling of Complex Data
Canale, Luca, Grislain, Nicolas, Lothe, Grégoire, Leduc, Johan
In recent years, several models have improved the capacity to generate synthetic tabular datasets. However, such models focus on synthesizing simple columnar tables and are not useable on real-life data with complex structures. This paper puts forward a generic framework to synthesize more complex data structures with composite and nested types. It then proposes one practical implementation, built with causal transformers, for struct (mappings of types) and lists (repeated instances of a type). The results on standard benchmark datasets show that such implementation consistently outperforms current state-of-the-art models both in terms of machine learning utility and statistical similarity. Moreover, it shows very strong results on two complex hierarchical datasets with multiple nesting and sparse data, that were previously out of reach.
Handling Variable-Dimensional Time Series with Graph Neural Networks
Gupta, Vibhor, Narwariya, Jyoti, Malhotra, Pankaj, Vig, Lovekesh, Shroff, Gautam
Several applications of Internet of Things (IoT) technology involve capturing data from multiple sensors resulting in multi-sensor time series. Existing neural networks based approaches for such multi-sensor or multivariate time series modeling assume fixed input dimension or number of sensors. Such approaches can struggle in the practical setting where different instances of the same device or equipment such as mobiles, wearables, engines, etc. come with different combinations of installed sensors. We consider training neural network models from such multi-sensor time series, where the time series have varying input dimensionality owing to availability or installation of a different subset of sensors at each source of time series. We propose a novel neural network architecture suitable for zero-shot transfer learning allowing robust inference for multivariate time series with previously unseen combination of available dimensions or sensors at test time. Such a combinatorial generalization is achieved by conditioning the layers of a core neural network-based time series model with a "conditioning vector" that carries information of the available combination of sensors for each time series. This conditioning vector is obtained by summarizing the set of learned "sensor embedding vectors" corresponding to the available sensors in a time series via a graph neural network. We evaluate the proposed approach on publicly available activity recognition and equipment prognostics datasets, and show that the proposed approach allows for better generalization in comparison to a deep gated recurrent neural network baseline.
Deep Context-Aware Novelty Detection
Rushe, Ellen, Mac Namee, Brian
A common assumption of novelty detection is that the distribution of both "normal" and "novel" data are static. However, this is often not the case in scenarios where data evolves over time, or when the definition of normal and novel depends on contextual information, leading to changes in these distributions. This can lead to significant difficulties when attempting to train a model on datasets where the distribution of normal data in one scenario is similar to that of novel data in another scenario. In this paper we propose a context-aware approach to novelty detection for deep autoencoders. We create a semi-supervised network architecture which utilises auxiliary labels in order to reveal contextual information and allows the model to adapt to a variety of normal and novel scenarios. We evaluate our approach on both synthetic image data and real world audio data displaying these characteristics.
GANGogh: Creating Art with GANs – Towards Data Science – Medium
The work here presented is the result of a semester long independent research performed by Kenny Jones and Derrick Bonafilia (both Williams College 2017) under the guidance of Professor Andrea Danyluk. Kenny and Derrick are both heading to Facebook next year as Software Engineers and hope to continue studying GANs in whatever capacity is available to them. Generative Adversarial Networks (GANS) were introduced by Ian Goodfellow et. GANs address the lack of relative success of deep generative models compared to deep discriminative models. The authors cite the intractable nature of the maximum likelihood estimation that is necessary for most generative models as the reason for this discrepancy.
Conditional Generation and Snapshot Learning in Neural Dialogue Systems
Wen, Tsung-Hsien, Gasic, Milica, Mrksic, Nikola, Rojas-Barahona, Lina M., Su, Pei-Hao, Ultes, Stefan, Vandyke, David, Young, Steve
Recently a variety of LSTM-based conditional language models (LM) have been applied across a range of language generation tasks. In this work we study various model architectures and different ways to represent and aggregate the source information in an end-to-end neural dialogue system framework. A method called snapshot learning is also proposed to facilitate learning from supervised sequential signals by applying a companion cross-entropy objective function to the conditioning vector. The experimental and analytical results demonstrate firstly that competition occurs between the conditioning vector and the LM, and the differing architectures provide different trade-offs between the two. Secondly, the discriminative power and transparency of the conditioning vector is key to providing both model interpretability and better performance. Thirdly, snapshot learning leads to consistent performance improvements independent of which architecture is used.