
Collaborating Authors

 Al-Halah, Ziad


A Domain-Agnostic Approach for Characterization of Lifelong Learning Systems

arXiv.org Artificial Intelligence

Despite the advancement of machine learning techniques in recent years, state-of-the-art systems lack robustness to "real world" events, where the input distributions and tasks encountered by the deployed systems will not be limited to the original training context, and systems will instead need to adapt to novel distributions and tasks while deployed. This critical gap may be addressed through the development of "Lifelong Learning" systems that are capable of 1) Continuous Learning, 2) Transfer and Adaptation, and 3) Scalability. Unfortunately, efforts to improve these capabilities are typically treated as distinct areas of research that are assessed independently, without regard to the impact of each separate capability on other aspects of the system. We instead propose a holistic approach, using a suite of metrics and an evaluation framework to assess Lifelong Learning in a principled way that is agnostic to specific domains or system techniques. Through five case studies, we show that this suite of metrics can inform the development of varied and complex Lifelong Learning systems. We highlight how the proposed suite of metrics quantifies performance trade-offs present during Lifelong Learning system development - both the widely discussed Stability-Plasticity dilemma and the newly proposed relationship between Sample Efficient and Robust Learning. Further, we make recommendations for the formulation and use of metrics to guide the continuing development of Lifelong Learning systems and assess their progress in the future.
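Although the paper's exact metric definitions are not reproduced here, the flavor of such a domain-agnostic metric suite can be illustrated with a small computation over per-task performance curves. The sketch below is a simplified, hypothetical example: the quantities `maintenance`, `forward_transfer`, and `relative_performance` are rough analogues of commonly used lifelong-learning measurements, not the suite proposed in the paper.

```python
# Minimal sketch (assumed, simplified definitions; not the paper's exact metrics):
# given per-task evaluation scores recorded after each training block, compute
# rough analogues of performance maintenance, forward transfer, and relative performance.
import numpy as np

# scores[i, j] = evaluation score on task j after finishing training block i (toy data)
scores = np.array([
    [0.80, 0.10, 0.05],   # after training on task 0
    [0.72, 0.85, 0.10],   # after training on task 1
    [0.70, 0.78, 0.90],   # after training on task 2
])
single_task_expert = np.array([0.82, 0.88, 0.91])  # hypothetical independent baselines

n_tasks = scores.shape[1]

# Performance maintenance: how much earlier tasks degrade after later training (stability)
maintenance = np.mean([scores[-1, j] - scores[j, j] for j in range(n_tasks - 1)])

# Forward transfer: score on a task *before* it has been trained on
forward_transfer = np.mean([scores[j - 1, j] for j in range(1, n_tasks)])

# Relative performance: final lifelong scores vs. the single-task experts
relative_performance = np.mean(scores[-1] / single_task_expert)

print(f"maintenance={maintenance:.3f}, "
      f"forward_transfer={forward_transfer:.3f}, "
      f"relative_performance={relative_performance:.3f}")
```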


Few-Shot Audio-Visual Learning of Environment Acoustics

arXiv.org Artificial Intelligence

Room impulse response (RIR) functions capture how the surrounding physical environment transforms the sounds heard by a listener, with implications for various applications in AR, VR, and robotics. Whereas traditional methods to estimate RIRs assume dense geometry and/or sound measurements throughout the environment, we explore how to infer RIRs based on a sparse set of images and echoes observed in the space. Towards that goal, we introduce a transformer-based method that uses self-attention to build a rich acoustic context, then predicts RIRs of arbitrary query source-receiver locations through cross-attention. Additionally, we design a novel training objective that improves the match in the acoustic signature between the RIR predictions and the targets. In experiments using a state-of-the-art audio-visual simulator for 3D environments, we demonstrate that our method successfully generates arbitrary RIRs, outperforming state-of-the-art methods and -- in a major departure from traditional methods -- generalizing to novel environments in a few-shot manner. Project: http://vision.cs.utexas.edu/projects/fs_rir.
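The described pipeline, self-attention over sparse audio-visual observations followed by cross-attention from query source-receiver locations, maps naturally onto standard transformer building blocks. The PyTorch sketch below illustrates only that pattern; the module layout, feature sizes, and output head are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of the self-attention context / cross-attention query pattern
# (assumed shapes, heads, and output head; not the authors' implementation).
import torch
import torch.nn as nn

class RIRPredictor(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_layers=4, rir_dim=1024):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.context_encoder = nn.TransformerEncoder(enc_layer, n_layers)  # self-attention over observations
        self.query_embed = nn.Linear(6, d_model)      # (source xyz, receiver xyz) -> query token
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Linear(d_model, rir_dim)       # predicts a flattened RIR representation

    def forward(self, obs_tokens, query_pose):
        # obs_tokens: (B, N, d_model) fused audio-visual features of the sparse observations
        # query_pose: (B, Q, 6) arbitrary source-receiver locations
        ctx = self.context_encoder(obs_tokens)        # build a rich acoustic context
        q = self.query_embed(query_pose)
        attended, _ = self.cross_attn(q, ctx, ctx)    # queries attend to the acoustic context
        return self.head(attended)                    # (B, Q, rir_dim) predicted RIRs

model = RIRPredictor()
pred = model(torch.randn(2, 8, 256), torch.randn(2, 3, 6))
print(pred.shape)  # torch.Size([2, 3, 1024])
```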


Zero Experience Required: Plug & Play Modular Transfer Learning for Semantic Visual Navigation

arXiv.org Artificial Intelligence

In reinforcement learning for visual navigation, it is common to develop a model for each new task, and train that model from scratch with task-specific interactions in 3D environments. However, this process is expensive; massive amounts of interactions are needed for the model to generalize well. Moreover, this process is repeated whenever there is a change in the task type or the goal modality. We present a unified approach to visual navigation using a novel modular transfer learning model. Our model can effectively leverage its experience from one source task and apply it to multiple target tasks (e.g., ObjectNav, RoomNav, ViewNav) with various goal modalities (e.g., image, sketch, audio, label). Furthermore, our model enables zero-shot experience learning, whereby it can solve the target tasks without receiving any task-specific interactive training. Our experiments on multiple photorealistic datasets and challenging tasks show that our approach learns faster, generalizes better, and outperforms SoTA models by a significant margin.
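The "plug & play" idea amounts to reusing one navigation policy trained on a source task and swapping in lightweight goal encoders that map each goal modality into a shared embedding space. The sketch below illustrates this decomposition under assumed interfaces; the class names and dimensions are hypothetical, not taken from the paper's code.

```python
# Minimal sketch of a modular "plug & play" navigation model: per-modality goal
# encoders project goals into a shared embedding consumed by one reusable policy.
# All class names and sizes are illustrative assumptions.
import torch
import torch.nn as nn

EMB = 128  # shared goal-embedding size

class ImageGoalEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, EMB))
    def forward(self, g): return self.net(g)

class LabelGoalEncoder(nn.Module):
    def __init__(self, n_classes=21):
        super().__init__()
        self.emb = nn.Embedding(n_classes, EMB)
    def forward(self, g): return self.emb(g)

class NavPolicy(nn.Module):
    """Source-task policy reused across target tasks without task-specific retraining."""
    def __init__(self, obs_dim=512, n_actions=4):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(obs_dim + EMB, 256), nn.ReLU(),
                                nn.Linear(256, n_actions))
    def forward(self, obs_feat, goal_emb):
        return self.fc(torch.cat([obs_feat, goal_emb], dim=-1))  # action logits

policy = NavPolicy()
encoders = {"image": ImageGoalEncoder(), "label": LabelGoalEncoder()}

obs = torch.randn(1, 512)
logits_img = policy(obs, encoders["image"](torch.randn(1, 3, 64, 64)))  # ViewNav-style goal
logits_lbl = policy(obs, encoders["label"](torch.tensor([5])))          # ObjectNav-style goal
print(logits_img.shape, logits_lbl.shape)  # torch.Size([1, 4]) torch.Size([1, 4])
```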


PONI: Potential Functions for ObjectGoal Navigation with Interaction-free Learning

arXiv.org Artificial Intelligence

State-of-the-art approaches to ObjectGoal navigation rely on reinforcement learning and typically require significant computational resources and time for learning. We propose Potential functions for ObjectGoal Navigation with Interaction-free learning (PONI), a modular approach that disentangles the skills of 'where to look?' for an object and 'how to navigate to (x, y)?'. Our key insight is that 'where to look?' can be treated purely as a perception problem, and learned without environment interactions. To address this, we propose a network that predicts two complementary potential functions conditioned on a semantic map and uses them to decide where to look for an unseen object. We train the potential function network using supervised learning on a passive dataset of top-down semantic maps, and integrate it into a modular framework to perform ObjectGoal navigation.

Prior work has made good progress on this task by formulating it as a reinforcement learning (RL) problem and developing useful representations [20, 60], auxiliary tasks [61], data augmentation techniques [37], and improved reward functions [37]. Despite this progress, end-to-end RL incurs high computational cost, has poor sample efficiency, and tends to generalize poorly to new scenes [7, 12, 37], since skills like moving without collisions, exploration, and stopping near the object are all learned from scratch purely using RL. Modular navigation methods aim to address these issues by disentangling 'where to look for an object?' and 'how to navigate to (x, y)?' [12, 36]. These methods have emerged as strong competitors to end-to-end RL with good sample efficiency, better generalization to new ...
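The goal-selection step implied by the abstract, predicting two complementary potentials over a top-down semantic map and choosing where to look from their combination, can be sketched schematically as follows. The network architecture, the weighted-sum combination, and the map sizes below are illustrative assumptions rather than the actual PONI design.

```python
# Schematic sketch of potential-function-based goal selection: a network predicts
# two complementary potentials over a top-down semantic map, and the long-term
# goal is the map cell maximizing their (assumed) weighted sum.
import torch
import torch.nn as nn

class PotentialNet(nn.Module):
    """Toy stand-in for the potential-function network (assumed architecture)."""
    def __init__(self, n_sem_channels=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(n_sem_channels, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 2, 1),   # channel 0: area potential, channel 1: object potential
        )
    def forward(self, sem_map):
        return torch.sigmoid(self.net(sem_map))

def select_goal(potentials, alpha=0.5):
    # potentials: (2, H, W); combine the two maps and take the argmax cell as the goal
    combined = alpha * potentials[0] + (1 - alpha) * potentials[1]
    H, W = combined.shape
    idx = torch.argmax(combined)
    return divmod(idx.item(), W)   # (row, col) in map coordinates

net = PotentialNet()
sem_map = torch.randn(1, 16, 96, 96)         # partial top-down semantic map (toy input)
goal_rc = select_goal(net(sem_map)[0])
print("long-term goal cell:", goal_rc)
```

Because the supervision comes from static top-down maps rather than environment interactions, such a network can be trained with ordinary supervised learning, which is the source of the "interaction-free" label.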


Audio-Visual Waypoints for Navigation

arXiv.org Artificial Intelligence

In audio-visual navigation, an agent intelligently travels through a complex, unmapped 3D environment using both sights and sounds to find a sound source (e.g., a phone ringing in another room). Existing models learn to act at a fixed granularity of agent motion and rely on simple recurrent aggregations of the audio observations. We introduce a reinforcement learning approach to audio-visual navigation with two key novel elements: 1) audio-visual waypoints that are dynamically set and learned end-to-end within the navigation policy, and 2) an acoustic memory that provides a structured, spatially grounded record of what the agent has heard as it moves. Both new ideas capitalize on the synergy of audio and visual data for revealing the geometry of an unmapped space. We demonstrate our approach on the challenging Replica environments of real-world 3D scenes. Our model improves the state of the art by a substantial margin, and our experiments reveal that learning the links between sights, sounds, and space is essential for audio-visual navigation.
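Of the two components, the acoustic memory can be pictured as an egocentric spatial grid into which the intensity of incoming audio is written at the agent's current cell as it moves. The toy sketch below illustrates such a spatially grounded record; the grid resolution, RMS-intensity feature, and update rule are assumptions for illustration, not the paper's exact design.

```python
# Toy sketch of a spatially grounded acoustic memory: a 2D grid storing the
# audio intensity heard at each visited cell (assumed design, for illustration).
import numpy as np

class AcousticMemory:
    def __init__(self, size=64):
        self.grid = np.zeros((size, size), dtype=np.float32)   # stored intensities
        self.visited = np.zeros((size, size), dtype=bool)

    def update(self, agent_cell, audio_frame):
        """Write the current audio intensity at the agent's map cell."""
        r, c = agent_cell
        intensity = float(np.sqrt(np.mean(np.square(audio_frame))))  # RMS energy
        self.grid[r, c] = intensity
        self.visited[r, c] = True

    def as_policy_input(self):
        # Stack intensity and visitation masks as extra channels for the navigation policy
        return np.stack([self.grid, self.visited.astype(np.float32)])

mem = AcousticMemory()
mem.update((32, 30), np.random.randn(2, 16000) * 0.1)  # binaural audio frame (toy data)
print(mem.as_policy_input().shape)  # (2, 64, 64)
```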