Arani, Elahe
GAIA-2: A Controllable Multi-View Generative World Model for Autonomous Driving
Russell, Lloyd, Hu, Anthony, Bertoni, Lorenzo, Fedoseev, George, Shotton, Jamie, Arani, Elahe, Corrado, Gianluca
Generative models offer a scalable and flexible paradigm for simulating complex environments, yet current approaches fall short in addressing the domain-specific requirements of autonomous driving, such as multi-agent interactions, fine-grained control, and multi-camera consistency. We introduce GAIA-2, Generative AI for Autonomy, a latent diffusion world model that unifies these capabilities within a single generative framework. GAIA-2 supports controllable video generation conditioned on a rich set of structured inputs: ego-vehicle dynamics, agent configurations, environmental factors, and road semantics. It generates high-resolution, spatiotemporally consistent multi-camera videos across geographically diverse driving environments (UK, US, Germany). The model integrates both structured conditioning and external latent embeddings (e.g., from a proprietary driving model) to facilitate flexible and semantically grounded scene synthesis. Through this integration, GAIA-2 enables scalable simulation of both common and rare driving scenarios, advancing the use of generative world models as a core tool in the development of autonomous systems. Videos are available at https://wayve.ai/thinking/gaia-2.
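To make the notion of structured conditioning concrete, here is a minimal Python sketch of what a conditioning bundle for such a world model might look like. All class and field names are hypothetical illustrations, not the GAIA-2 API.

```python
# Hypothetical sketch (not the GAIA-2 interface): a structured conditioning
# bundle combining ego dynamics, agents, environment, and external latents.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class EgoDynamics:
    speed_mps: float          # forward speed in metres per second
    curvature: float          # signed path curvature (1/m)

@dataclass
class AgentBox:
    category: str             # e.g. "car", "pedestrian"
    position_m: tuple         # (x, y, z) in the ego frame
    heading_rad: float

@dataclass
class SceneConditioning:
    ego: EgoDynamics
    agents: List[AgentBox] = field(default_factory=list)
    weather: str = "clear"            # environmental factor
    country: str = "UK"               # geography (UK / US / Germany)
    external_embedding: Optional[list] = None  # e.g. latents from a driving model

# A generation call would then consume this bundle alongside noise latents,
# e.g.: video = world_model.sample(conditioning=SceneConditioning(...), num_cameras=5)
```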
SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment
Renz, Katrin, Chen, Long, Arani, Elahe, Sinavski, Oleg
Integrating large language models (LLMs) into autonomous driving has attracted significant attention with the hope of improving generalization and explainability. However, existing methods often focus on either driving or vision-language understanding, and achieving both high driving performance and extensive language understanding remains challenging. In addition, the dominant approach to vision-language understanding is visual question answering. For autonomous driving, however, this is only useful if it is aligned with the action space; otherwise, the model's answers could be inconsistent with its behavior. Therefore, we propose a model that can handle three different tasks: (1) closed-loop driving, (2) vision-language understanding, and (3) language-action alignment. Our model SimLingo is based on a vision-language model (VLM) and works using camera input only, excluding expensive sensors like LiDAR. SimLingo obtains state-of-the-art performance on the widely used CARLA simulator on the Bench2Drive benchmark and is the winning entry of the CARLA Challenge 2024. Additionally, we achieve strong results on a wide variety of language-related tasks while maintaining high driving performance.
CarLLaVA: Vision language models for camera-only closed-loop driving
Renz, Katrin, Chen, Long, Marcu, Ana-Maria, Hünermann, Jan, Hanotte, Benoit, Karnsund, Alice, Shotton, Jamie, Arani, Elahe, Sinavski, Oleg
In this technical report, we present CarLLaVA, a Vision Language Model (VLM) for autonomous driving, developed for the CARLA Autonomous Driving Challenge 2.0. CarLLaVA uses the vision encoder of the LLaVA VLM and the LLaMA architecture as its backbone, achieving state-of-the-art closed-loop driving performance with only camera input and without the need for complex or expensive labels. Additionally, we show preliminary results on predicting language commentary alongside the driving output. CarLLaVA uses a semi-disentangled output representation of both path predictions and waypoints, combining the advantages of the path for better lateral control with those of the waypoints for better longitudinal control. We propose an efficient training recipe to train on large driving datasets without wasting compute on easy, trivial data. CarLLaVA ranks first in the sensor track of the CARLA Autonomous Driving Challenge 2.0, outperforming the previous state of the art by 458% and the best concurrent submission by 32.6%.
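The semi-disentangled output can be pictured with a short sketch: a geometric path drives lateral control while time-spaced waypoints drive longitudinal control. This is an illustrative assumption of how such outputs could be consumed, not the CarLLaVA controller; all function names and constants are hypothetical.

```python
# Minimal sketch (assumption, not the CarLLaVA controller): a geometric path
# feeds lateral control and time-indexed waypoints feed longitudinal control.
import numpy as np

def lateral_steer(path_xy: np.ndarray, lookahead_m: float = 4.0) -> float:
    """Pure-pursuit-style steering from the predicted path (ego frame, x forward)."""
    dists = np.linalg.norm(path_xy, axis=1)
    target = path_xy[np.argmin(np.abs(dists - lookahead_m))]
    # Curvature toward the lookahead point; steering is proportional to it.
    return 2.0 * target[1] / max(np.dot(target, target), 1e-6)

def longitudinal_speed(waypoints_xy: np.ndarray, dt: float = 0.5) -> float:
    """Target speed from the spacing of time-indexed waypoints (metres per dt)."""
    steps = np.linalg.norm(np.diff(waypoints_xy, axis=0), axis=1)
    return float(steps.mean() / dt)

path = np.array([[i * 0.5, 0.02 * i**2] for i in range(1, 20)])   # gentle left curve
wps = np.array([[0, 0], [2.5, 0.1], [5.0, 0.4], [7.4, 0.9]])      # spaced 0.5 s apart
print(lateral_steer(path), longitudinal_speed(wps))
```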
Mitigating Interference in the Knowledge Continuum through Attention-Guided Incremental Learning
Bhat, Prashant, Renjith, Bharath, Arani, Elahe, Zonooz, Bahram
Continual learning (CL) remains a significant challenge for deep neural networks, as they are prone to forgetting previously acquired knowledge. Several approaches have been proposed in the literature, such as experience rehearsal, regularization, and parameter isolation, to address this problem. Although almost zero forgetting can be achieved in task-incremental learning, class-incremental learning remains highly challenging due to the problem of inter-task class separation: limited access to previous task data makes it difficult to discriminate between classes of current and previous tasks. To address this issue, we propose 'Attention-Guided Incremental Learning' (AGILE), a novel rehearsal-based CL approach that incorporates compact task attention to effectively reduce interference between tasks. AGILE utilizes lightweight, learnable task projection vectors to transform the latent representations of a shared task-attention module toward the task distribution. Through extensive empirical evaluation, we show that AGILE significantly improves generalization performance by mitigating task interference, outperforming rehearsal-based approaches in several CL scenarios. Furthermore, AGILE scales well to a large number of tasks with minimal overhead while remaining well-calibrated with reduced task-recency bias.

In recent years, deep neural networks (DNNs) have been shown to perform better than humans on certain specific tasks, such as Atari games (Silver et al., 2018) and classification (He et al., 2015). Although impressive, these models are trained on static data and are unable to adapt their behavior to novel tasks while maintaining performance on previous tasks when the data evolve over time (Fedus et al., 2020). Continual learning (CL) refers to a training paradigm in which DNNs are exposed to a sequence of tasks and are expected to learn them, potentially incrementally or online (Parisi et al., 2019). CL has remained one of the most daunting tasks for DNNs, as acquiring new information significantly deteriorates the performance on previously learned tasks, a phenomenon termed "catastrophic forgetting" (French, 1999; McCloskey & Cohen, 1989).
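As a rough illustration of compact task attention, the sketch below applies a learnable per-task projection vector to the output of a shared attention module. This is a hedged reading of the idea, not the released AGILE code; all names are hypothetical.

```python
# Hedged sketch (our reading, not the AGILE implementation): a shared
# attention module steered by a lightweight, learnable per-task vector.
import torch
import torch.nn as nn

class TaskAttention(nn.Module):
    def __init__(self, dim: int, num_tasks: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        # One compact projection vector per task -- far cheaper than a head per task.
        self.task_vectors = nn.Parameter(torch.randn(num_tasks, dim) * 0.02)

    def forward(self, h: torch.Tensor, task_id: int) -> torch.Tensor:
        # h: (batch, tokens, dim) latent features from the shared backbone.
        shared, _ = self.attn(h, h, h)
        # Scale the shared representation toward the task's distribution.
        return shared * (1.0 + self.task_vectors[task_id])

m = TaskAttention(dim=64, num_tasks=5)
out = m(torch.randn(2, 10, 64), task_id=3)   # -> (2, 10, 64)
```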
Beyond Unimodal Learning: The Importance of Integrating Multiple Modalities for Lifelong Learning
Sarfraz, Fahad, Zonooz, Bahram, Arani, Elahe
While humans excel at continual learning (CL), deep neural networks (DNNs) exhibit catastrophic forgetting. A salient feature of the brain that allows effective CL is that it utilizes multiple modalities for learning and inference, a property that is underexplored in DNNs. Therefore, we study the role and interactions of multiple modalities in mitigating forgetting and introduce a benchmark for multimodal continual learning. Our findings demonstrate that leveraging multiple views and complementary information from multiple modalities enables the model to learn more accurate and robust representations. This makes the model less vulnerable to modality-specific regularities and considerably mitigates forgetting. Furthermore, we observe that individual modalities exhibit varying degrees of robustness to distribution shift. Finally, we propose a method for integrating and aligning the information from different modalities by utilizing the relational structural similarities between the data points in each modality. Our method sets a strong baseline that enables both single- and multimodal inference. Our study provides a promising case for further exploring the role of multiple modalities in enabling CL and provides a standard benchmark for future research.

Lifelong learning requires the learning agent to continuously adapt to new data while retaining and consolidating previously learned knowledge. This ability is essential for the deployment of deep neural networks (DNNs) in numerous real-world applications. However, one critical issue in enabling continual learning (CL) in DNNs is catastrophic forgetting, whereby the model drastically forgets previously acquired knowledge when required to learn new tasks in sequence (McCloskey & Cohen, 1989).
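One plausible instantiation of aligning modalities via relational structural similarities is to match the batch-wise pairwise-similarity matrices computed within each modality. The sketch below is an assumed formulation for illustration, not the paper's exact loss.

```python
# Illustrative sketch (assumed formulation, not the paper's exact loss):
# align two modalities by matching their within-batch relational structure.
import torch
import torch.nn.functional as F

def relational_alignment_loss(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
    """feat_a, feat_b: (batch, dim) embeddings of the same samples in two modalities."""
    za = F.normalize(feat_a, dim=1)
    zb = F.normalize(feat_b, dim=1)
    sim_a = za @ za.t()          # (batch, batch) relations within modality A
    sim_b = zb @ zb.t()          # (batch, batch) relations within modality B
    # Penalize disagreement between the two relational structures; note the
    # embedding dimensions need not match, only the batch of samples does.
    return F.mse_loss(sim_a, sim_b)

loss = relational_alignment_loss(torch.randn(8, 128), torch.randn(8, 256))
```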
IMEX-Reg: Implicit-Explicit Regularization in the Function Space for Continual Learning
Bhat, Prashant, Renjith, Bharath, Arani, Elahe, Zonooz, Bahram
Continual learning (CL) remains one of the long-standing challenges for deep neural networks due to catastrophic forgetting of previously acquired knowledge. Although rehearsal-based approaches have been fairly successful in mitigating catastrophic forgetting, they suffer from overfitting on buffered samples and prior information loss, hindering generalization under low-buffer regimes. Inspired by how humans learn using strong inductive biases, we propose IMEX-Reg to improve the generalization performance of experience rehearsal in CL under low-buffer regimes. Specifically, we employ a two-pronged implicit-explicit regularization approach using contrastive representation learning (CRL) and consistency regularization. To further leverage the global relationship between representations learned using CRL, we propose a regularization strategy to guide the classifier toward the activation correlations in the unit hypersphere of the CRL. Our results show that IMEX-Reg significantly improves generalization performance and outperforms rehearsal-based approaches in several CL scenarios. It is also robust to natural and adversarial corruptions with less task-recency bias. Additionally, we provide theoretical insights to further support our design decisions.
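A hedged sketch of such a two-pronged objective: an implicit term from contrastive representation learning (here a standard NT-Xent loss) plus an explicit consistency term on buffered logits. The exact losses and weighting in IMEX-Reg may differ; this only illustrates the shape of the recipe.

```python
# Rough sketch (our assumption, not the official IMEX-Reg code): experience
# rehearsal with an implicit contrastive term and an explicit consistency term.
import torch
import torch.nn.functional as F

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """Contrastive (implicit) term over two augmented views, (batch, dim) each."""
    z = F.normalize(torch.cat([z1, z2]), dim=1)
    sim = z @ z.t() / tau
    sim.fill_diagonal_(float("-inf"))        # exclude self-similarity
    n = z1.size(0)
    # The positive for view i is the other view of the same sample.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

def consistency(logits: torch.Tensor, buffered_logits: torch.Tensor) -> torch.Tensor:
    """Explicit term: keep predictions on buffer samples close to stored ones."""
    return F.mse_loss(logits, buffered_logits)

# total = ce_loss + alpha * nt_xent(z1, z2) + beta * consistency(cur, stored)
```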
Can We Break Free from Strong Data Augmentations in Self-Supervised Learning?
Gowda, Shruthi, Arani, Elahe, Zonooz, Bahram
Self-supervised learning (SSL) has emerged as a promising solution for addressing the challenge of limited labeled data in deep neural networks (DNNs), offering scalability potential. However, the impact of design dependencies within the SSL framework remains insufficiently investigated. In this study, we comprehensively explore SSL behavior across a spectrum of augmentations, revealing their crucial role in shaping SSL model performance and learning mechanisms. Leveraging these insights, we propose a novel learning approach that integrates prior knowledge, with the aim of curtailing the need for extensive data augmentations and thereby amplifying the efficacy of learned representations. Notably, our findings underscore that SSL models imbued with prior knowledge exhibit reduced texture bias, diminished reliance on shortcuts and augmentations, and improved robustness against both natural and adversarial corruptions. These findings not only illuminate a new direction in SSL research, but also pave the way for enhancing DNN performance while concurrently alleviating the imperative for intensive data augmentation, thereby enhancing scalability and real-world problem-solving capabilities.

Deep neural networks (DNNs) have proven to be highly effective in encoding patterns in data distributions to produce powerful and rich representations that have improved generalization performance across various perception tasks, such as classification, detection, and segmentation. However, one of their major limitations is that DNNs are data-hungry, and annotating millions of available data points is expensive. Self-supervised learning (SSL) has been proposed as a promising solution to this issue, enabling the learning of useful representations without manual annotations. The self-supervised learning paradigm needs to ensure that the resulting features are generic enough to be applicable to a wide range of real-world applications. Various SSL methods have been proposed, including pretext-based (Gidaris et al., 2018; Noroozi & Favaro, 2016) and contrastive-based (Chen et al., 2020a; He et al.,

[Figure 1: The impact of augmentations on SSL methods is critical, as removing strong augmentations from SSL training can result in a significant drop in their performance.]
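For context on what "strong augmentations" means here, compare a typical SimCLR-style pipeline with a minimal one. This is the standard torchvision recipe used across the SSL literature, not this paper's code or its exact ablation settings.

```python
# Context sketch (standard SimCLR-style recipe, not this paper's code):
# the "strong" augmentations typical SSL methods rely on, versus a minimal set.
from torchvision import transforms

strong = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.GaussianBlur(kernel_size=23),
    transforms.ToTensor(),
])

minimal = transforms.Compose([    # the weaker end of the augmentation spectrum
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
```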
Conserve-Update-Revise to Cure Generalization and Robustness Trade-off in Adversarial Training
Gowda, Shruthi, Zonooz, Bahram, Arani, Elahe
Adversarial training improves the robustness of neural networks against adversarial attacks, albeit at the expense of a trade-off between standard and robust generalization. To unveil the underlying factors driving this phenomenon, we examine the layer-wise learning capabilities of neural networks during the transition from a standard to an adversarial setting. Our empirical findings demonstrate that selectively updating specific layers while preserving others can substantially enhance the network's learning capacity. We therefore propose CURE, a novel training framework that leverages a gradient-prominence criterion to perform selective conservation, updating, and revision of weights. Importantly, CURE is designed to be dataset- and architecture-agnostic, ensuring its applicability across various scenarios. It effectively tackles both memorization and overfitting issues, thus improving the trade-off between robustness and generalization; additionally, this training approach aids in mitigating "robust overfitting". Furthermore, our study provides valuable insights into the mechanisms of selective adversarial training and offers a promising avenue for future research.

The susceptibility of deep neural networks (DNNs) to adversarial attacks (Szegedy et al., 2014; Goodfellow et al., 2015) continues to present a substantial challenge in the field. Adversarial training has emerged as a promising strategy to enhance the robustness of DNNs against adversarial attacks (Madry et al., 2018; Zhang et al., 2019; Tramèr et al., 2018; Wang et al., 2019). However, transitioning from standard training on natural images to adversarial training introduces distinct behavior patterns. Despite the benefits of adversarial training in improving robustness, it often results in compromised performance on clean images, creating a noticeable trade-off between standard and adversarial generalization (Raghunathan et al., 2019). Another intriguing observation is that, in contrast to the standard setting, longer durations of adversarial training can paradoxically lead to reduced test performance. This generalization gap in robustness between training and testing data, commonly referred to as robust overfitting (Rice et al., 2020), is prevalent in adversarial training. Therefore, it is imperative to gain a deeper understanding of the underlying factors driving these behaviors to advance the development of reliable and trustworthy AI systems. Few studies have attempted to understand learning behavior in an adversarial setting.
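Schematically, a gradient-prominence criterion can be read as masking updates so that only the most strongly affected weights are revised while the rest are conserved. The sketch below is an assumed mechanic for illustration, not the released CURE framework; the top-k threshold rule is hypothetical.

```python
# Schematic sketch (assumed mechanics, not the CURE release): a gradient-
# prominence mask that updates prominent weights and conserves the rest.
import torch

def prominence_masks(model: torch.nn.Module, keep_frac: float = 0.3) -> dict:
    """Call after loss.backward(): keep the top `keep_frac` of weights by |grad|."""
    masks = {}
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        g = p.grad.abs().flatten()
        k = max(1, int(keep_frac * g.numel()))
        thresh = torch.topk(g, k).values[-1]          # k-th largest gradient magnitude
        masks[name] = (p.grad.abs() >= thresh).float()  # 1 = update, 0 = conserve
    return masks

# Applied before optimizer.step():
#   for name, p in model.named_parameters():
#       if p.grad is not None:
#           p.grad.mul_(masks[name])
```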
Transformers in Unsupervised Structure-from-Motion
Chawla, Hemang, Varma, Arnav, Arani, Elahe, Zonooz, Bahram
Transformers have revolutionized deep-learning-based computer vision, with improved performance as well as robustness to natural corruptions and adversarial attacks. Transformers are used predominantly for 2D vision tasks, including image classification, semantic segmentation, and object detection. However, robots and advanced driver assistance systems also require 3D scene understanding for decision making by extracting structure-from-motion (SfM). We propose a robust transformer-based monocular SfM method that learns to simultaneously predict monocular pixel-wise depth, the ego vehicle's translation and rotation, and the camera's focal length and principal point. With experiments on the KITTI and DDAD datasets, we demonstrate how to adapt different vision transformers and compare them against contemporary CNN-based methods. Our study shows that transformer-based architectures, though lower in run-time efficiency, achieve comparable performance while being more robust against natural corruptions as well as untargeted and targeted attacks.
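The usual training signal for unsupervised SfM models of this kind is a photometric reprojection loss: warp a source frame into the target view using the predicted depth, pose, and intrinsics, then compare images. The sketch below shows this standard formulation in general, not this paper's exact loss or hyperparameters.

```python
# Background sketch: the standard self-supervised SfM objective (view synthesis
# with a photometric penalty), as commonly used in the literature.
import torch
import torch.nn.functional as F

def photometric_reprojection(tgt, src, depth, K, T):
    """tgt, src: (B,3,H,W); depth: (B,1,H,W); K: (B,3,3); T: (B,4,4) tgt->src pose."""
    B, _, H, W = tgt.shape
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    ones = torch.ones_like(xs)
    pix = torch.stack([xs, ys, ones]).view(1, 3, -1).expand(B, -1, -1)  # (B,3,HW)
    cam = torch.linalg.inv(K) @ pix * depth.view(B, 1, -1)              # backproject
    cam_h = torch.cat([cam, torch.ones(B, 1, H * W)], dim=1)            # homogeneous
    src_cam = (T @ cam_h)[:, :3]                                        # move to src frame
    src_pix = K @ src_cam                                               # reproject
    src_pix = src_pix[:, :2] / src_pix[:, 2:3].clamp(min=1e-6)          # perspective divide
    # Normalize to [-1, 1] for grid_sample and warp the source image.
    gx = src_pix[:, 0] / (W - 1) * 2 - 1
    gy = src_pix[:, 1] / (H - 1) * 2 - 1
    grid = torch.stack([gx, gy], dim=-1).view(B, H, W, 2)
    warped = F.grid_sample(src, grid, align_corners=True)
    return (warped - tgt).abs().mean()                                  # L1 photometric
```

Because depth, pose, and intrinsics all enter this single differentiable warp, the network can be trained to predict all three jointly from unlabeled video.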
Continual Learning of Unsupervised Monocular Depth from Videos
Chawla, Hemang, Varma, Arnav, Arani, Elahe, Zonooz, Bahram
Spatial scene understanding, including monocular depth estimation, is an important problem in various applications, such as robotics and autonomous driving. While improvements in unsupervised monocular depth estimation have potentially allowed models to be trained on diverse crowdsourced videos, this remains underexplored as most methods utilize the standard training protocol, wherein the models are trained from scratch on all data after new data is collected. Instead, continual training of models on sequentially collected data would significantly reduce computational and memory costs. Nevertheless, naive continual training leads to catastrophic forgetting, where the model performance deteriorates on older domains as it learns on newer domains, highlighting the trade-off between model stability and plasticity. While several techniques have been proposed to address this issue in image classification, the high-dimensional and spatiotemporally correlated outputs of depth estimation make it a distinct challenge. To the best of our knowledge, no framework or method currently exists focusing on the problem of continual learning in depth estimation. Thus, we introduce a framework that captures the challenges of continual unsupervised depth estimation (CUDE), and define the necessary metrics to evaluate model performance. We propose a rehearsal-based dual-memory method, MonoDepthCL, which utilizes spatiotemporal consistency for continual learning in depth estimation, even when the camera intrinsics are unknown.
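As a speculative sketch of a rehearsal-based dual-memory design, one memory can be an episodic replay buffer filled by reservoir sampling, the other a slow exponential-moving-average copy of the model that supplies consistency targets. Names and details below are assumptions for illustration, not the MonoDepthCL release.

```python
# Speculative sketch (not the MonoDepthCL code): dual-memory rehearsal with
# an episodic clip buffer plus a slow EMA "stable" model for consistency.
import copy
import random
import torch

class DualMemory:
    def __init__(self, model: torch.nn.Module, capacity: int = 200, ema: float = 0.999):
        self.buffer = []          # episodic memory: (frame_t, frame_t+1) clips
        self.capacity = capacity
        self.seen = 0
        self.ema = ema
        self.stable = copy.deepcopy(model)   # semantic memory: slow copy of the model
        for p in self.stable.parameters():
            p.requires_grad_(False)

    def store(self, clip):
        """Reservoir sampling keeps the buffer representative of all domains seen."""
        self.seen += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append(clip)
        elif random.random() < self.capacity / self.seen:
            self.buffer[random.randrange(self.capacity)] = clip

    def update_stable(self, model):
        """Slow consolidation: exponential moving average of the working model."""
        with torch.no_grad():
            for ps, pw in zip(self.stable.parameters(), model.parameters()):
                ps.mul_(self.ema).add_(pw, alpha=1 - self.ema)

    def replay(self, k: int = 4):
        return random.sample(self.buffer, min(k, len(self.buffer)))
```

The stable copy's depth predictions on replayed clips can then serve as spatiotemporal consistency targets for the working model, trading off plasticity on new domains against stability on old ones.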