Arani, Elahe
GAIA-2: A Controllable Multi-View Generative World Model for Autonomous Driving
Russell, Lloyd, Hu, Anthony, Bertoni, Lorenzo, Fedoseev, George, Shotton, Jamie, Arani, Elahe, Corrado, Gianluca
Generative models offer a scalable and flexible paradigm for simulating complex environments, yet current approaches fall short in addressing the domain-specific requirements of autonomous driving, such as multi-agent interactions, fine-grained control, and multi-camera consistency. We introduce GAIA-2, Generative AI for Autonomy, a latent diffusion world model that unifies these capabilities within a single generative framework. GAIA-2 supports controllable video generation conditioned on a rich set of structured inputs: ego-vehicle dynamics, agent configurations, environmental factors, and road semantics. It generates high-resolution, spatiotemporally consistent multi-camera videos across geographically diverse driving environments (UK, US, Germany). The model integrates both structured conditioning and external latent embeddings (e.g., from a proprietary driving model) to facilitate flexible and semantically grounded scene synthesis. Through this integration, GAIA-2 enables scalable simulation of both common and rare driving scenarios, advancing the use of generative world models as a core tool in the development of autonomous systems. Videos are available at https://wayve.ai/thinking/gaia-2.
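To make the notion of structured conditioning concrete, here is a minimal Python sketch of what a conditioning bundle for such a world model might look like. All class and field names are hypothetical illustrations, not the GAIA-2 API.

```python
# Hypothetical sketch (not the GAIA-2 interface): a structured conditioning
# bundle combining ego dynamics, agents, environment, and external latents.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class EgoDynamics:
    speed_mps: float          # forward speed in metres per second
    curvature: float          # signed path curvature (1/m)

@dataclass
class AgentBox:
    category: str             # e.g. "car", "pedestrian"
    position_m: tuple         # (x, y, z) in the ego frame
    heading_rad: float

@dataclass
class SceneConditioning:
    ego: EgoDynamics
    agents: List[AgentBox] = field(default_factory=list)
    weather: str = "clear"            # environmental factor
    country: str = "UK"               # geography (UK / US / Germany)
    external_embedding: Optional[list] = None  # e.g. latents from a driving model

# A generation call would then consume this bundle alongside noise latents,
# e.g.: video = world_model.sample(conditioning=SceneConditioning(...), num_cameras=5)
```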
SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment
Renz, Katrin, Chen, Long, Arani, Elahe, Sinavski, Oleg
Integrating large language models (LLMs) into autonomous driving has attracted significant attention with the hope of improving generalization and explainability. However, existing methods often focus on either driving or vision-language understanding, and achieving both high driving performance and extensive language understanding remains challenging. In addition, the dominant approach to vision-language understanding is visual question answering. For autonomous driving, however, this is only useful if it is aligned with the action space; otherwise, the model's answers could be inconsistent with its behavior. Therefore, we propose a model that can handle three different tasks: (1) closed-loop driving, (2) vision-language understanding, and (3) language-action alignment. Our model SimLingo is based on a vision-language model (VLM) and works using camera input only, excluding expensive sensors like LiDAR. SimLingo obtains state-of-the-art performance on the widely used CARLA simulator on the Bench2Drive benchmark and is the winning entry of the CARLA Challenge 2024. Additionally, we achieve strong results on a wide variety of language-related tasks while maintaining high driving performance.
CarLLaVA: Vision language models for camera-only closed-loop driving
Renz, Katrin, Chen, Long, Marcu, Ana-Maria, Hünermann, Jan, Hanotte, Benoit, Karnsund, Alice, Shotton, Jamie, Arani, Elahe, Sinavski, Oleg
In this technical report, we present CarLLaVA, a Vision Language Model (VLM) for autonomous driving, developed for the CARLA Autonomous Driving Challenge 2.0. CarLLaVA uses the vision encoder of the LLaVA VLM and the LLaMA architecture as its backbone, achieving state-of-the-art closed-loop driving performance with only camera input and without the need for complex or expensive labels. Additionally, we show preliminary results on predicting language commentary alongside the driving output. CarLLaVA uses a semi-disentangled output representation of both path predictions and waypoints, combining the advantages of the path for better lateral control with those of the waypoints for better longitudinal control. We propose an efficient training recipe to train on large driving datasets without wasting compute on easy, trivial data. CarLLaVA ranks first in the sensor track of the CARLA Autonomous Driving Challenge 2.0, outperforming the previous state of the art by 458% and the best concurrent submission by 32.6%.
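The semi-disentangled output can be pictured with a short sketch: a geometric path drives lateral control while time-spaced waypoints drive longitudinal control. This is an illustrative assumption of how such outputs could be consumed, not the CarLLaVA controller; all function names and constants are hypothetical.

```python
# Minimal sketch (assumption, not the CarLLaVA controller): a geometric path
# feeds lateral control and time-indexed waypoints feed longitudinal control.
import numpy as np

def lateral_steer(path_xy: np.ndarray, lookahead_m: float = 4.0) -> float:
    """Pure-pursuit-style steering from the predicted path (ego frame, x forward)."""
    dists = np.linalg.norm(path_xy, axis=1)
    target = path_xy[np.argmin(np.abs(dists - lookahead_m))]
    # Curvature toward the lookahead point; steering is proportional to it.
    return 2.0 * target[1] / max(np.dot(target, target), 1e-6)

def longitudinal_speed(waypoints_xy: np.ndarray, dt: float = 0.5) -> float:
    """Target speed from the spacing of time-indexed waypoints (metres per dt)."""
    steps = np.linalg.norm(np.diff(waypoints_xy, axis=0), axis=1)
    return float(steps.mean() / dt)

path = np.array([[i * 0.5, 0.02 * i**2] for i in range(1, 20)])   # gentle left curve
wps = np.array([[0, 0], [2.5, 0.1], [5.0, 0.4], [7.4, 0.9]])      # spaced 0.5 s apart
print(lateral_steer(path), longitudinal_speed(wps))
```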
Mitigating Interference in the Knowledge Continuum through Attention-Guided Incremental Learning
Bhat, Prashant, Renjith, Bharath, Arani, Elahe, Zonooz, Bahram
Continual learning (CL) remains a significant challenge for deep neural networks, as they are prone to forgetting previously acquired knowledge. Several approaches have been proposed in the literature, such as experience rehearsal, regularization, and parameter isolation, to address this problem. Although almost zero forgetting can be achieved in task-incremental learning, class-incremental learning remains highly challenging due to the problem of inter-task class separation: limited access to previous task data makes it difficult to discriminate between classes of current and previous tasks. To address this issue, we propose 'Attention-Guided Incremental Learning' (AGILE), a novel rehearsal-based CL approach that incorporates compact task attention to effectively reduce interference between tasks. AGILE utilizes lightweight, learnable task projection vectors to transform the latent representations of a shared task-attention module toward the task distribution. Through extensive empirical evaluation, we show that AGILE significantly improves generalization performance by mitigating task interference, outperforming rehearsal-based approaches in several CL scenarios. Furthermore, AGILE scales well to a large number of tasks with minimal overhead while remaining well-calibrated with reduced task-recency bias.

In recent years, deep neural networks (DNNs) have been shown to perform better than humans on certain specific tasks, such as Atari games (Silver et al., 2018) and classification (He et al., 2015). Although impressive, these models are trained on static data and are unable to adapt their behavior to novel tasks while maintaining performance on previous tasks when the data evolve over time (Fedus et al., 2020). Continual learning (CL) refers to a training paradigm in which DNNs are exposed to a sequence of tasks and are expected to learn them, potentially incrementally or online (Parisi et al., 2019). CL has remained one of the most daunting tasks for DNNs, as acquiring new information significantly deteriorates the performance on previously learned tasks, a phenomenon termed "catastrophic forgetting" (French, 1999; McCloskey & Cohen, 1989).
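As a rough illustration of compact task attention, the sketch below applies a learnable per-task projection vector to the output of a shared attention module. This is a hedged reading of the idea, not the released AGILE code; all names are hypothetical.

```python
# Hedged sketch (our reading, not the AGILE implementation): a shared
# attention module steered by a lightweight, learnable per-task vector.
import torch
import torch.nn as nn

class TaskAttention(nn.Module):
    def __init__(self, dim: int, num_tasks: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        # One compact projection vector per task -- far cheaper than a head per task.
        self.task_vectors = nn.Parameter(torch.randn(num_tasks, dim) * 0.02)

    def forward(self, h: torch.Tensor, task_id: int) -> torch.Tensor:
        # h: (batch, tokens, dim) latent features from the shared backbone.
        shared, _ = self.attn(h, h, h)
        # Scale the shared representation toward the task's distribution.
        return shared * (1.0 + self.task_vectors[task_id])

m = TaskAttention(dim=64, num_tasks=5)
out = m(torch.randn(2, 10, 64), task_id=3)   # -> (2, 10, 64)
```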
Beyond Unimodal Learning: The Importance of Integrating Multiple Modalities for Lifelong Learning
Sarfraz, Fahad, Zonooz, Bahram, Arani, Elahe
While humans excel at continual learning (CL), deep neural networks (DNNs) exhibit catastrophic forgetting. A salient feature of the brain that allows effective CL is that it utilizes multiple modalities for learning and inference, a property that is underexplored in DNNs. Therefore, we study the role and interactions of multiple modalities in mitigating forgetting and introduce a benchmark for multimodal continual learning. Our findings demonstrate that leveraging multiple views and complementary information from multiple modalities enables the model to learn more accurate and robust representations. This makes the model less vulnerable to modality-specific regularities and considerably mitigates forgetting. Furthermore, we observe that individual modalities exhibit varying degrees of robustness to distribution shift. Finally, we propose a method for integrating and aligning the information from different modalities by utilizing the relational structural similarities between the data points in each modality. Our method sets a strong baseline that enables both single- and multimodal inference. Our study provides a promising case for further exploring the role of multiple modalities in enabling CL and provides a standard benchmark for future research.

Lifelong learning requires the learning agent to continuously adapt to new data while retaining and consolidating previously learned knowledge. This ability is essential for the deployment of deep neural networks (DNNs) in numerous real-world applications. However, one critical issue in enabling continual learning (CL) in DNNs is catastrophic forgetting, whereby the model drastically forgets previously acquired knowledge when required to learn new tasks in sequence (McCloskey & Cohen, 1989).
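One plausible instantiation of aligning modalities via relational structural similarities is to match the batch-wise pairwise-similarity matrices computed within each modality. The sketch below is an assumed formulation for illustration, not the paper's exact loss.

```python
# Illustrative sketch (assumed formulation, not the paper's exact loss):
# align two modalities by matching their within-batch relational structure.
import torch
import torch.nn.functional as F

def relational_alignment_loss(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
    """feat_a, feat_b: (batch, dim) embeddings of the same samples in two modalities."""
    za = F.normalize(feat_a, dim=1)
    zb = F.normalize(feat_b, dim=1)
    sim_a = za @ za.t()          # (batch, batch) relations within modality A
    sim_b = zb @ zb.t()          # (batch, batch) relations within modality B
    # Penalize disagreement between the two relational structures; note the
    # embedding dimensions need not match, only the batch of samples does.
    return F.mse_loss(sim_a, sim_b)

loss = relational_alignment_loss(torch.randn(8, 128), torch.randn(8, 256))
```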
IMEX-Reg: Implicit-Explicit Regularization in the Function Space for Continual Learning
Bhat, Prashant, Renjith, Bharath, Arani, Elahe, Zonooz, Bahram
Continual learning (CL) remains one of the long-standing challenges for deep neural networks due to catastrophic forgetting of previously acquired knowledge. Although rehearsal-based approaches have been fairly successful in mitigating catastrophic forgetting, they suffer from overfitting on buffered samples and prior information loss, hindering generalization under low-buffer regimes. Inspired by how humans learn using strong inductive biases, we propose IMEX-Reg to improve the generalization performance of experience rehearsal in CL under low-buffer regimes. Specifically, we employ a two-pronged implicit-explicit regularization approach using contrastive representation learning (CRL) and consistency regularization. To further leverage the global relationship between representations learned using CRL, we propose a regularization strategy to guide the classifier toward the activation correlations in the unit hypersphere of the CRL. Our results show that IMEX-Reg significantly improves generalization performance and outperforms rehearsal-based approaches in several CL scenarios. It is also robust to natural and adversarial corruptions with less task-recency bias. Additionally, we provide theoretical insights to further support our design decisions.
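A hedged sketch of such a two-pronged objective: an implicit term from contrastive representation learning (here a standard NT-Xent loss) plus an explicit consistency term on buffered logits. The exact losses and weighting in IMEX-Reg may differ; this only illustrates the shape of the recipe.

```python
# Rough sketch (our assumption, not the official IMEX-Reg code): experience
# rehearsal with an implicit contrastive term and an explicit consistency term.
import torch
import torch.nn.functional as F

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """Contrastive (implicit) term over two augmented views, (batch, dim) each."""
    z = F.normalize(torch.cat([z1, z2]), dim=1)
    sim = z @ z.t() / tau
    sim.fill_diagonal_(float("-inf"))        # exclude self-similarity
    n = z1.size(0)
    # The positive for view i is the other view of the same sample.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

def consistency(logits: torch.Tensor, buffered_logits: torch.Tensor) -> torch.Tensor:
    """Explicit term: keep predictions on buffer samples close to stored ones."""
    return F.mse_loss(logits, buffered_logits)

# total = ce_loss + alpha * nt_xent(z1, z2) + beta * consistency(cur, stored)
```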
Can We Break Free from Strong Data Augmentations in Self-Supervised Learning?
Gowda, Shruthi, Arani, Elahe, Zonooz, Bahram
Self-supervised learning (SSL) has emerged as a promising solution for addressing the challenge of limited labeled data in deep neural networks (DNNs), offering scalability potential. However, the impact of design dependencies within the SSL framework remains insufficiently investigated. In this study, we comprehensively explore SSL behavior across a spectrum of augmentations, revealing their crucial role in shaping SSL model performance and learning mechanisms. Leveraging these insights, we propose a novel learning approach that integrates prior knowledge, with the aim of curtailing the need for extensive data augmentations and thereby amplifying the efficacy of learned representations. Notably, our findings underscore that SSL models imbued with prior knowledge exhibit reduced texture bias, diminished reliance on shortcuts and augmentations, and improved robustness against both natural and adversarial corruptions. These findings not only illuminate a new direction in SSL research, but also pave the way for enhancing DNN performance while concurrently alleviating the imperative for intensive data augmentation, thereby enhancing scalability and real-world problem-solving capabilities.

Deep neural networks (DNNs) have proven to be highly effective in encoding patterns in data distributions to produce powerful and rich representations that have improved generalization performance across various perception tasks, such as classification, detection, and segmentation. However, one of their major limitations is that DNNs are data-hungry, and annotating millions of available data points is expensive. Self-supervised learning (SSL) has been proposed as a promising solution to this issue, enabling the learning of useful representations without manual annotations. The self-supervised learning paradigm needs to ensure that the resulting features are generic enough to be applicable to a wide range of real-world applications. Various SSL methods have been proposed, including pretext-based (Gidaris et al., 2018; Noroozi & Favaro, 2016) and contrastive-based (Chen et al., 2020a; He et al.,

[Figure 1: The impact of augmentations on SSL methods is critical, as removing strong augmentations from SSL training can result in a significant drop in their performance.]
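For context on what "strong augmentations" means here, compare a typical SimCLR-style pipeline with a minimal one. This is the standard torchvision recipe used across the SSL literature, not this paper's code or its exact ablation settings.

```python
# Context sketch (standard SimCLR-style recipe, not this paper's code):
# the "strong" augmentations typical SSL methods rely on, versus a minimal set.
from torchvision import transforms

strong = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.GaussianBlur(kernel_size=23),
    transforms.ToTensor(),
])

minimal = transforms.Compose([    # the weaker end of the augmentation spectrum
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
```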
Conserve-Update-Revise to Cure Generalization and Robustness Trade-off in Adversarial Training
Gowda, Shruthi, Zonooz, Bahram, Arani, Elahe
Adversarial training improves the robustness of neural networks against adversarial attacks, albeit at the expense of a trade-off between standard and robust generalization. To unveil the underlying factors driving this phenomenon, we examine the layer-wise learning capabilities of neural networks during the transition from a standard to an adversarial setting. Our empirical findings demonstrate that selectively updating specific layers while preserving others can substantially enhance the network's learning capacity. We therefore propose CURE, a novel training framework that leverages a gradient-prominence criterion to perform selective conservation, updating, and revision of weights. Importantly, CURE is designed to be dataset- and architecture-agnostic, ensuring its applicability across various scenarios. It effectively tackles both memorization and overfitting issues, thus improving the trade-off between robustness and generalization; additionally, this training approach aids in mitigating "robust overfitting". Furthermore, our study provides valuable insights into the mechanisms of selective adversarial training and offers a promising avenue for future research.

The susceptibility of deep neural networks (DNNs) to adversarial attacks (Szegedy et al., 2014; Goodfellow et al., 2015) continues to present a substantial challenge in the field. Adversarial training has emerged as a promising strategy to enhance the robustness of DNNs against adversarial attacks (Madry et al., 2018; Zhang et al., 2019; Tramèr et al., 2018; Wang et al., 2019). However, transitioning from standard training on natural images to adversarial training introduces distinct behavior patterns. Despite the benefits of adversarial training in improving robustness, it often results in compromised performance on clean images, creating a noticeable trade-off between standard and adversarial generalization (Raghunathan et al., 2019). Another intriguing observation is that, in contrast to the standard setting, longer durations of adversarial training can paradoxically lead to reduced test performance. This generalization gap in robustness between training and testing data, commonly referred to as robust overfitting (Rice et al., 2020), is prevalent in adversarial training. Therefore, it is imperative to gain a deeper understanding of the underlying factors driving these behaviors to advance the development of reliable and trustworthy AI systems. Few studies have attempted to understand learning behavior in an adversarial setting.
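Schematically, a gradient-prominence criterion can be read as masking updates so that only the most strongly affected weights are revised while the rest are conserved. The sketch below is an assumed mechanic for illustration, not the released CURE framework; the top-k threshold rule is hypothetical.

```python
# Schematic sketch (assumed mechanics, not the CURE release): a gradient-
# prominence mask that updates prominent weights and conserves the rest.
import torch

def prominence_masks(model: torch.nn.Module, keep_frac: float = 0.3) -> dict:
    """Call after loss.backward(): keep the top `keep_frac` of weights by |grad|."""
    masks = {}
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        g = p.grad.abs().flatten()
        k = max(1, int(keep_frac * g.numel()))
        thresh = torch.topk(g, k).values[-1]          # k-th largest gradient magnitude
        masks[name] = (p.grad.abs() >= thresh).float()  # 1 = update, 0 = conserve
    return masks

# Applied before optimizer.step():
#   for name, p in model.named_parameters():
#       if p.grad is not None:
#           p.grad.mul_(masks[name])
```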
Transformers in Unsupervised Structure-from-Motion
Chawla, Hemang, Varma, Arnav, Arani, Elahe, Zonooz, Bahram
Transformers have revolutionized deep-learning-based computer vision, with improved performance as well as robustness to natural corruptions and adversarial attacks. Transformers are used predominantly for 2D vision tasks, including image classification, semantic segmentation, and object detection. However, robots and advanced driver assistance systems also require 3D scene understanding for decision making by extracting structure-from-motion (SfM). We propose a robust transformer-based monocular SfM method that learns to simultaneously predict monocular pixel-wise depth, the ego vehicle's translation and rotation, and the camera's focal length and principal point. With experiments on the KITTI and DDAD datasets, we demonstrate how to adapt different vision transformers and compare them against contemporary CNN-based methods. Our study shows that transformer-based architectures, though lower in run-time efficiency, achieve comparable performance while being more robust against natural corruptions as well as untargeted and targeted attacks.
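The usual training signal for unsupervised SfM models of this kind is a photometric reprojection loss: warp a source frame into the target view using the predicted depth, pose, and intrinsics, then compare images. The sketch below shows this standard formulation in general, not this paper's exact loss or hyperparameters.

```python
# Background sketch: the standard self-supervised SfM objective (view synthesis
# with a photometric penalty), as commonly used in the literature.
import torch
import torch.nn.functional as F

def photometric_reprojection(tgt, src, depth, K, T):
    """tgt, src: (B,3,H,W); depth: (B,1,H,W); K: (B,3,3); T: (B,4,4) tgt->src pose."""
    B, _, H, W = tgt.shape
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    ones = torch.ones_like(xs)
    pix = torch.stack([xs, ys, ones]).view(1, 3, -1).expand(B, -1, -1)  # (B,3,HW)
    cam = torch.linalg.inv(K) @ pix * depth.view(B, 1, -1)              # backproject
    cam_h = torch.cat([cam, torch.ones(B, 1, H * W)], dim=1)            # homogeneous
    src_cam = (T @ cam_h)[:, :3]                                        # move to src frame
    src_pix = K @ src_cam                                               # reproject
    src_pix = src_pix[:, :2] / src_pix[:, 2:3].clamp(min=1e-6)          # perspective divide
    # Normalize to [-1, 1] for grid_sample and warp the source image.
    gx = src_pix[:, 0] / (W - 1) * 2 - 1
    gy = src_pix[:, 1] / (H - 1) * 2 - 1
    grid = torch.stack([gx, gy], dim=-1).view(B, H, W, 2)
    warped = F.grid_sample(src, grid, align_corners=True)
    return (warped - tgt).abs().mean()                                  # L1 photometric
```

Because depth, pose, and intrinsics all enter this single differentiable warp, the network can be trained to predict all three jointly from unlabeled video.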
Continual Learning of Unsupervised Monocular Depth from Videos
Chawla, Hemang, Varma, Arnav, Arani, Elahe, Zonooz, Bahram
Spatial scene understanding, including monocular depth estimation, is an important problem in various applications, such as robotics and autonomous driving. While improvements in unsupervised monocular depth estimation have potentially allowed models to be trained on diverse crowdsourced videos, this remains underexplored as most methods utilize the standard training protocol, wherein the models are trained from scratch on all data after new data is collected. Instead, continual training of models on sequentially collected data would significantly reduce computational and memory costs. Nevertheless, naive continual training leads to catastrophic forgetting, where the model performance deteriorates on older domains as it learns on newer domains, highlighting the trade-off between model stability and plasticity. While several techniques have been proposed to address this issue in image classification, the high-dimensional and spatiotemporally correlated outputs of depth estimation make it a distinct challenge. To the best of our knowledge, no framework or method currently exists focusing on the problem of continual learning in depth estimation. Thus, we introduce a framework that captures the challenges of continual unsupervised depth estimation (CUDE), and define the necessary metrics to evaluate model performance. We propose a rehearsal-based dual-memory method, MonoDepthCL, which utilizes spatiotemporal consistency for continual learning in depth estimation, even when the camera intrinsics are unknown.
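As a speculative sketch of a rehearsal-based dual-memory design, one memory can be an episodic replay buffer filled by reservoir sampling, the other a slow exponential-moving-average copy of the model that supplies consistency targets. Names and details below are assumptions for illustration, not the MonoDepthCL release.

```python
# Speculative sketch (not the MonoDepthCL code): dual-memory rehearsal with
# an episodic clip buffer plus a slow EMA "stable" model for consistency.
import copy
import random
import torch

class DualMemory:
    def __init__(self, model: torch.nn.Module, capacity: int = 200, ema: float = 0.999):
        self.buffer = []          # episodic memory: (frame_t, frame_t+1) clips
        self.capacity = capacity
        self.seen = 0
        self.ema = ema
        self.stable = copy.deepcopy(model)   # semantic memory: slow copy of the model
        for p in self.stable.parameters():
            p.requires_grad_(False)

    def store(self, clip):
        """Reservoir sampling keeps the buffer representative of all domains seen."""
        self.seen += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append(clip)
        elif random.random() < self.capacity / self.seen:
            self.buffer[random.randrange(self.capacity)] = clip

    def update_stable(self, model):
        """Slow consolidation: exponential moving average of the working model."""
        with torch.no_grad():
            for ps, pw in zip(self.stable.parameters(), model.parameters()):
                ps.mul_(self.ema).add_(pw, alpha=1 - self.ema)

    def replay(self, k: int = 4):
        return random.sample(self.buffer, min(k, len(self.buffer)))
```

The stable copy's depth predictions on replayed clips can then serve as spatiotemporal consistency targets for the working model, trading off plasticity on new domains against stability on old ones.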