AITopics | Scaramuzza, Davide

Scaramuzza, Davide

Accelerating Model-Based Reinforcement Learning with State-Space World Models

Krinner, Maria, Aljalbout, Elie, Romero, Angel, Scaramuzza, Davide

arXiv.org Machine Learning, Feb-27-2025

Reinforcement learning (RL) is a powerful approach for robot learning. However, model-free RL (MFRL) requires a large number of environment interactions to learn successful control policies. This is due to the noisy RL training updates and the complexity of robotic systems, which typically involve highly non-linear dynamics and noisy sensor signals. In contrast, model-based RL (MBRL) not only trains a policy but simultaneously learns a world model that captures the environment's dynamics and rewards. The world model can either be used for planning, for data collection, or to provide first-order policy gradients for training. Leveraging a world model significantly improves sample efficiency compared to model-free RL. However, training a world model alongside the policy increases the computational complexity, leading to longer training times that are often intractable for complex real-world scenarios. In this work, we propose a new method for accelerating model-based RL using state-space world models. Our approach leverages state-space models (SSMs) to parallelize the training of the dynamics model, which is typically the main computational bottleneck. Additionally, we propose an architecture that provides privileged information to the world model during training, which is particularly relevant for partially observable environments. We evaluate our method in several real-world agile quadrotor flight tasks, involving complex dynamics, for both fully and partially observable environments. We demonstrate a significant speedup, reducing the world model training time by up to 10 times, and the overall MBRL training time by up to 4 times. This benefit comes without compromising performance, as our method achieves similar sample efficiency and task rewards to state-of-the-art MBRL methods.
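To illustrate why a state-space formulation allows training to be parallelized over time, here is a minimal sketch; it is not the authors' implementation. It assumes a scalar linear recurrence h_t = a_t * h_{t-1} + b_t with a zero initial state, and the function names are hypothetical. Because composing two such steps is associative, all hidden states can be obtained in roughly log2(T) combining rounds instead of T sequential steps.

```python
def sequential_scan(a, b):
    """Reference: evaluate h_t = a_t * h_{t-1} + b_t one step at a time."""
    h, out = 0.0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out


def parallel_style_scan(a, b):
    """Same recurrence via a Hillis-Steele inclusive scan.

    Each element (A, B) stands for the affine map h -> A * h + B.
    Every round combines element i with element i - d. Updates within a
    round are independent, so an accelerator could run them in parallel.
    """
    A, B = list(a), list(b)
    n, d = len(A), 1
    while d < n:
        # Build new lists from the old ones so a round never reads
        # values that it has already overwritten.
        new_A = A[:d] + [A[i - d] * A[i] for i in range(d, n)]
        new_B = B[:d] + [A[i] * B[i - d] + B[i] for i in range(d, n)]
        A, B = new_A, new_B
        d *= 2
    return B  # with a zero initial state, the offset term is the hidden state


a = [0.9, 0.5, 1.1, 0.7, 0.3]      # illustrative decay terms
b = [1.0, -2.0, 0.5, 3.0, 0.25]    # illustrative input terms
reference = sequential_scan(a, b)
scanned = parallel_style_scan(a, b)
assert all(abs(x - y) < 1e-12 for x, y in zip(reference, scanned))
```

The pure-Python loop only mimics the data flow; an actual speedup would require a vectorized or GPU scan primitive.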

machine learning, reinforcement learning, world model, (14 more...)

arXiv:2502.20168

Country:

North America > United States > Pennsylvania (0.14)
Europe > Switzerland > Zürich > Zürich (0.14)

Genre: Research Report (1.00)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

LiDAR Registration with Visual Foundation Models

Vödisch, Niclas, Cioffi, Giovanni, Cannici, Marco, Burgard, Wolfram, Scaramuzza, Davide

arXiv.org Artificial Intelligence, Feb-26-2025

Affiliations: Niclas Vödisch (University of Freiburg, University of Zurich), Giovanni Cioffi (University of Zurich), Marco Cannici (University of Zurich), Wolfram Burgard (University of Technology Nuremberg), Davide Scaramuzza (University of Zurich).

LiDAR registration is a fundamental task in robotic mapping and localization. A critical component of aligning two point clouds is identifying robust point correspondences using point descriptors. This step becomes particularly challenging in scenarios involving domain shifts, seasonal changes, and variations in point cloud structures. In this paper, we address these problems by proposing to use DINOv2 features, obtained from surround-view images, as point descriptors. We demonstrate that coupling these descriptors with traditional registration algorithms, such as RANSAC or ICP, facilitates robust 6DoF alignment of LiDAR scans with 3D maps, even when the map was recorded more than a year before. Although conceptually straightforward, our method substantially outperforms more complex baseline techniques. In contrast to previous learning-based point descriptors, our method does not require domain-specific retraining and is agnostic to the point cloud structure, effectively handling both sparse LiDAR scans and dense 3D maps. We show that leveraging the additional camera data enables our method to outperform the best baseline by +24.8 and +17.3 registration recall on the NCLT and Oxford RobotCar datasets. We publicly release the registration benchmark and the code of our work on https://vfm-registration.cs.uni-freiburg.de.

Aligning two point clouds to compute their relative 3D transformation is a critical task in numerous robotic applications, including LiDAR odometry [30], loop closure registration [2], and map-based localization [19]. In this work, we specifically discuss map-based localization, which not only generalizes the other aforementioned tasks but is also critical for improving the efficiency and autonomy of mobile robots in environments where pre-existing map data is available.
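The correspondence step that precedes RANSAC or ICP can be sketched as a descriptor search. The example below is an illustration under stated assumptions, not the released code: it uses toy 3-dimensional vectors in place of DINOv2 features, cosine similarity as the matching score, and a mutual-nearest-neighbor filter. The function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero descriptor vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def mutual_matches(scan_desc, map_desc):
    """Return (scan_index, map_index) pairs that are each other's best match."""
    best_for_scan = [
        max(range(len(map_desc)), key=lambda j: cosine(d, map_desc[j]))
        for d in scan_desc
    ]
    best_for_map = [
        max(range(len(scan_desc)), key=lambda i: cosine(scan_desc[i], d))
        for d in map_desc
    ]
    # Keep a pair only if the preference holds in both directions.
    return [(i, j) for i, j in enumerate(best_for_scan) if best_for_map[j] == i]


# Toy descriptors (assumed values, one vector per point).
scan_descriptors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
map_descriptors = [[0.0, 0.9, 0.1], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
print(mutual_matches(scan_descriptors, map_descriptors))  # [(0, 1), (1, 0)]
```

The surviving pairs would then be handed to a robust estimator such as RANSAC, which is what rejects the remaining outliers.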

descriptor, machine learning, natural language, (16 more...)

arXiv:2502.19374

Country:

Europe > Germany > Baden-Württemberg > Freiburg (0.44)
Europe > Germany > Bavaria > Middle Franconia > Nuremberg (0.24)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

A Monocular Event-Camera Motion Capture System

Bauersfeld, Leonard, Scaramuzza, Davide

arXiv.org Artificial Intelligence, Feb-17-2025

Motion capture systems are a widespread tool in research to record ground-truth poses of objects. Commercial systems use reflective markers attached to the object and then triangulate the pose of the object from multiple camera views. Consequently, the object must be visible to multiple cameras, which makes such multi-view motion capture systems unsuited for deployments in narrow, confined spaces (e.g., ballast tanks of ships). In this technical report, we describe a monocular event-camera motion capture system that overcomes this limitation and is ideally suited for narrow spaces. Instead of passive markers, it relies on active, blinking LED markers such that each marker can be uniquely identified from its blinking frequency. The markers are placed at known locations on the tracked object. We then solve the PnP (perspective-n-point) problem to obtain the position and orientation of the object. The developed system has millimeter accuracy and millisecond latency, and we demonstrate that its state estimate can be used to fly a small, agile quadrotor.
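Identifying a marker from its blinking frequency can be sketched as follows. This is a simplified illustration, not the pipeline from the report: it assumes exactly one ON event per blink period at a given pixel, timestamps in seconds, and a hypothetical table of nominal marker frequencies.

```python
def estimate_frequency(on_timestamps):
    """Estimate the blink frequency in Hz from ON-event timestamps (seconds)."""
    gaps = sorted(t1 - t0 for t0, t1 in zip(on_timestamps, on_timestamps[1:]))
    median_gap = gaps[len(gaps) // 2]  # the median tolerates a few spurious gaps
    return 1.0 / median_gap


def identify_marker(on_timestamps, marker_frequencies):
    """Return the ID of the marker whose nominal frequency is closest."""
    measured = estimate_frequency(on_timestamps)
    return min(marker_frequencies,
               key=lambda name: abs(marker_frequencies[name] - measured))


# Hypothetical marker table and a synthetic event train at 1500 Hz.
markers = {"led_a": 1000.0, "led_b": 1500.0, "led_c": 2000.0}
events = [k / 1500.0 for k in range(10)]
assert identify_marker(events, markers) == "led_b"
```

Once each detected blob carries a marker ID, the 2D detections and the known 3D marker positions form the input to a PnP solver.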

artificial intelligence, led, video understanding, (18 more...)

arXiv:2502.12113

Country:

North America > United States (0.46)
Europe > Switzerland (0.28)

Genre: Research Report (0.40)

Industry:

Media > Photography (0.72)
Media > Television (0.62)
Media > Film (0.62)

Technology:

Information Technology > Artificial Intelligence > Vision > Video Understanding (1.00)
Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles > Drones (0.46)

Dream to Drive: Model-Based Vehicle Control Using Analytic World Models

Nachkov, Asen, Paudel, Danda Pani, Zaech, Jan-Nico, Scaramuzza, Davide, Van Gool, Luc

arXiv.org Artificial Intelligence, Feb-14-2025

Differentiable simulators have recently shown great promise for training autonomous vehicle controllers. Because one can backpropagate through them, they can be placed into an end-to-end training loop where their known dynamics turn into useful priors for the policy to learn, removing the typical black-box assumption about the environment. So far, these systems have only been used to train policies. However, this is not the end of the story in terms of what they can offer. Here, for the first time, we use them to train world models. Specifically, we present three new task setups that allow us to learn next state predictors, optimal planners, and optimal inverse states. Unlike analytic policy gradients (APG), which require the gradient of the next simulator state with respect to the current actions, our proposed setups rely on the gradient of the next state with respect to the current state. We call this approach Analytic World Models (AWMs) and showcase its applications, including how to use it for planning in the Waymax simulator. Apart from pushing the limits of what is possible with such simulators, we offer an improved training recipe that increases performance on the large-scale Waymo Open Motion dataset by up to 12% compared to baselines at essentially no additional cost.

artificial intelligence, simulator, trajectory, (16 more...)

arXiv:2502.10012

Country: Europe > Switzerland (0.28)

Genre: Research Report (0.64)

Industry:

Transportation (0.67)
Energy (0.47)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.82)
Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (0.66)

Dream to Fly: Model-Based Reinforcement Learning for Vision-Based Drone Flight

Romero, Angel, Shenai, Ashwin, Geles, Ismail, Aljalbout, Elie, Scaramuzza, Davide

arXiv.org Artificial Intelligence, Jan-24-2025

Autonomous drone racing has risen as a challenging robotic benchmark for testing the limits of learning, perception, planning, and control. Expert human pilots are able to agilely fly a drone through a race track by mapping the real-time feed from a single onboard camera directly to control commands. Recent works in autonomous drone racing attempting direct pixel-to-commands control policies (without explicit state estimation) have relied on either intermediate representations that simplify the observation space or performed extensive bootstrapping using Imitation Learning (IL). This paper introduces an approach that learns policies from scratch, allowing a quadrotor to autonomously navigate a race track by directly mapping raw onboard camera pixels to control commands, just as human pilots do. By leveraging model-based reinforcement learning (RL), specifically DreamerV3, we train visuomotor policies capable of agile flight through a race track using only raw pixel observations. While model-free RL methods such as PPO struggle to learn under these conditions, DreamerV3 efficiently acquires complex visuomotor behaviors. Moreover, because our policies learn directly from pixel inputs, the perception-aware reward term employed in previous RL approaches to guide the training process is no longer needed. Our experiments demonstrate in both simulation and real-world flight how the proposed approach can be deployed on agile quadrotors. This approach advances the frontier of vision-based autonomous flight and shows that model-based RL is a promising direction for real-world robotics.

artificial intelligence, learning, machine learning, (15 more...)

arXiv:2501.14377

Country:

North America > United States > California (0.14)
Europe > Switzerland > Zürich > Zürich (0.14)

Genre: Research Report > New Finding (0.46)

Industry:

Transportation > Air (1.00)
Leisure & Entertainment (1.00)
Information Technology > Robotics & Automation (0.90)

Technology:

Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles > Drones (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

GG-SSMs: Graph-Generating State Space Models

Zubić, Nikola, Scaramuzza, Davide

arXiv.org Artificial Intelligence, Dec-16-2024

State Space Models (SSMs) are powerful tools for modeling sequential data in computer vision and time series analysis domains. However, traditional SSMs are limited by fixed, one-dimensional sequential processing, which restricts their ability to model non-local interactions in high-dimensional data. While methods like Mamba and VMamba introduce selective and flexible scanning strategies, they rely on predetermined paths, which fail to efficiently capture complex dependencies. We introduce Graph-Generating State Space Models (GG-SSMs), a novel framework that overcomes these limitations by dynamically constructing graphs based on feature relationships. Using Chazelle's Minimum Spanning Tree algorithm, GG-SSMs adapt to the inherent data structure, enabling robust feature propagation across dynamically generated graphs and efficiently modeling complex dependencies. We validate GG-SSMs on 11 diverse datasets, including event-based eye-tracking, ImageNet classification, optical flow estimation, and six time series datasets. GG-SSMs achieve state-of-the-art performance across all tasks, surpassing existing methods by significant margins. Specifically, GG-SSM attains a top-1 accuracy of 84.9% on ImageNet, outperforming prior SSMs by 1%, reduces the KITTI-15 error rate to 2.77%, and improves eye-tracking detection rates by up to 0.33% with fewer parameters. These results demonstrate that dynamic scanning based on feature relationships significantly improves SSMs' representational power and efficiency, offering a versatile tool for various applications in computer vision and beyond.
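The idea of deriving a scan graph from feature relationships can be illustrated with a small minimum spanning tree example. Note the substitution: the abstract names Chazelle's algorithm, whereas this sketch uses Kruskal's algorithm for brevity; on graphs with distinct edge weights both produce the same tree, with different running-time guarantees. The feature vectors, the squared Euclidean edge weight, and the function name are assumptions made for illustration.

```python
from itertools import combinations


def minimum_spanning_tree(features):
    """Kruskal's algorithm on the complete graph over feature vectors.

    Edge weight is the squared Euclidean distance between two features.
    Returns a sorted list of (i, j) edges with i < j.
    """
    def weight(i, j):
        return sum((x - y) ** 2 for x, y in zip(features[i], features[j]))

    parent = list(range(len(features)))  # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    tree = []
    edges = sorted(combinations(range(len(features)), 2),
                   key=lambda e: weight(*e))
    for i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:       # the edge joins two separate components
            parent[root_i] = root_j
            tree.append((i, j))
    return sorted(tree)


feats = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.5), (5.0, 5.0)]
print(minimum_spanning_tree(feats))  # [(0, 1), (1, 2), (2, 3)]
```

In this toy case, similar features end up adjacent in the tree, so state propagated along tree edges follows feature similarity rather than a fixed raster order.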

artificial intelligence, deep learning, machine learning, (16 more...)

arXiv:2412.12423

Country:

North America > United States (0.28)
Europe > Switzerland (0.28)

Genre: Research Report > New Finding (0.66)

Industry: Energy (0.68)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Multi-Task Reinforcement Learning for Quadrotors

Xing, Jiaxu, Geles, Ismail, Song, Yunlong, Aljalbout, Elie, Scaramuzza, Davide

arXiv.org Artificial Intelligence, Dec-16-2024

Reinforcement learning (RL) has shown great effectiveness in quadrotor control, enabling specialized policies to develop even human-champion-level performance in single-task scenarios. To address this limitation, this paper presents a novel multi-task reinforcement learning (MTRL) framework tailored for quadrotor control, leveraging the shared physical dynamics of the platform to enhance sample efficiency and task performance. By employing a multi-critic architecture and shared task encoders, our framework facilitates knowledge transfer across tasks, enabling a single policy to execute diverse maneuvers, including high-speed stabilization, velocity tracking, and autonomous racing. Our experimental results, validated both in simulation and real-world scenarios, demonstrate that our framework outperforms baseline approaches in terms of sample efficiency and overall task performance.

Real-world quadrotor applications typically involve multiple tasks and skills.

artificial intelligence, machine learning, reinforcement learning, (15 more...)

arXiv:2412.12442

Country: Europe > Switzerland (0.28)

Genre: Research Report (0.50)

Industry: Energy (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Drift-free Visual SLAM using Digital Twins

Merat, Roxane, Cioffi, Giovanni, Bauersfeld, Leonard, Scaramuzza, Davide

arXiv.org Artificial Intelligence, Dec-12-2024

Globally-consistent localization in urban environments is crucial for autonomous systems such as self-driving vehicles and drones, as well as assistive technologies for visually impaired people. Traditional Visual-Inertial Odometry (VIO) and Visual Simultaneous Localization and Mapping (VSLAM) methods, though adequate for local pose estimation, suffer from drift in the long term due to reliance on local sensor data. While GPS counteracts this drift, it is unavailable indoors and often unreliable in urban areas. An alternative is to localize the camera to an existing 3D map using visual-feature matching. This can provide centimeter-level accurate localization but is limited by the visual similarities between the current view and the map. This paper introduces a novel approach that achieves accurate and globally-consistent localization by aligning the sparse 3D point cloud generated by the VIO/VSLAM system to a digital twin using point-to-plane matching; no visual data association is needed. The proposed method provides a 6-DoF global measurement tightly integrated into the VIO/VSLAM system. Experiments run on a high-fidelity GPS simulator and real-world data collected from a drone demonstrate that our approach outperforms state-of-the-art VIO-GPS systems and offers superior robustness against viewpoint changes compared to the state-of-the-art Visual SLAM systems.
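Point-to-plane matching, mentioned above, scores an alignment by how far each point lies from its matched surface along the surface normal. The sketch below shows only that cost term and is not the paper's system: it assumes correspondences are already given, normals have unit length, and the helper names are hypothetical.

```python
def point_to_plane_residuals(points, plane_points, plane_normals):
    """Signed distance of each point p to its matched plane: n . (p - q).

    q is any point on the plane and n is the plane's unit normal.
    """
    return [
        sum(n_k * (p_k - q_k) for n_k, p_k, q_k in zip(n, p, q))
        for p, q, n in zip(points, plane_points, plane_normals)
    ]


def point_to_plane_cost(points, plane_points, plane_normals):
    """Sum of squared residuals, the quantity an aligner would minimize."""
    residuals = point_to_plane_residuals(points, plane_points, plane_normals)
    return sum(r * r for r in residuals)


# Two sparse map points matched to a horizontal plane (z = 1) and a wall (x = 2).
sparse_points = [(0.0, 0.0, 1.2), (2.0, 0.5, 0.0)]
twin_points = [(0.0, 0.0, 1.0), (2.0, 0.0, 0.0)]
twin_normals = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]
residuals = point_to_plane_residuals(sparse_points, twin_points, twin_normals)
assert abs(residuals[0] - 0.2) < 1e-9 and abs(residuals[1]) < 1e-9
```

The second point slides along the wall without penalty, which is why this cost suits planar models such as building facades: only displacement perpendicular to the surface is constrained.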

artificial intelligence, global measurement, point cloud, (14 more...)

arXiv:2412.08496

Country: Europe > Switzerland (0.29)

Genre: Research Report (0.84)

Industry: Health & Medicine (0.34)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles > Drones (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Student-Informed Teacher Training

Messikommer, Nico, Xing, Jiaxu, Aljalbout, Elie, Scaramuzza, Davide

arXiv.org Artificial Intelligence, Dec-12-2024

Imitation learning with a privileged teacher has proven effective for learning complex control behaviors from high-dimensional inputs, such as images. In this framework, a teacher is trained with privileged task information, while a student tries to predict the actions of the teacher with more limited observations. For example, in a robot navigation task, the teacher might have access to distances to nearby obstacles, while the student only receives visual observations of the scene. However, privileged imitation learning faces a key challenge: the student might be unable to imitate the teacher's behavior due to partial observability. This problem arises because the teacher is trained without considering whether the student is capable of imitating the learned behavior. To address this teacher-student asymmetry, we propose a framework for joint training of the teacher and student policies, encouraging the teacher to learn behaviors that can be imitated by the student despite the latter's limited access to information and its partial observability. Based on the performance bound in imitation learning, we add (i) the approximated action difference between teacher and student as a penalty term to the reward function of the teacher, and (ii) a supervised teacher-student alignment step. We motivate our method with a maze navigation task and demonstrate its effectiveness on complex vision-based quadrotor flight and manipulation tasks.

In reinforcement learning (RL), an agent learns to perform a task by interacting with its environment and maximizing the cumulative rewards gained through these interactions. However, this process requires extensive exploration, as the agent must avoid getting trapped in local minima, often resulting in a large number of environment interactions (Pathak et al., 2017). The number of interactions is even further increased when the agent processes high-dimensional data as input (Ota et al., 2020). Using such observations, the policy must learn to extract a notion of the agent's state, a process that is computationally expensive when optimized solely through RL.

Figure caption: Our method leverages three networks (a), which are trained in three alternating phases: the roll-out phase (b), the policy update phase (c), and the alignment phase (d). The grey boxes represent networks frozen during the specific phase and the dashed arrows indicate the gradient flow.

This work was supported by the European Research Council (ERC) under grant agreement No. 864042 (AGILEFLIGHT).
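The penalty term (i) from the abstract can be sketched as a shaped reward for the teacher. This is a minimal illustration, not the authors' formulation: the Euclidean norm, the fixed weight, and the function name are assumptions, and the paper describes the action difference as approximated rather than computed exactly.

```python
def shaped_teacher_reward(task_reward, teacher_action, student_action,
                          penalty_weight=0.1):
    """Task reward minus a penalty on the teacher-student action difference.

    penalty_weight is a hypothetical hyperparameter; larger values push the
    teacher toward behaviors that the student can reproduce.
    """
    difference = sum((a - b) ** 2
                     for a, b in zip(teacher_action, student_action)) ** 0.5
    return task_reward - penalty_weight * difference


# The teacher acts on privileged state; the student, seeing less, disagrees.
reward = shaped_teacher_reward(1.0, (1.0, 0.0), (0.4, 0.8))
assert abs(reward - 0.9) < 1e-9  # action difference has norm 1.0
```

When the two policies agree, the penalty vanishes and the teacher optimizes the plain task reward.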

artificial intelligence, machine learning, reinforcement learning, (16 more...)

arXiv:2412.09149

Country: Europe > Switzerland (0.28)

Genre: Research Report > New Finding (0.93)

Industry:

Leisure & Entertainment > Games (0.46)
Education > Teacher Education (0.41)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)

Monocular Event-Based Vision for Obstacle Avoidance with a Quadrotor

Bhattacharya, Anish, Cannici, Marco, Rao, Nishanth, Tao, Yuezhan, Kumar, Vijay, Matni, Nikolai, Scaramuzza, Davide

arXiv.org Artificial Intelligence, Nov-5-2024

We present the first static-obstacle avoidance method for quadrotors using just an onboard, monocular event camera. Quadrotors are capable of fast and agile flight in cluttered environments when piloted manually, but vision-based autonomous flight in unknown environments is difficult in part due to the sensor limitations of traditional onboard cameras. Event cameras, however, promise nearly zero motion blur and high dynamic range, but produce a very large volume of events under significant ego-motion and further lack a continuous-time sensor model in simulation, making direct sim-to-real transfer impossible. By leveraging depth prediction as a pretext task in our learning framework, we can pre-train a reactive obstacle avoidance events-to-control policy with approximated, simulated events and then fine-tune the perception component with limited events-and-depth real-world data to achieve obstacle avoidance in indoor and outdoor settings. We demonstrate this across two quadrotor-event camera platforms in multiple settings and find, contrary to traditional vision-based works, that low speeds (1 m/s) make the task harder and more prone to collisions, while high speeds (5 m/s) result in better event-based depth estimation and avoidance. We also find that success rates in outdoor scenes can be significantly higher than in certain indoor scenes.

artificial intelligence, event camera, machine learning, (18 more...)

arXiv:2411.03303

Country: Europe (0.28)

Genre: Research Report (0.64)

Industry:

Information Technology > Robotics & Automation (0.66)
Transportation > Air (0.48)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (1.00)
Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles > Drones (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
