Goto

Collaborating Authors

 Zell, Andreas


FAM-HRI: Foundation-Model Assisted Multi-Modal Human-Robot Interaction Combining Gaze and Speech

arXiv.org Artificial Intelligence

Effective Human-Robot Interaction (HRI) is crucial for enhancing accessibility and usability in real-world robotics applications. However, existing solutions often rely on gestures or language commands, making interaction inefficient and ambiguous, particularly for users with physical impairments. In this paper, we introduce FAM-HRI, an efficient multi-modal framework for human-robot interaction that integrates language and gaze inputs via foundation models. By leveraging lightweight Meta ARIA glasses, our system captures real-time multi-modal signals and utilizes large language models (LLMs) to fuse user intention with scene context, enabling intuitive and precise robot manipulation. Our method accurately determines gaze fixation time interval, reducing noise caused by the gaze dynamic nature. Experimental evaluations demonstrate that FAM-HRI achieves a high success rate in task execution while maintaining a low interaction time, providing a practical solution for individuals with limited physical mobility or motor impairments.


The Role of Tactile Sensing for Learning Reach and Grasp

arXiv.org Artificial Intelligence

Stable and robust robotic grasping is essential for current and future robot applications. In recent works, the use of large datasets and supervised learning has enhanced speed and precision in antipodal grasping. However, these methods struggle with perception and calibration errors due to large planning horizons. To obtain more robust and reactive grasping motions, leveraging reinforcement learning combined with tactile sensing is a promising direction. Yet, there is no systematic evaluation of how the complexity of force-based tactile sensing affects the learning behavior for grasping tasks. This paper compares various tactile and environmental setups using two model-free reinforcement learning approaches for antipodal grasping. Our findings suggest that under imperfect visual perception, various tactile features improve learning outcomes, while complex tactile inputs complicate training.


An Event-Based Perception Pipeline for a Table Tennis Robot

arXiv.org Artificial Intelligence

Table tennis robots gained traction over the last years and have become a popular research challenge for control and perception algorithms. Fast and accurate ball detection is crucial for enabling a robotic arm to rally the ball back successfully. So far, most table tennis robots use conventional, frame-based cameras for the perception pipeline. However, frame-based cameras suffer from motion blur if the frame rate is not high enough for fast-moving objects. Event-based cameras, on the other hand, do not have this drawback since pixels report changes in intensity asynchronously and independently, leading to an event stream with a temporal resolution on the order of us. To the best of our knowledge, we present the first real-time perception pipeline for a table tennis robot that uses only event-based cameras. We show that compared to a frame-based pipeline, event-based perception pipelines have an update rate which is an order of magnitude higher. This is beneficial for the estimation and prediction of the ball's position, velocity, and spin, resulting in lower mean errors and uncertainties. These improvements are an advantage for the robot control, which has to be fast, given the short time a table tennis ball is flying until the robot has to hit back.


Approximate Supervised Object Distance Estimation on Unmanned Surface Vehicles

arXiv.org Artificial Intelligence

Unmanned surface vehicles (USVs) and boats are increasingly important in maritime operations, yet their deployment is limited due to costly sensors and complexity. LiDAR, radar, and depth cameras are either costly, yield sparse point clouds or are noisy, and require extensive calibration. Here, we introduce a novel approach for approximate distance estimation in USVs using supervised object detection. We collected a dataset comprising images with manually annotated bounding boxes and corresponding distance measurements. Leveraging this data, we propose a specialized branch of an object detection model, not only to detect objects but also to predict their distances from the USV. This method offers a cost-efficient and intuitive alternative to conventional distance measurement techniques, aligning more closely with human estimation capabilities. We demonstrate its application in a marine assistance system that alerts operators to nearby objects such as boats, buoys, or other waterborne hazards.


Can Vehicle Motion Planning Generalize to Realistic Long-tail Scenarios?

arXiv.org Artificial Intelligence

Real-world autonomous driving systems must make safe decisions in the face of rare and diverse traffic scenarios. Current state-of-the-art planners are mostly evaluated on real-world datasets like nuScenes (open-loop) or nuPlan (closed-loop). In particular, nuPlan seems to be an expressive evaluation method since it is based on real-world data and closed-loop, yet it mostly covers basic driving scenarios. This makes it difficult to judge a planner's capabilities to generalize to rarely-seen situations. Therefore, we propose a novel closed-loop benchmark interPlan containing several edge cases and challenging driving scenarios. We assess existing state-of-the-art planners on our benchmark and show that neither rule-based nor learning-based planners can safely navigate the interPlan scenarios. A recently evolving direction is the usage of foundation models like large language models (LLM) to handle generalization. We evaluate an LLM-only planner and introduce a novel hybrid planner that combines an LLM-based behavior planner with a rule-based motion planner that achieves state-of-the-art performance on our benchmark.


UAV-Assisted Maritime Search and Rescue: A Holistic Approach

arXiv.org Artificial Intelligence

In this paper, we explore the application of Unmanned Aerial Vehicles (UAVs) in maritime search and rescue (mSAR) missions, focusing on medium-sized fixed-wing drones and quadcopters. We address the challenges and limitations inherent in operating some of the different classes of UAVs, particularly in search operations. Our research includes the development of a comprehensive software framework designed to enhance the efficiency and efficacy of SAR operations. This framework combines preliminary detection onboard UAVs with advanced object detection at ground stations, aiming to reduce visual strain and improve decision-making for operators. It will be made publicly available upon publication. We conduct experiments to evaluate various Region of Interest (RoI) proposal methods, especially by imposing simulated limited bandwidth on them, an important consideration when flying remote or offshore operations. This forces the algorithm to prioritize some predictions over others.


Spiking Neural Networks for Fast-Moving Object Detection on Neuromorphic Hardware Devices Using an Event-Based Camera

arXiv.org Artificial Intelligence

Table tennis is a fast-paced and exhilarating sport that demands agility, precision, and fast reflexes. In recent years, robotic table tennis has become a popular research challenge for robot perception algorithms. Fast and accurate ball detection is crucial for enabling a robotic arm to rally the ball back successfully. Previous approaches have employed conventional frame-based cameras with Convolutional Neural Networks (CNNs) or traditional computer vision methods. In this paper, we propose a novel solution that combines an event-based camera with Spiking Neural Networks (SNNs) for ball detection. We use multiple state-of-the-art SNN frameworks and develop a SNN architecture for each of them, complying with their corresponding constraints. Additionally, we implement the SNN solution across multiple neuromorphic edge devices, conducting comparisons of their accuracies and run-times. This furnishes robotics researchers with a benchmark illustrating the capabilities achievable with each SNN framework and a corresponding neuromorphic edge device. Next to this comparison of SNN solutions for robots, we also show that an SNN on a neuromorphic edge device is able to run in real-time in a closed loop robotic system, a table tennis robot in our use case.


Evaluating UAV Path Planning Algorithms for Realistic Maritime Search and Rescue Missions

arXiv.org Artificial Intelligence

Abstract-- Unmanned Aerial Vehicles (UAVs) are emerging as very important tools in search and rescue (SAR) missions at sea, enabling swift and efficient deployment for locating individuals or vessels in distress. The successful execution of these critical missions heavily relies on effective path planning algorithms that navigate UAVs through complex maritime environments while considering dynamic factors such as water currents and wind flow. Furthermore, they need to account for the uncertainty in search target locations. However, existing path planning methods often fail to address the inherent uncertainty associated with the precise location of search targets and the uncertainty of oceanic forces. In this paper, we develop a framework to develop and investigate trajectory planning algorithms for maritime SAR scenarios employing UAVs. We adopt it to compare multiple planning strategies, some of them used in practical applications by the United States Coast Guard. Furthermore, we propose a novel planner that aims at bridging the gap between computation heavy, precise algorithms and lightweight strategies applicable to real-world scenarios.


A multi-modal table tennis robot system

arXiv.org Artificial Intelligence

Table tennis is a fast-paced and exhilarating sport that demands agility, precision, and lightning-fast reflexes. It is a sport enjoyed by millions of enthusiasts worldwide, ranging from casual players to professional athletes. In recent years, the fusion of technology and sports has led to the development of various training aids and innovations aimed at enhancing the skills of players and fostering their competitive edge. Among these technological advancements, table tennis robots have also emerged. While not yet able to compete with professional players, table tennis robots are an interesting research environment to bring perception and control algorithms towards their limits. Thus, it is not surprising, that more and more research groups use table tennis robots as a test bed for their algorithms [1][2][3][4].


Conditional Unscented Autoencoders for Trajectory Prediction

arXiv.org Artificial Intelligence

The \ac{CVAE} is one of the most widely-used models in trajectory prediction for \ac{AD}. It captures the interplay between a driving context and its ground-truth future into a probabilistic latent space and uses it to produce predictions. In this paper, we challenge key components of the CVAE. We leverage recent advances in the space of the VAE, the foundation of the CVAE, which show that a simple change in the sampling procedure can greatly benefit performance. We find that unscented sampling, which draws samples from any learned distribution in a deterministic manner, can naturally be better suited to trajectory prediction than potentially dangerous random sampling. We go further and offer additional improvements, including a more structured mixture latent space, as well as a novel, potentially more expressive way to do inference with CVAEs. We show wide applicability of our models by evaluating them on the INTERACTION prediction dataset, outperforming the state of the art, as well as at the task of image modeling on the CelebA dataset, outperforming the baseline vanilla CVAE. Code is available at https://github.com/boschresearch/cuae-prediction.