Kragic, Danica
Grasping a Handful: Sequential Multi-Object Dexterous Grasp Generation
Lu, Haofei, Dong, Yifei, Weng, Zehang, Lundell, Jens, Kragic, Danica
We introduce the sequential multi-object robotic grasp sampling algorithm SeqGrasp, which robustly synthesizes stable grasps on diverse objects using only a subset of the robotic hand's Degrees of Freedom (DoF) at a time. We use SeqGrasp to construct the large-scale Allegro Hand sequential grasping dataset SeqDataset and use it to train the diffusion-based sequential grasp generator SeqDiffuser. We experimentally evaluate SeqGrasp and SeqDiffuser against the state-of-the-art non-sequential multi-object grasp generation method MultiGrasp in simulation and on a real robot. Furthermore, SeqDiffuser is approximately 1000 times faster at generating grasps than SeqGrasp and MultiGrasp. Generation of dexterous grasps has been studied for a long time, both from a technical perspective on generating grasps with robots [1]-[11] and from the perspective of understanding human grasping [12]-[15]. Most of these methods rely on bringing the robotic hand close to the object and then simultaneously enveloping it with all fingers. While this strategy often results in efficient and successful grasp generation, it simplifies dexterous grasping to resemble parallel-jaw grasping, thereby underutilizing the many DoF of multi-fingered robotic hands [10]. In contrast, grasping multiple objects with a robotic hand, particularly in a sequential manner that mirrors human-like dexterity, as shown in Figure 1, remains an unsolved problem. In this work, we introduce SeqGrasp, a novel hand-agnostic algorithm for generating sequential multi-object grasps.
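As a rough illustration of the diffusion-based generation step (not the paper's implementation), the following Python sketch shows DDPM-style reverse sampling over a vector of hand joint angles; the denoiser is a hypothetical placeholder for a trained network conditioned on the scene and on previously placed fingers.

import numpy as np

# Minimal DDPM-style reverse sampling loop over a hand joint configuration.
# The denoiser below is a hypothetical placeholder standing in for a trained
# network conditioned on the object and previously placed fingers.
N_DOF = 16          # e.g., Allegro Hand joint angles
T = 50              # number of diffusion steps
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def denoiser(q_t, t, context):
    """Hypothetical noise predictor eps_theta(q_t, t | context)."""
    return np.zeros_like(q_t)  # placeholder: a trained model would go here

def sample_grasp(context, rng=np.random.default_rng(0)):
    q = rng.standard_normal(N_DOF)          # start from Gaussian noise
    for t in reversed(range(T)):
        eps = denoiser(q, t, context)
        # posterior mean of the reverse diffusion step
        q = (q - betas[t] / np.sqrt(1 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            q += np.sqrt(betas[t]) * rng.standard_normal(N_DOF)
    return q  # candidate joint configuration for the next grasp in the sequence

grasp = sample_grasp(context={"object_points": None})
print(grasp.shape)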
Pushing Everything Everywhere All At Once: Probabilistic Prehensile Pushing
Perugini, Patrizio, Lundell, Jens, Friedl, Katharina, Kragic, Danica
We address prehensile pushing, the problem of manipulating a grasped object by pushing it against the environment. Our solution is an efficient nonlinear trajectory optimization obtained by relaxing an exact mixed-integer nonlinear trajectory optimization formulation. The critical insight is recasting the external pushers (environment) as a discrete probability distribution instead of binary variables and minimizing the entropy of that distribution. The probabilistic reformulation allows all pushers to be used simultaneously, but at the optimum the probability mass concentrates onto one due to the entropy minimization. We numerically compare our method against a state-of-the-art sampling-based baseline on a prehensile pushing task. The results demonstrate that our method finds trajectories 8 times faster and at a 20 times lower cost than the baseline. Finally, we demonstrate that a simulated and a real Franka Panda robot can successfully manipulate different objects following the trajectories proposed by our method. Supplementary materials are available at https://probabilistic-prehensile-pushing.github.io/.
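A toy Python sketch of the core relaxation idea, under simplified assumptions (a single decision over a handful of hypothetical pushers rather than a full trajectory): the binary pusher-selection variable is replaced by a softmax-parameterized distribution, and an entropy penalty drives the probability mass onto a single pusher at the optimum.

import numpy as np

pusher_cost = np.array([3.0, 1.2, 2.5, 4.0])   # hypothetical per-pusher costs
lam = 0.5                                      # entropy penalty weight
theta = np.zeros_like(pusher_cost)             # softmax logits over pushers

def softmax(t):
    e = np.exp(t - t.max())
    return e / e.sum()

for _ in range(500):
    p = softmax(theta)
    # gradient of  E_p[cost] + lam * H(p)  with respect to the logits
    grad_p = pusher_cost - lam * (np.log(p + 1e-12) + 1.0)
    theta -= 0.5 * p * (grad_p - p @ grad_p)

print(np.round(softmax(theta), 3))   # mass concentrates on the cheapest pusher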
FLAME: A Federated Learning Benchmark for Robotic Manipulation
Betran, Santiago Bou, Longhini, Alberta, Vasco, Miguel, Zhang, Yuchong, Kragic, Danica
Recent progress in robotic manipulation has been fueled by large-scale datasets collected across diverse environments. Training robotic manipulation policies on these datasets is traditionally performed in a centralized manner, raising concerns regarding scalability, adaptability, and data privacy. While federated learning enables decentralized, privacy-preserving training, its application to robotic manipulation remains largely unexplored. We introduce FLAME (Federated Learning Across Manipulation Environments), the first benchmark designed for federated learning in robotic manipulation. FLAME consists of: (i) a set of large-scale datasets of over 160,000 expert demonstrations of multiple manipulation tasks, collected across a wide range of simulated environments; (ii) a training and evaluation framework for robotic policy learning in a federated setting. We evaluate standard federated learning algorithms in FLAME, showing their potential for distributed policy learning and highlighting key challenges. Our benchmark establishes a foundation for scalable, adaptive, and privacy-aware robotic learning.
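As an illustration of the kind of standard federated learning algorithm such a benchmark evaluates, here is a minimal Federated Averaging (FedAvg) sketch; the linear "policy", synthetic client data, and function names are assumptions for illustration, not FLAME's actual API.

import numpy as np

def local_update(weights, data, targets, lr=0.1, epochs=5):
    # Plain gradient descent on a quadratic surrogate loss at one client.
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * data.T @ (data @ w - targets) / len(targets)
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
n_clients, dim = 4, 8
global_w = np.zeros(dim)
clients = [(rng.standard_normal((50, dim)), rng.standard_normal(50))
           for _ in range(n_clients)]

for communication_round in range(20):
    local_ws = [local_update(global_w, X, y) for X, y in clients]
    sizes = np.array([len(y) for _, y in clients], dtype=float)
    # weighted average of client models, proportional to local dataset size
    global_w = np.average(local_ws, axis=0, weights=sizes)

print(global_w.round(3))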
S$^2$-Diffusion: Generalizing from Instance-level to Category-level Skills in Robot Manipulation
Yang, Quantao, Welle, Michael C., Kragic, Danica, Andersson, Olov
Recent advances in skill learning have propelled robot manipulation to new heights by enabling robots to learn complex manipulation tasks from a practical number of demonstrations. However, these skills are often limited to the particular action, object, and environment \textit{instances} shown in the training data and have trouble transferring to other instances of the same category. In this work we present an open-vocabulary Spatial-Semantic Diffusion policy (S$^2$-Diffusion) that enables generalization from instance-level training data to the category level, making skills transferable between instances of the same category. We show that functional aspects of skills can be captured via a promptable semantic module combined with a spatial representation. We further propose leveraging depth estimation networks to allow the use of only a single RGB camera. Our approach is evaluated and compared on a diverse set of robot manipulation tasks, both in simulation and in the real world. Our results show that S$^2$-Diffusion is invariant to changes in category-irrelevant factors and achieves satisfactory performance on other instances within the same category, even when not trained on those specific instances. Full videos of all real-world experiments are available in the supplementary material.
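A minimal sketch of the general idea of combining a promptable 2D semantic mask with estimated depth to obtain a 3D spatial-semantic representation from a single RGB camera; the mask, depth map, and camera intrinsics below are synthetic stand-ins, not the paper's modules.

import numpy as np

H, W = 48, 64
fx = fy = 60.0                       # hypothetical camera intrinsics
cx, cy = W / 2.0, H / 2.0

depth = np.full((H, W), 0.8)         # metres, stand-in for a depth network output
mask = np.zeros((H, W), dtype=bool)  # stand-in for a text-prompted segmentation
mask[20:30, 25:40] = True            # pixels belonging to the prompted category

# Back-project masked pixels into camera-frame 3D points.
v, u = np.nonzero(mask)
z = depth[v, u]
x = (u - cx) / fx * z
y = (v - cy) / fy * z
points = np.stack([x, y, z], axis=1)  # (N, 3) spatial-semantic point set

print(points.shape, points.mean(axis=0).round(3))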
Early Detection of Human Handover Intentions in Human-Robot Collaboration: Comparing EEG, Gaze, and Hand Motion
Khanna, Parag, Rajabi, Nona, Kanik, Sumeyra U. Demir, Kragic, Danica, Björkman, Mårten, Smith, Christian
Human-robot collaboration (HRC) relies on accurate and timely recognition of human intentions to ensure seamless interactions. Among common HRC tasks, human-to-robot object handovers have been studied extensively for planning the robot's actions during object reception, under the assumption that the human intends a handover. However, distinguishing handover intentions from other actions has received limited attention. Most research on handovers has focused on visually detecting motion trajectories, which often results in delays or false detections when trajectories overlap. This paper investigates whether human intentions for object handovers are reflected in non-movement-based physiological signals. We conduct a multimodal analysis comparing three data modalities: electroencephalogram (EEG), gaze, and hand-motion signals. Our study aims to distinguish between handover-intended and non-handover motions in an HRC setting, evaluating each modality's performance in predicting and classifying these actions before and after human movement initiation. We develop and evaluate human intention detectors based on these modalities, comparing their accuracy and timing in identifying handover intentions. To the best of our knowledge, this is the first study to systematically develop and test intention detectors across multiple modalities within the same experimental context of human-robot handovers. Our analysis reveals that handover intention can be detected from all three modalities; nevertheless, gaze signals provide both the earliest and the most accurate classification of whether a motion is intended as a handover.
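A minimal sketch of a per-modality intention detector of the kind described above: a classifier over fixed-length feature windows labelled handover vs. non-handover. The synthetic features and the logistic-regression choice are assumptions for illustration, not the study's actual EEG/gaze/motion pipelines.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_per_class, n_features = 200, 12
# Synthetic window-level features for handover vs. non-handover motions.
handover = rng.normal(0.5, 1.0, size=(n_per_class, n_features))
non_handover = rng.normal(-0.5, 1.0, size=(n_per_class, n_features))
X = np.vstack([handover, non_handover])
y = np.concatenate([np.ones(n_per_class), np.zeros(n_per_class)])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"window-level accuracy: {clf.score(X_te, y_te):.2f}")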
LLM-Driven Augmented Reality Puppeteer: Controller-Free Voice-Commanded Robot Teleoperation
Zhang, Yuchong, Orthmann, Bastian, Welle, Michael C., Van Haastregt, Jonne, Kragic, Danica
The integration of robotics and augmented reality (AR) presents transformative opportunities for advancing human-robot interaction (HRI) by improving usability, intuitiveness, and accessibility. This work introduces a controller-free, LLM-driven voice-commanded AR puppeteering system, enabling users to teleoperate a robot by manipulating its virtual counterpart in real time. By leveraging natural language processing (NLP) and AR technologies, our system -- prototyped using Meta Quest 3 -- eliminates the need for physical controllers, enhancing ease of use while minimizing potential safety risks associated with direct robot operation. A preliminary user demonstration successfully validated the system's functionality, demonstrating its potential for safer, more intuitive, and immersive robotic control.
Humans Co-exist, So Must Embodied Artificial Agents
Kuehn, Hannah, La Delfa, Joseph, Vasco, Miguel, Kragic, Danica, Leite, Iolanda
Modern embodied artificial agents excel in static, predefined tasks but fall short in dynamic and long-term interactions with humans. Humans, on the other hand, can adapt and evolve continuously, exploiting the situated knowledge embedded in their environment and other agents, thus contributing to meaningful interactions. We introduce the concept of co-existence for embodied artificial agents and argue that it is a prerequisite for meaningful, long-term interaction with humans. We take inspiration from biology and design theory to understand how human and non-human organisms foster entities that co-exist within their specific niches. Finally, we propose key research directions for the machine learning community to foster co-existing embodied agents, focusing on the principles, hardware, and learning methods responsible for shaping them.
Human-Aligned Image Models Improve Visual Decoding from the Brain
Rajabi, Nona, Ribeiro, Antônio H., Vasco, Miguel, Taleb, Farzaneh, Björkman, Mårten, Kragic, Danica
Decoding visual images from brain activity has significant potential for advancing brain-computer interaction and enhancing the understanding of human perception. Recent approaches align the representation spaces of images and brain activity to enable visual decoding. In this paper, we introduce the use of human-aligned image encoders to map brain signals to images. We hypothesize that these models more effectively capture perceptual attributes associated with the rapid visual stimuli presentations commonly used in visual brain data recording experiments. Our empirical results support this hypothesis, demonstrating that this simple modification improves image retrieval accuracy by up to 21% compared to state-of-the-art methods. Comprehensive experiments confirm consistent performance improvements across diverse EEG architectures, image encoders, alignment methods, participants, and brain imaging modalities.
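A minimal sketch of retrieval in an aligned brain-image embedding space: rank candidate image embeddings by cosine similarity to a brain embedding and score top-k accuracy. The synthetic embeddings below stand in for the outputs of a brain-signal encoder aligned to a (human-aligned) image encoder.

import numpy as np

rng = np.random.default_rng(0)
n_items, dim = 100, 64
image_emb = rng.standard_normal((n_items, dim))
# Simulate brain embeddings as noisy versions of the matching image embeddings.
brain_emb = image_emb + 0.5 * rng.standard_normal((n_items, dim))

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

sims = normalize(brain_emb) @ normalize(image_emb).T      # cosine similarities
ranks = np.argsort(-sims, axis=1)
top5 = np.mean([i in ranks[i, :5] for i in range(n_items)])
print(f"top-5 retrieval accuracy: {top5:.2f}")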
Cloth-Splatting: 3D Cloth State Estimation from RGB Supervision
Longhini, Alberta, Büsching, Marcel, Duisterhof, Bardienus P., Lundell, Jens, Ichnowski, Jeffrey, Björkman, Mårten, Kragic, Danica
Teaching robots to fold, drape, or otherwise manipulate deformable objects such as cloth is fundamental to unlocking a variety of applications ranging from healthcare to domestic and industrial environments [1]. While considerable progress has been made in rigid-object manipulation, manipulating deformables poses unique challenges, including infinite-dimensional state spaces, complex physical dynamics, and state estimation of self-occluded configurations [2]. Specifically, the problem of state estimation has led existing works on visual manipulation to either rely exclusively on 2D images, overlooking the cloth's 3D structure [3, 4, 5], or to use 3D representations that neglect valuable information in RGB observations [6, 7, 8]. Prior work on cloth state estimation often relies on 3D particle-based representations derived from depth sensors, including graphs [9, 10] and point clouds [11]. While point clouds effectively capture the object's observable state, they lack comprehensive structural information [6].
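For concreteness, a minimal sketch of the graph-based particle representation mentioned above: a rectangular grid of cloth particles connected by neighbour edges. Grid size and spacing are arbitrary assumptions; a real system would estimate and track these particles from visual observations.

import numpy as np

rows, cols, spacing = 10, 8, 0.02
# Particle positions on a flat cloth (z = 0), shape (rows*cols, 3).
ii, jj = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
positions = np.stack([ii.ravel() * spacing, jj.ravel() * spacing,
                      np.zeros(rows * cols)], axis=1)

def node_id(i, j):
    return i * cols + j

edges = []
for i in range(rows):
    for j in range(cols):
        if i + 1 < rows:
            edges.append((node_id(i, j), node_id(i + 1, j)))  # vertical edge
        if j + 1 < cols:
            edges.append((node_id(i, j), node_id(i, j + 1)))  # horizontal edge
edges = np.array(edges)

print(positions.shape, edges.shape)   # (80, 3) nodes, (142, 2) edges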
A Riemannian Framework for Learning Reduced-order Lagrangian Dynamics
Friedl, Katharina, Jaquier, Noémie, Lundell, Jens, Asfour, Tamim, Kragic, Danica
By incorporating physical consistency as inductive bias, deep neural networks display increased generalization capabilities and data efficiency in learning nonlinear dynamic models. However, the complexity of these models generally increases with the system dimensionality, requiring larger datasets, more complex deep networks, and significant computational effort. We propose a novel geometric network architecture to learn physically-consistent reduced-order dynamic parameters that accurately describe the original high-dimensional system behavior. This is achieved by building on recent advances in model-order reduction and by adopting a Riemannian perspective to jointly learn a nonlinear structure-preserving latent space and the associated low-dimensional dynamics. Our approach enables accurate long-term predictions of the high-dimensional dynamics of rigid and deformable systems with increased data efficiency by inferring interpretable and physically plausible reduced Lagrangian models.
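A toy sketch of rolling out low-dimensional Lagrangian dynamics L(z, zdot) = 0.5 * zdot^T M zdot - V(z); in the paper, the mass matrix, potential, and the latent space itself are learned, whereas here they are fixed to a small mass-spring system purely to illustrate long-horizon prediction from a reduced model.

import numpy as np

dim = 2
M = np.diag([1.0, 2.0])            # latent mass matrix (constant in this toy)
K = np.diag([4.0, 1.0])            # stiffness defining V(z) = 0.5 * z^T K z
M_inv = np.linalg.inv(M)

def acceleration(z):
    # Euler-Lagrange with constant M:  M zdd = -dV/dz  =>  zdd = -M^{-1} K z
    return -M_inv @ (K @ z)

z, zdot, dt = np.array([1.0, 0.5]), np.zeros(dim), 0.01
trajectory = []
for _ in range(2000):              # 20 s rollout with semi-implicit Euler
    zdot = zdot + dt * acceleration(z)
    z = z + dt * zdot
    trajectory.append(z.copy())

trajectory = np.array(trajectory)
print(trajectory.shape, trajectory[-1].round(3))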