droid


SPEAR-1: Scaling Beyond Robot Demonstrations via 3D Understanding

Nikolov, Nikolay, Albanese, Giuliano, Dey, Sombit, Yanev, Aleksandar, Van Gool, Luc, Zaech, Jan-Nico, Paudel, Danda Pani

arXiv.org Artificial Intelligence

Robotic Foundation Models (RFMs) hold great promise as generalist, end-to-end systems for robot control. Yet their ability to generalize across new environments, tasks, and embodiments remains limited. We argue that a major bottleneck lies in their foundations: most RFMs are built by fine-tuning internet-pretrained Vision-Language Models (VLMs). However, these VLMs are trained on 2D image-language tasks and lack the 3D spatial reasoning inherently required for embodied control in the 3D world. Bridging this gap directly with large-scale robotic data is costly and difficult to scale. Instead, we propose to enrich easy-to-collect non-robotic image data with 3D annotations and enhance a pretrained VLM with 3D understanding capabilities. Following this strategy, we train SPEAR-VLM, a 3D-aware VLM that infers object coordinates in 3D space from a single 2D image. Building on SPEAR-VLM, we introduce our main contribution, SPEAR-1: a robotic foundation model that integrates grounded 3D perception with language-instructed embodied control. Trained on ~45M frames from 24 Open X-Embodiment datasets, SPEAR-1 outperforms or matches state-of-the-art models such as π0-FAST and π0.5, while using 20× fewer robot demonstrations. This carefully engineered training strategy unlocks new VLM capabilities and, as a consequence, boosts the reliability of embodied control beyond what is achievable with only robotic data. We make our model weights and 3D-annotated datasets publicly available.
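Inferring an object's 3D coordinates from a 2D image, as SPEAR-VLM does, ultimately rests on the geometry of the pinhole camera model. As an illustrative sketch only (not the paper's annotation pipeline, and with made-up intrinsics), a pixel with known metric depth can be back-projected into camera-frame 3D coordinates like this:

```python
import numpy as np

def backproject(u, v, depth, fx, fy, cx, cy):
    """Map a pixel (u, v) with metric depth to camera-frame 3D coordinates
    using the pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy."""
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])

# A pixel at the principal point lies on the optical axis,
# so its 3D point is (0, 0, depth) in the camera frame.
p = backproject(320.0, 240.0, 2.0, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
```

Annotations of this kind can be computed cheaply from ordinary images paired with depth estimates, which is what makes 3D supervision easy to scale compared with robot demonstrations.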


Bird or droid? Starlings nail R2-D2 beeps and boops.

Popular Science

The songbirds are even better at mimicking the 'Star Wars' robot than parrots. Songbirds like parrots and parakeets might be well known for squeaking out embarrassing one-liners and certain four-letter words, but those aren't the only sounds they can mimic. Birds have been observed copying dog barks, car alarms, and even chainsaws. But it turns out some species are better equipped to copy the droid's high-pitched beeps and boops than others.


Why human-shaped robots loom large in Musk's Tesla plans

BBC News

It has appeared in Tesla showrooms, on its factory floors and has even posed with Kim Kardashian. But Elon Musk's vision for his human-like robot Optimus is much grander than that. Since first unveiling it at a Tesla showcase in 2022, the tech billionaire has suggested his company's droid could play a huge role in the homes and lives of people all over the world. Along with self-driving robotaxis and Cybertrucks, Musk believes Tesla robots are key to establishing a foothold in the artificial intelligence (AI) landscape. And investors who signed off on his $1tn (£760bn) pay package on Thursday would appear to agree.


DROID: Dual Representation for Out-of-Scope Intent Detection

Rashwan, Wael, Zawbaa, Hossam M., Dutta, Sourav, Assem, Haytham

arXiv.org Artificial Intelligence

Detecting out-of-scope (OOS) user utterances remains a key challenge in task-oriented dialogue systems and, more broadly, in open-set intent recognition. Existing approaches often depend on strong distributional assumptions or auxiliary calibration modules. We present DROID (Dual Representation for Out-of-Scope Intent Detection), a compact end-to-end framework that combines two complementary encoders--the Universal Sentence Encoder (USE) for broad semantic generalization and a domain-adapted Transformer-based Denoising Autoencoder (TSDAE) for domain-specific contextual distinctions. Their fused representations are processed by a lightweight branched classifier with a single calibrated threshold that separates in-domain and OOS intents without post-hoc scoring. To enhance boundary learning under limited supervision, DROID incorporates both synthetic and open-domain outlier augmentation. Despite using only 1.5M trainable parameters, DROID consistently outperforms recent state-of-the-art baselines across multiple intent benchmarks, achieving macro-F1 improvements of 6-15% for known and 8-20% for OOS intents, with the largest gains in low-resource settings. These results demonstrate that dual-encoder representations with simple calibration can yield robust, scalable, and reliable OOS detection for neural dialogue systems.

Conversational AI systems are a primary interface for user assistance across sectors such as customer service, healthcare, and finance. A core requirement is intent classification--mapping utterances to predefined intents so downstream components can act appropriately [1].
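The core mechanism described above, fusing two encoder outputs and applying a single calibrated confidence threshold, can be sketched in a few lines. This is an illustrative toy with placeholder embeddings and logits, not DROID's actual encoders or classifier:

```python
import numpy as np

def fuse(use_emb, tsdae_emb):
    """Concatenate the two encoder outputs into one dual representation."""
    return np.concatenate([use_emb, tsdae_emb])

def classify(logits, threshold):
    """A single confidence threshold separates in-domain intents from
    out-of-scope (OOS) utterances, with no post-hoc scoring module."""
    probs = np.exp(logits) / np.exp(logits).sum()  # softmax
    if probs.max() < threshold:
        return "OOS"
    return int(probs.argmax())

# Confident logits map to an in-domain intent index;
# near-flat logits fall below the threshold and are flagged OOS.
fused = fuse(np.ones(4), np.zeros(4))  # placeholder 4-dim embeddings
confident = classify(np.array([5.0, 0.0, 0.0]), threshold=0.5)
uncertain = classify(np.array([0.1, 0.0, 0.1]), threshold=0.5)
```

The appeal of this design is that the threshold is the only rejection machinery: once it is calibrated on validation data, in-domain and OOS decisions come from the same forward pass.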


Grandfather builds the droids he was always looking for

Popular Science

Kurt Zimmerman brought Star Wars from a galaxy far, far away to Michigan. Kurt makes his droids out of wood, but they're filled and painted to look like metal. The wood exploded into a million pieces, covering the workshop floor. As he stood there looking at the mess he had just made, Kurt Zimmerman was at a crossroads moment.


A real issue: video game developers are being accused of using AI – even when they aren't

The Guardian

In April, game developer Stamina Zero achieved what should have been a marketing slam-dunk: the launch trailer for the studio's game Little Droid was published on PlayStation's official YouTube channel. The response was a surprise for the developer. The game looks interesting, people wrote in the comments, but was "ruined" by AI art. But the game's cover art, used as the thumbnail for the YouTube video, was in fact made by a real person, according to developer Lana Ro. "We know the artist, we've seen her work, so such a negative reaction was unexpected for us, and at first we didn't know how to respond or how to feel," Ro said. It's not wrong for people to be worried about AI use in video games – in fact, it's good to be sceptical, and ensure that the media you support aligns with your values. Common arguments against generative AI relate to environmental impact, art theft and just general quality, and video game developers are grappling with how generative AI will impact their jobs.


What Matters in Learning from Large-Scale Datasets for Robot Manipulation

Saxena, Vaibhav, Bronars, Matthew, Arachchige, Nadun Ranawaka, Wang, Kuancheng, Shin, Woo Chul, Nasiriany, Soroush, Mandlekar, Ajay, Xu, Danfei

arXiv.org Artificial Intelligence

Imitation learning from large multi-task demonstration datasets has emerged as a promising path for building generally-capable robots. As a result, thousands of hours have been spent on building such large-scale datasets around the globe. Despite the continuous growth of such efforts, we still lack a systematic understanding of what data should be collected to improve the utility of a robotics dataset and facilitate downstream policy learning. In this work, we conduct a large-scale dataset composition study to answer this question. We develop a data generation framework to procedurally emulate common sources of diversity in existing datasets (such as sensor placements and object types and arrangements), and use it to generate large-scale robot datasets with controlled compositions, enabling a suite of dataset composition studies that would be prohibitively expensive in the real world. We focus on two practical settings: (1) what types of diversity should be emphasized when future researchers collect large-scale datasets for robotics, and (2) how should current practitioners retrieve relevant demonstrations from existing datasets to maximize downstream policy performance on tasks of interest. Our study yields several critical insights -- for example, we find that camera poses and spatial arrangements are crucial dimensions for both diversity in collection and alignment in retrieval. In real-world robot learning settings, we find that not only do our insights from simulation carry over, but our retrieval strategies on existing datasets such as DROID allow us to consistently outperform existing training strategies by up to 70%. More results at https://robo-mimiclabs.github.io/
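Retrieval by alignment, as described above, amounts to ranking dataset demonstrations by how similar their features are to the target task. A minimal sketch, assuming each demonstration is summarized by a hypothetical feature vector (e.g. encoding camera pose and object arrangement; the paper's actual retrieval features may differ):

```python
import numpy as np

def retrieve(target_feat, dataset_feats, k):
    """Rank demonstrations by cosine similarity of their feature vectors
    to the target task, and return the indices of the top-k matches."""
    t = target_feat / np.linalg.norm(target_feat)
    d = dataset_feats / np.linalg.norm(dataset_feats, axis=1, keepdims=True)
    sims = d @ t                     # cosine similarity per demonstration
    return np.argsort(-sims)[:k]     # indices sorted by descending similarity

# Toy 2-dim features: demo 1 is most aligned with the target, then demo 2.
target = np.array([1.0, 0.0])
demos = np.array([[0.0, 1.0],
                  [0.9, 0.1],
                  [0.5, 0.5]])
top = retrieve(target, demos, k=2)
```

The retrieved subset can then be used to train or fine-tune a policy, rather than training on the full heterogeneous dataset.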


DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

Khazatsky, Alexander, Pertsch, Karl, Nair, Suraj, Balakrishna, Ashwin, Dasari, Sudeep, Karamcheti, Siddharth, Nasiriany, Soroush, Srirama, Mohan Kumar, Chen, Lawrence Yunliang, Ellis, Kirsty, Fagan, Peter David, Hejna, Joey, Itkina, Masha, Lepert, Marion, Ma, Yecheng Jason, Miller, Patrick Tree, Wu, Jimmy, Belkhale, Suneel, Dass, Shivin, Ha, Huy, Jain, Arhan, Lee, Abraham, Lee, Youngwoon, Memmel, Marius, Park, Sungjae, Radosavovic, Ilija, Wang, Kaiyuan, Zhan, Albert, Black, Kevin, Chi, Cheng, Hatch, Kyle Beltran, Lin, Shan, Lu, Jingpei, Mercat, Jean, Rehman, Abdul, Sanketi, Pannag R, Sharma, Archit, Simpson, Cody, Vuong, Quan, Walke, Homer Rich, Wulfe, Blake, Xiao, Ted, Yang, Jonathan Heewon, Yavary, Arefeh, Zhao, Tony Z., Agia, Christopher, Baijal, Rohan, Castro, Mateo Guaman, Chen, Daphne, Chen, Qiuyu, Chung, Trinity, Drake, Jaimyn, Foster, Ethan Paul, Gao, Jensen, Herrera, David Antonio, Heo, Minho, Hsu, Kyle, Hu, Jiaheng, Jackson, Donovon, Le, Charlotte, Li, Yunshuang, Lin, Kevin, Lin, Roy, Ma, Zehan, Maddukuri, Abhiram, Mirchandani, Suvir, Morton, Daniel, Nguyen, Tony, O'Neill, Abigail, Scalise, Rosario, Seale, Derick, Son, Victor, Tian, Stephen, Tran, Emi, Wang, Andrew E., Wu, Yilin, Xie, Annie, Yang, Jingyun, Yin, Patrick, Zhang, Yunchu, Bastani, Osbert, Berseth, Glen, Bohg, Jeannette, Goldberg, Ken, Gupta, Abhinav, Gupta, Abhishek, Jayaraman, Dinesh, Lim, Joseph J, Malik, Jitendra, Martín-Martín, Roberto, Ramamoorthy, Subramanian, Sadigh, Dorsa, Song, Shuran, Wu, Jiajun, Yip, Michael C., Zhu, Yuke, Kollar, Thomas, Levine, Sergey, Finn, Chelsea

arXiv.org Artificial Intelligence

The creation of large, diverse, high-quality robot manipulation datasets is an important stepping stone on the path toward more capable and robust robotic manipulation policies. However, creating such datasets is challenging: collecting robot manipulation data in diverse environments poses logistical and safety challenges and requires substantial investments in hardware and human labour. As a result, even the most general robot manipulation policies today are mostly trained on data collected in a small number of environments with limited scene and task diversity. In this work, we introduce DROID (Distributed Robot Interaction Dataset), a diverse robot manipulation dataset with 76k demonstration trajectories or 350 hours of interaction data, collected across 564 scenes and 84 tasks by 50 data collectors in North America, Asia, and Europe over the course of 12 months. We demonstrate that training with DROID leads to policies with higher performance and improved generalization ability. We open source the full dataset, policy learning code, and a detailed guide for reproducing our robot hardware setup.


Improving Model's Focus Improves Performance of Deep Learning-Based Synthetic Face Detectors

Piland, Jacob, Czajka, Adam, Sweet, Christopher

arXiv.org Artificial Intelligence

Deep learning-based models generalize better to unknown data samples after being guided "where to look" by incorporating human perception into training strategies. We observed that the salience entropy of models trained in this way is lower than that of models trained without human perceptual guidance. This raises a question: does further increasing the model's focus, by lowering the entropy of its class activation map, further improve performance? In this paper we propose and evaluate several new entropy-based loss function components controlling the model's focus, covering the full range of such control, from none to "aggressive" entropy minimization. We show, using the problem of synthetic face detection, that improving the model's focus through lowering entropy leads to models that perform better in an open-set scenario, in which the test samples are synthesized by unknown generative models. We also show that optimal performance is obtained when the model's loss function blends three aspects: regular classification, low entropy of the model's focus, and human-guided saliency.
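The entropy term at the center of this abstract is the Shannon entropy of the class activation map, treated as a probability distribution over spatial locations. A minimal sketch (the blending weight `lam` and the two-term blend are illustrative; the paper's full loss also includes a human-guided saliency term):

```python
import numpy as np

def salience_entropy(cam, eps=1e-12):
    """Shannon entropy of a class activation map, after normalizing it
    into a probability distribution over spatial locations."""
    p = cam.flatten()
    p = p / (p.sum() + eps)
    return float(-(p * np.log(p + eps)).sum())

def blended_loss(ce_loss, cam, lam):
    """Blend the regular classification loss with a low-entropy (focus)
    penalty; lam controls how aggressively entropy is minimized."""
    return ce_loss + lam * salience_entropy(cam)

# A perfectly focused map has (near) zero entropy;
# a uniform map over N locations has entropy log(N).
focused = np.array([[0.0, 0.0],
                    [0.0, 1.0]])
uniform = np.ones((2, 2))
```

Minimizing this term concentrates the activation map on fewer locations, which is exactly the "increased focus" whose effect on open-set performance the paper studies.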


DROID: Driver-centric Risk Object Identification

Li, Chengxi, Chan, Stanley H., Chen, Yi-Ting

arXiv.org Artificial Intelligence

Identification of high-risk driving situations is generally approached through collision risk estimation or accident pattern recognition. In this work, we approach the problem from the perspective of subjective risk. We operationalize subjective risk assessment by predicting driver behavior changes and identifying the cause of changes. To this end, we introduce a new task called driver-centric risk object identification (DROID), which uses egocentric video to identify object(s) influencing a driver's behavior, given only the driver's response as the supervision signal. We formulate the task as a cause-effect problem and present a novel two-stage DROID framework, taking inspiration from models of situation awareness and causal inference. A subset of data constructed from the Honda Research Institute Driving Dataset (HDD) is used to evaluate DROID. We demonstrate state-of-the-art DROID performance, even compared with strong baseline models using this dataset. Additionally, we conduct extensive ablation studies to justify our design choices. Moreover, we demonstrate the applicability of DROID for risk assessment.