Kroemer, Oliver
GraphEQA: Using 3D Semantic Scene Graphs for Real-time Embodied Question Answering
Saxena, Saumya, Buchanan, Blake, Paxton, Chris, Chen, Bingqing, Vaskevicius, Narunas, Palmieri, Luigi, Francis, Jonathan, Kroemer, Oliver
... explore and develop a semantic understanding of an unseen environment in order to answer a situated question with confidence. This remains a challenging problem in robotics, due to the difficulties in obtaining useful semantic representations, updating these representations online, and leveraging prior world knowledge for efficient exploration and planning. Aiming to address these limitations, we propose GraphEQA, a novel approach that utilizes real-time 3D metric-semantic scene graphs (3DSGs) and task-relevant images as multi-modal memory for grounding Vision-Language Models (VLMs) to perform EQA tasks in unseen environments. We employ a hierarchical planning approach that exploits the hierarchical nature of 3DSGs for structured planning and semantic-guided exploration. Through experiments ...
For example, to answer the question "How many chairs are there at the dining table?", the agent might rely on commonsense knowledge to understand that dining tables are often associated with dining rooms and dining rooms are usually near the kitchen towards the back of a home. A reasonable navigation strategy would involve navigating to the back of the house to locate a kitchen. To ground this search in the current environment, however, requires the agent to continually maintain an understanding of where it is, memory of where it has been, and what further exploratory actions will lead it to relevant regions. Finally, the agent needs to observe the target object(s) and perform visual grounding, in order to reason about the number of chairs around the dining table, and confidently answer the question correctly.
SonicBoom: Contact Localization Using Array of Microphones
Lee, Moonyoung, Yoo, Uksang, Oh, Jean, Ichnowski, Jeffrey, Kantor, George, Kroemer, Oliver
In cluttered environments where visual sensors encounter heavy occlusion, such as in agricultural settings, tactile signals can provide crucial spatial information for the robot to locate rigid objects and maneuver around them. We introduce SonicBoom, a holistic hardware and learning pipeline that enables contact localization through an array of contact microphones. While conventional sound source localization methods effectively triangulate sources in air, localization through solid media with irregular geometry and structure presents challenges that are difficult to model analytically. We address this challenge through a feature-engineering and learning-based approach, autonomously collecting 18,000 robot interaction sound pairs to learn a mapping between acoustic signals and collision locations on the robot end-effector link. By leveraging relative features between microphones, SonicBoom achieves localization errors of 0.42 cm for in-distribution interactions and maintains robust performance of 2.22 cm error even with novel objects and contact conditions. We demonstrate the system's practical utility through haptic mapping of occluded branches in mock canopy settings, showing that acoustic-based sensing can enable reliable robot navigation in visually challenging environments.
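The abstract above does not say which relative features SonicBoom uses. As a minimal sketch of one common kind of relative feature between microphones, the code below estimates the pairwise arrival-time difference (in samples) by brute-force cross-correlation. The function names `cross_correlation_lag` and `pairwise_lag_features` are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def cross_correlation_lag(sig_a, sig_b, max_lag):
    """Return the lag (in samples) at which sig_b best aligns with sig_a.

    A positive lag means the event shows up later in sig_b than in sig_a.
    Illustrative assumption: brute-force correlation over integer lags.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, a in enumerate(sig_a):
            j = i + lag
            if 0 <= j < len(sig_b):
                score += a * sig_b[j]
        # Strict comparison keeps the first lag found when scores tie.
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def pairwise_lag_features(signals, max_lag):
    """Compute the arrival-time difference for every microphone pair."""
    return {
        (i, j): cross_correlation_lag(signals[i], signals[j], max_lag)
        for i, j in combinations(range(len(signals)), 2)
    }


def impulse(length, index):
    """Toy signal: a single unit spike at the given sample index."""
    return [1.0 if k == index else 0.0 for k in range(length)]


# Three toy microphone channels hearing the same tap at samples 5, 8 and 6.
channels = [impulse(16, 5), impulse(16, 8), impulse(16, 6)]
features = pairwise_lag_features(channels, max_lag=6)
```

Features of this kind depend only on differences between channels, not on the absolute timing of the contact. A learned model could then map them to a contact location, although how SonicBoom does so is not described here.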
Autonomous Sensor Exchange and Calibration for Cornstalk Nitrate Monitoring Robot
Lee, Janice Seungyeon, Detlefsen, Thomas, Lawande, Shara, Ghatge, Saudamini, Shanthi, Shrudhi Ramesh, Mukkamala, Sruthi, Kantor, George, Kroemer, Oliver
Interactive sensors are an important component of robotic systems but often require manual replacement due to wear and tear. Automating this process can enhance system autonomy and facilitate long-term deployment. We developed an autonomous sensor exchange and calibration system for an agricultural crop monitoring robot that inserts a nitrate sensor into cornstalks. A novel gripper and replacement mechanism, featuring a reliable funneling design, were developed to enable efficient and reliable sensor exchanges. To maintain consistent nitrate sensor measurement, an on-board sensor calibration station was integrated to provide in-field sensor cleaning and calibration. The system was deployed at the Ames Curtis Farm in June 2024, where it successfully inserted nitrate sensors with high accuracy into 30 cornstalks with a 77% success rate.
RecoveryChaining: Learning Local Recovery Policies for Robust Manipulation
Vats, Shivam, Jha, Devesh K., Likhachev, Maxim, Kroemer, Oliver, Romeres, Diego
Model-based planners and controllers are commonly used to solve complex manipulation problems as they can efficiently optimize diverse objectives and generalize to long-horizon tasks. However, they are limited by the fidelity of their model, which often leads to failures during deployment. To enable a robot to recover from such failures, we propose to use hierarchical reinforcement learning to learn a separate recovery policy. The recovery policy is triggered when a failure is detected based on sensory observations and seeks to take the robot to a state from which it can complete the task using the nominal model-based controllers. Our approach, called RecoveryChaining, uses a hybrid action space, where the model-based controllers are provided as additional nominal options, which allows the recovery policy to decide how to recover, when to switch to a nominal controller, and which controller to switch to, even with sparse rewards. We evaluate our approach in three multi-step manipulation tasks with sparse rewards, where it learns significantly more robust recovery policies than those learned by baselines. Finally, we successfully transfer recovery policies learned in simulation to a physical robot to demonstrate the feasibility of sim-to-real transfer with our method.
Tilde: Teleoperation for Dexterous In-Hand Manipulation Learning with a DeltaHand
Si, Zilin, Zhang, Kevin Lee, Temel, Zeynep, Kroemer, Oliver
Dexterous robotic manipulation remains a challenging domain due to its strict demands for precision and robustness on both hardware and software. While dexterous robotic hands have demonstrated remarkable capabilities in complex tasks, efficiently learning adaptive control policies for hands still presents a significant hurdle given the high dimensionalities of hands and tasks. To bridge this gap, we propose Tilde, an imitation learning-based in-hand manipulation system on a dexterous DeltaHand. It leverages 1) a low-cost, configurable, simple-to-control, soft dexterous robotic hand, DeltaHand, 2) a user-friendly, precise, real-time teleoperation interface, TeleHand, and 3) an efficient and generalizable imitation learning approach with diffusion policies. Our proposed TeleHand has a kinematic twin design to the DeltaHand that enables precise one-to-one joint control of the DeltaHand during teleoperation. This facilitates efficient high-quality data collection of human demonstrations in the real world. To evaluate the effectiveness of our system, we demonstrate the fully autonomous closed-loop deployment of diffusion policies learned from demonstrations across seven dexterous manipulation tasks with an average 90% success rate.
Leveraging Simulation-Based Model Preconditions for Fast Action Parameter Optimization with Multiple Models
Seker, M. Yunus, Kroemer, Oliver
Optimizing robotic action parameters is a significant challenge for manipulation tasks that demand high levels of precision and generalization. Using a model-based approach, the robot must quickly reason about the outcomes of different actions using a predictive model to find a set of parameters that will have the desired effect. The model may need to capture the behaviors of rigid and deformable objects, as well as objects of various shapes and sizes. Predictive models often need to trade off speed for prediction accuracy and generalization. This paper proposes a framework that leverages the strengths of multiple predictive models, including analytical, learned, and simulation-based models, to enhance the efficiency and accuracy of action parameter optimization. Our approach uses Model Deviation Estimators (MDEs) to determine the most suitable predictive model for any given state-action parameters, allowing the robot to select models to make fast and precise predictions. We extend the MDE framework by not only learning sim-to-real MDEs, but also sim-to-sim MDEs. Our experiments show that these sim-to-sim MDEs provide significantly faster parameter optimization as well as a basis for efficiently learning sim-to-real MDEs through finetuning. The ease of collecting sim-to-sim training data also allows the robot to learn MDEs based directly on visual inputs and local material properties.
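The abstract above does not spell out the selection rule. A minimal sketch, assuming each model comes with a deviation estimator and a relative evaluation cost, is to pick the cheapest model whose estimated deviation is within a tolerance, and otherwise fall back to the model with the lowest estimated deviation. `CandidateModel`, `select_model`, the cost values, and the toy estimators are all hypothetical illustrations, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class CandidateModel:
    name: str
    cost: float  # hypothetical relative evaluation time (lower is faster)
    # Stand-in for a learned MDE: maps (state, action_params) to an
    # estimated prediction error for this model.
    estimate_deviation: Callable[[Sequence[float], Sequence[float]], float]


def select_model(models, state, action_params, tolerance):
    """Pick the cheapest model whose estimated deviation is within tolerance.

    If no model is estimated to be accurate enough, fall back to the one
    with the lowest estimated deviation (an assumption of this sketch).
    """
    scored = [(m, m.estimate_deviation(state, action_params)) for m in models]
    acceptable = [(m, d) for m, d in scored if d <= tolerance]
    if acceptable:
        return min(acceptable, key=lambda pair: pair[0].cost)[0]
    return min(scored, key=lambda pair: pair[1])[0]


# Toy estimators: the analytical model degrades with push magnitude,
# while the simulator is uniformly accurate but expensive.
analytical = CandidateModel("analytical", 1.0, lambda s, a: abs(a[0]))
simulator = CandidateModel("simulator", 100.0, lambda s, a: 0.01)
models = [analytical, simulator]

small_push = select_model(models, [0.0], [0.02], tolerance=0.05)
large_push = select_model(models, [0.0], [0.5], tolerance=0.05)
```

In this toy setup the fast analytical model is chosen for the small push and the slow simulator for the large one, which illustrates the speed/accuracy trade-off the abstract describes.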
Hefty: A Modular Reconfigurable Robot for Advancing Robot Manipulation in Agriculture
Guri, Dominic, Lee, Moonyoung, Kroemer, Oliver, Kantor, George
This paper presents a modular, reconfigurable robot platform for robot manipulation in agriculture. While robot manipulation promises great advancements in automating challenging, complex tasks that are currently best left to humans, it is also an expensive capital investment for researchers and users because it demands significantly varying robot configurations depending on the task. Modular robots provide a way to obtain multiple configurations and reduce costs by enabling incremental acquisition of only the necessary modules. The robot we present, Hefty, is designed to be modular and reconfigurable. It is designed for both researchers and end-users as a means to improve technology transfer from research to real-world application. This paper provides a detailed design and integration process, outlining the critical design decisions that enable modularity in the mobility of the robot as well as its sensor payload, power systems, computing, and fixture mounting. We demonstrate the utility of the robot by presenting five configurations used in multiple real-world agricultural robotics applications.
MResT: Multi-Resolution Sensing for Real-Time Control with Vision-Language Models
Saxena, Saumya, Sharma, Mohit, Kroemer, Oliver
Leveraging sensing modalities across diverse spatial and temporal resolutions can improve performance of robotic manipulation tasks. Multi-spatial resolution sensing provides hierarchical information captured at different spatial scales and enables both coarse and precise motions. Simultaneously, multi-temporal resolution sensing enables the agent to exhibit high reactivity and real-time control. In this work, we propose a framework, MResT (Multi-Resolution Transformer), for learning generalizable language-conditioned multi-task policies that utilize sensing at different spatial and temporal resolutions using networks of varying capacities to effectively perform real-time control of precise and reactive tasks. We leverage off-the-shelf pretrained vision-language models to operate on low-frequency global features along with small non-pretrained models to adapt to high-frequency local feedback. Through extensive experiments in 3 domains (coarse, precise and dynamic manipulation tasks), we show that our approach significantly improves (2X on average) over recent multi-task baselines. Further, our approach generalizes well to visual and geometric variations in target objects and to varying interaction forces.
Task-Oriented Active Learning of Model Preconditions for Inaccurate Dynamics Models
LaGrassa, Alex, Lee, Moonyoung, Kroemer, Oliver
When planning with an inaccurate dynamics model, a practical strategy is to restrict planning to regions of state-action space where the model is accurate, also known as a model precondition. Empirical real-world trajectory data is valuable for defining data-driven model preconditions regardless of the model form (analytical, simulator, learned, etc.). However, real-world data is often expensive and dangerous to collect. In order to achieve data efficiency, this paper presents an algorithm for actively selecting trajectories to learn a model precondition for an inaccurate pre-specified dynamics model. Our proposed techniques address challenges arising from the sequential nature of trajectories and the potential benefit of prioritizing task-relevant data. The experimental analysis shows how algorithmic properties affect performance in three planning scenarios: icy gridworld, simulated plant watering, and real-world plant watering. Results demonstrate an improvement of approximately 80% after only four real-world trajectories when using our proposed techniques.
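As a minimal sketch of how trajectory data could supply training labels for a data-driven model precondition, the code below marks each observed transition as inside or outside the precondition by comparing the model's prediction with the observed next state. The Euclidean deviation measure, the threshold, the toy gridworld model, and the name `label_transitions` are illustrative assumptions. The active trajectory selection described in the abstract is not shown.

```python
import math


def label_transitions(transitions, model, max_deviation):
    """Label each (state, action, observed_next_state) transition.

    A transition counts as inside the model precondition when the model's
    predicted next state is within max_deviation (Euclidean distance) of
    the observed next state. The metric and threshold are assumptions.
    """
    labeled = []
    for state, action, observed in transitions:
        predicted = model(state, action)
        deviation = math.dist(predicted, observed)
        labeled.append((state, action, deviation <= max_deviation))
    return labeled


def nominal_model(state, action):
    """Toy gridworld model: the agent moves exactly by the commanded step."""
    return (state[0] + action[0], state[1] + action[1])


# Toy data: on the icy cell (x == 3) the agent slides one extra cell,
# which the nominal model does not capture.
observed_transitions = [
    ((0, 0), (1, 0), (1, 0)),  # normal floor: model is accurate
    ((3, 0), (1, 0), (5, 0)),  # ice: model is off by one cell
]
labels = label_transitions(observed_transitions, nominal_model, max_deviation=0.5)
```

A classifier trained on such labels could then tell a planner which state-action pairs to avoid, since the model is not trusted there.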
DELTAHANDS: A Synergistic Dexterous Hand Framework Based on Delta Robots
Si, Zilin, Zhang, Kevin, Kroemer, Oliver, Temel, F. Zeynep
Dexterous robotic manipulation in unstructured environments can aid in everyday tasks such as cleaning and caretaking. Anthropomorphic robotic hands are highly dexterous and theoretically well-suited for working in human domains, but their complex designs and dynamics often make them difficult to control. By contrast, parallel-jaw grippers are easy to control and are used extensively in industrial applications, but they lack the dexterity for various kinds of grasps and in-hand manipulations. In this work, we present DELTAHANDS, a synergistic dexterous hand framework with Delta robots. The DELTAHANDS are soft, easy to reconfigure, simple to manufacture with low-cost off-the-shelf materials, and possess high degrees of freedom that can be easily controlled. DELTAHANDS' dexterity can be adjusted for different applications by leveraging actuation synergies, which can further reduce the control complexity, overall cost, and energy consumption. We characterize the Delta robots' kinematic accuracy, force profiles, and workspace range to assist with hand design. Finally, we evaluate the versatility of DELTAHANDS by grasping a diverse set of objects and by using teleoperation to complete three dexterous manipulation tasks: cloth folding, cap opening, and cable arrangement. We open-source our hand framework at https://sites.google.com/view/deltahands/.