Domae, Yukiyasu
Learning Bimanual Manipulation via Action Chunking and Inter-Arm Coordination with Transformers
Motoda, Tomohiro, Hanai, Ryo, Nakajo, Ryoichi, Murooka, Masaki, Erich, Floris, Domae, Yukiyasu
Robots that operate autonomously in human living environments need the ability to handle various tasks flexibly. One crucial element is coordinated bimanual movement, which enables functions that are difficult to perform with one hand alone. In recent years, learning-based models that focus on the possibilities of bimanual movements have been proposed. However, the robot's many degrees of freedom make control difficult to reason about, and the left and right arms must adjust their actions to the situation, which makes dexterous tasks hard to realize. To address this issue, we focus on coordination and efficiency between the arms, particularly for synchronized actions, and propose a novel imitation learning architecture that predicts cooperative actions. We separate the architecture for the two arms and add an intermediate encoder layer, the Inter-Arm Coordinated transformer Encoder (IACE), which facilitates synchronization and temporal alignment to ensure smooth and coordinated actions. To verify the effectiveness of our architecture, we perform distinctive bimanual tasks. The experimental results show that our model achieves a higher success rate than the comparison methods and suggest a suitable architecture for policy learning of bimanual manipulation.
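The abstract describes arm-specific encoding joined by an intermediate coordination encoder that predicts chunks of future actions for both arms. The sketch below is a minimal, hypothetical PyTorch layout of that idea; the layer sizes, chunk length, and the way the two arm streams are fused are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of an action-chunking policy with an inter-arm coordination
# encoder. All dimensions and the fusion scheme are illustrative assumptions.
import torch
import torch.nn as nn


class BimanualChunkPolicy(nn.Module):
    def __init__(self, obs_dim=64, d_model=128, chunk_len=20, action_dim=7):
        super().__init__()
        # Separate encoders so each arm keeps its own feature stream.
        self.left_enc = nn.Linear(obs_dim, d_model)
        self.right_enc = nn.Linear(obs_dim, d_model)
        # Inter-arm coordination: self-attention over the concatenated
        # left/right token sequences (stand-in for the IACE layer).
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.iace = nn.TransformerEncoder(layer, num_layers=2)
        # Per-arm heads that emit a chunk of future actions in one shot.
        self.left_head = nn.Linear(d_model, chunk_len * action_dim)
        self.right_head = nn.Linear(d_model, chunk_len * action_dim)
        self.chunk_len, self.action_dim = chunk_len, action_dim

    def forward(self, left_obs, right_obs):
        # left_obs, right_obs: (batch, seq, obs_dim) per-arm observation features
        l = self.left_enc(left_obs)
        r = self.right_enc(right_obs)
        fused = self.iace(torch.cat([l, r], dim=1))   # joint attention over both arms
        l_ctx = fused[:, : l.shape[1]].mean(1)
        r_ctx = fused[:, l.shape[1]:].mean(1)
        left_chunk = self.left_head(l_ctx).view(-1, self.chunk_len, self.action_dim)
        right_chunk = self.right_head(r_ctx).view(-1, self.chunk_len, self.action_dim)
        return left_chunk, right_chunk


if __name__ == "__main__":
    policy = BimanualChunkPolicy()
    left, right = torch.randn(2, 10, 64), torch.randn(2, 10, 64)
    la, ra = policy(left, right)
    print(la.shape, ra.shape)  # (2, 20, 7) each
```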
SuctionPrompt: Visual-assisted Robotic Picking with a Suction Cup Using Vision-Language Models and Facile Hardware Design
Motoda, Tomohiro, Kitamura, Takahide, Hanai, Ryo, Domae, Yukiyasu
The development of large language models and vision-language models (VLMs) has led to the increasing use of robotic systems in various fields. However, effectively integrating these models into real-world robotic tasks remains a key challenge. We developed a versatile robotic system called SuctionPrompt that combines VLM prompting techniques with 3D detections to perform product-picking tasks in diverse and dynamic environments. Our method highlights the importance of integrating 3D spatial information with adaptive action planning to enable robots to approach and manipulate objects in novel environments. In the validation experiments, the system selected suction points accurately in 75.4% of cases and achieved a 65.0% success rate in picking common items. This study highlights the effectiveness of VLMs in robotic manipulation tasks, even with simple 3D processing.
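As a rough illustration of pairing lightweight 3D processing with VLM prompting for suction-point selection, the sketch below proposes candidate suction points from a depth image (flat, nearby patches), renders them into a text prompt, and parses the model's choice. The candidate heuristic, the prompt wording, and the query_vlm stub are assumptions for illustration only; they are not the SuctionPrompt implementation.

```python
# Illustrative sketch: propose suction candidates from depth, ask a VLM to pick one.
# The VLM call is stubbed out; in practice it would be an API or local model query.
import numpy as np


def suction_candidates(depth, patch=15, top_k=5):
    """Rank pixels whose local depth patch is flat (low variance) and near the camera."""
    h, w = depth.shape
    scores = []
    for v in range(patch, h - patch, patch):
        for u in range(patch, w - patch, patch):
            window = depth[v - patch // 2 : v + patch // 2, u - patch // 2 : u + patch // 2]
            flatness = -np.var(window)          # flatter surface -> higher score
            closeness = -depth[v, u]            # closer to the camera -> higher score
            scores.append((flatness + 0.1 * closeness, (u, v)))
    scores.sort(reverse=True)
    return [pt for _, pt in scores[:top_k]]


def build_prompt(task, candidates):
    lines = [f"Task: pick the {task} with a suction cup.",
             "Candidate suction points (pixel coordinates):"]
    lines += [f"  {i}: {pt}" for i, pt in enumerate(candidates)]
    lines.append("Answer with the index of the best point.")
    return "\n".join(lines)


def query_vlm(prompt, image):
    """Hypothetical stand-in for a vision-language model query."""
    return "0"


if __name__ == "__main__":
    depth = np.random.uniform(0.4, 0.8, (240, 320))
    cands = suction_candidates(depth)
    choice = int(query_vlm(build_prompt("juice box", cands), image=None))
    print("selected suction point:", cands[choice])
```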
Visual Imitation Learning of Non-Prehensile Manipulation Tasks with Dynamics-Supervised Models
Mustafa, Abdullah, Hanai, Ryo, Ramirez, Ixchel, Erich, Floris, Nakajo, Ryoichi, Domae, Yukiyasu, Ogata, Tetsuya
Unlike quasi-static robotic manipulation tasks such as pick-and-place, dynamic tasks such as non-prehensile manipulation pose greater challenges, especially for vision-based control. Successful control requires the extraction of features relevant to the target task. In visual imitation learning settings, these features can be learnt by backpropagating the policy loss through the vision backbone. Yet this approach tends to learn task-specific features with limited generalizability. Alternatively, learning world models can yield more generalizable vision backbones, on top of which task-specific policies are subsequently trained. Commonly, these models are trained solely to predict the next RGB state from the current state and action. However, RGB-only prediction might not fully capture the task-relevant dynamics. In this work, we hypothesize that direct supervision of target dynamic states (Dynamics Mapping) can produce better dynamics-informed world models. Besides next-RGB reconstruction, the world model is also trained to directly predict the position, velocity, and acceleration of the rigid bodies in the environment. To verify our hypothesis, we designed a non-prehensile 2D environment tailored to two tasks: "Balance-Reaching" and "Bin-Dropping". When trained on the first task, dynamics mapping enhanced task performance across different training configurations (Decoupled, Joint, End-to-End) and policy architectures (Feedforward, Recurrent). Notably, its most significant impact was on world model pretraining, boosting the success rate from 21% to 85%. Frozen dynamics-informed world models generalized well to a task with in-domain dynamics, but poorly to one with out-of-domain dynamics.
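One compact way to read the dynamics-mapping idea is a world model trained with two losses: next-RGB reconstruction plus direct regression of rigid-body position, velocity, and acceleration. The sketch below is a hypothetical PyTorch version of that training objective; the encoder/decoder shapes and the loss weighting are assumptions, not the paper's architecture.

```python
# Sketch: world model supervised on next-RGB reconstruction AND rigid-body dynamics.
import torch
import torch.nn as nn


class DynamicsWorldModel(nn.Module):
    def __init__(self, latent=128, action_dim=2, n_bodies=3):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(), nn.Linear(32 * 16 * 16, latent))
        self.transition = nn.Sequential(
            nn.Linear(latent + action_dim, latent), nn.ReLU(), nn.Linear(latent, latent))
        self.rgb_decoder = nn.Sequential(
            nn.Linear(latent, 32 * 16 * 16), nn.Unflatten(1, (32, 16, 16)),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1))
        # Dynamics mapping head: 2D position, velocity, acceleration per rigid body.
        self.dyn_head = nn.Linear(latent, n_bodies * 6)

    def forward(self, rgb, action):
        z = self.transition(torch.cat([self.encoder(rgb), action], dim=-1))
        return self.rgb_decoder(z), self.dyn_head(z)


if __name__ == "__main__":
    model = DynamicsWorldModel()
    rgb, act = torch.randn(4, 3, 64, 64), torch.randn(4, 2)
    next_rgb_gt = torch.randn(4, 3, 64, 64)
    dyn_gt = torch.randn(4, 18)            # 3 bodies x (pos, vel, acc) in 2D
    pred_rgb, pred_dyn = model(rgb, act)
    loss = nn.functional.mse_loss(pred_rgb, next_rgb_gt) \
         + 0.5 * nn.functional.mse_loss(pred_dyn, dyn_gt)   # assumed loss weighting
    loss.backward()
    print(float(loss))
```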
Motion Priority Optimization Framework towards Automated and Teleoperated Robot Cooperation in Industrial Recovery Scenarios
Itadera, Shunki, Domae, Yukiyasu
In this study, we introduce an optimization framework aimed at enhancing the efficiency of motion priority design in scenarios involving automated and teleoperated robots within an industrial recovery context. The escalating utilization of industrial robots at manufacturing sites has been instrumental in mitigating human workload. Nevertheless, achieving effective human-robot collaboration and cooperation, in which human workers and robots share a workspace for collaborative tasks, remains a challenge. When an industrial robot encounters a failure, the corresponding factory cell must be suspended for safe recovery. Given the limited capacity of pre-programmed robots to rectify such failures, human intervention becomes imperative, requiring a worker to enter the robot workspace to address the failure, such as a dropped object, while the robot system is halted. This interruption of the manufacturing process results in productivity loss. Robotic teleoperation has emerged as a promising technology enabling human workers to undertake high-risk tasks remotely and safely. Our study advocates incorporating robotic teleoperation into the recovery process during manufacturing failure scenarios, which we refer to as "Cooperative Tele-Recovery". Our approach formulates priority rules designed to facilitate collision avoidance between the manufacturing and recovery robots, ensuring a continuous manufacturing process with minimal production loss within a configurable risk limitation. We present a comprehensive motion priority optimization framework, encompassing HRC simulator-based priority optimization and a cooperative multi-robot controller, to identify optimal parameters for the priority function. The framework dynamically adjusts the allocation of motion priorities for the manufacturing and recovery robots while adhering to predefined risk limitations.
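To make the optimization loop concrete, the sketch below shows one hedged reading of it: a parameterized priority function decides whether the manufacturing robot keeps right of way or yields to the recovery robot, and a simulator-in-the-loop search picks the parameters that minimize production loss while staying under a configurable risk limit. The priority form, the simulator stub, and the grid search are illustrative assumptions, not the proposed controller.

```python
# Sketch: simulator-in-the-loop search for motion-priority parameters under a risk limit.
import itertools
import random


def priority(params, dist_to_recovery_robot, task_urgency):
    """Higher value -> the manufacturing robot keeps right of way; otherwise it yields."""
    w_dist, w_urgency = params
    return w_dist * dist_to_recovery_robot + w_urgency * task_urgency


def simulate(params, n_episodes=50, seed=0):
    """Stand-in for an HRC simulator: returns (mean production loss, collision risk)."""
    rng = random.Random(seed)
    losses, risks = [], []
    for _ in range(n_episodes):
        dist, urgency = rng.uniform(0.1, 2.0), rng.uniform(0.0, 1.0)
        keep_going = priority(params, dist, urgency) > 1.0
        # Yielding costs production time; pressing on while close to the other robot adds risk.
        losses.append(0.0 if keep_going else 1.0)
        risks.append(1.0 if keep_going and dist < 0.3 else 0.0)
    return sum(losses) / n_episodes, sum(risks) / n_episodes


def optimize(risk_limit=0.05):
    best = None
    for w_dist, w_urgency in itertools.product([0.5, 1.0, 2.0], [0.0, 0.5, 1.0]):
        loss, risk = simulate((w_dist, w_urgency))
        if risk <= risk_limit and (best is None or loss < best[0]):
            best = (loss, risk, (w_dist, w_urgency))
    return best


if __name__ == "__main__":
    print(optimize())   # (production loss, risk, best priority weights)
```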
NeuralLabeling: A versatile toolset for labeling vision datasets using Neural Radiance Fields
Erich, Floris, Chiba, Naoya, Yoshiyasu, Yusuke, Ando, Noriaki, Hanai, Ryo, Domae, Yukiyasu
Models trained using weakly supervised learning might outperform state-of-the-art (SOTA) models when the SOTA models are not trained on task-specific data, but their performance is lower than that of SOTA models evaluated on data more similar to their training data. Thus there is a need for tools that can support large dataset creation in a time-efficient and low-cost manner. We hope to contribute to solving this problem by introducing a labeling tool for computer vision datasets that uses the power of Neural Radiance Fields (NeRF) [5] for photorealistic rendering and geometric understanding. Because 3D vision can take advantage of 3D consistency, labeled information about a single scene can be applied to images from multiple viewpoints. This property works particularly well with photorealistic renderings such as NeRF, where richly annotated data with many views is available. Specialized labeling tools are essential for labeling vision datasets, and both academic researchers and commercial entities have released such tools. Most existing labeling tools (such as Segment Anything Labeling Tool [6] and Roboflow [7]) use single images and therefore require significant human effort for annotating long sequences, use sequential data but have no geometric understanding so they cannot be used for annotating 6DOF poses [8], or require depth data to obtain geometric information [9, 10, 11, 12]. Our toolkit, NeuralLabeling, operates on sequences of images and can thus be used to more rapidly label large datasets, and by using depth reconstruction with NeRF [5] it does not rely on input depth data.
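The 3D-consistency argument above is what lets one annotation label many frames: a label attached in scene (world) coordinates can be reprojected into every calibrated view. The sketch below shows only that reprojection step with a pinhole camera model in NumPy; the intrinsics, poses, and the labeled point are made-up values, and everything NeRF-specific (training, rendering, occlusion handling) is omitted.

```python
# Sketch: propagate a single 3D annotation to many views via reprojection.
import numpy as np


def project(point_world, K, T_cam_world):
    """Project a 3D point given in the world frame into pixel coordinates of one camera."""
    p_h = np.append(point_world, 1.0)        # homogeneous world point
    p_cam = (T_cam_world @ p_h)[:3]          # world -> camera frame
    uvw = K @ p_cam
    return uvw[:2] / uvw[2]                  # perspective divide -> (u, v)


def toy_pose(angle, radius=1.0, height=0.5):
    """Toy camera pose circling the scene (rotation kept trivially simple for brevity)."""
    T = np.eye(4)
    T[:3, 3] = [-radius * np.cos(angle), -radius * np.sin(angle), height]
    return T


if __name__ == "__main__":
    K = np.array([[500.0, 0.0, 320.0],
                  [0.0, 500.0, 240.0],
                  [0.0, 0.0, 1.0]])
    labeled_point = np.array([0.1, -0.05, 2.0])   # one annotation in the world frame
    for i in range(4):                            # reuse it in every calibrated view
        print(f"view {i}: label lands at pixel {project(labeled_point, K, toy_pose(i * np.pi / 8))}")
```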
Learning to Dexterously Pick or Separate Tangled-Prone Objects for Industrial Bin Picking
Zhang, Xinyi, Domae, Yukiyasu, Wan, Weiwei, Harada, Kensuke
Industrial bin picking for tangled-prone objects requires the robot to either pick up untangled objects or perform separation manipulation when the bin contains no isolated objects. The robot must be able to flexibly perform appropriate actions based on the current observation. This is challenging due to high occlusion in the clutter, elusive entanglement phenomena, and the need for skilled manipulation planning. In this paper, we propose an autonomous, effective, and general approach for picking up tangled-prone objects in industrial bin picking. First, we learn PickNet, a network that maps the visual observation to pixel-wise possibilities of picking isolated objects or separating tangled objects and infers the corresponding grasp. Then, we propose two effective separation strategies: dropping the entangled objects into a buffer bin to reduce the degree of entanglement, and pulling to separate the entangled objects in the buffer bin as planned by PullNet, a network that predicts the position and direction for pulling from visual input.

Bin picking is a valuable task in manufacturing to automate the assembly process. It deploys robots to pick necessary objects from disorganized bins, rather than relying on human workers to arrange the objects or using a large number of part feeders. Existing studies have tackled some challenges in bin picking such as planning grasps under rich clutter. Other studies estimate the pose of each object and evaluate its entanglement level [12], [13]. Such a paradigm relies on full knowledge of the objects and may suffer from cumulative perception errors due to heavy occlusion or self-occlusion of an individual complex-shaped object. Still other studies use force and torque sensors to classify whether the robot has grasped multiple objects [14].
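One hedged way to picture how such pixel-wise outputs drive the robot: compare the peak scores of a "pick" map and a "separate" map, take the winning pixel as the grasp (or pull start) location, and, for separation, read a pull direction from a direction output. The sketch below uses random arrays in place of PickNet/PullNet predictions; the decision rule and output encodings are assumptions, not the papers' exact formulation.

```python
# Sketch: choose between picking and separating from pixel-wise score maps.
import numpy as np


def select_action(pick_map, sep_map, pull_dir_map):
    """pick_map/sep_map: (H, W) scores; pull_dir_map: (H, W, 2) unit pull directions."""
    pick_peak = np.unravel_index(np.argmax(pick_map), pick_map.shape)
    sep_peak = np.unravel_index(np.argmax(sep_map), sep_map.shape)
    if pick_map[pick_peak] >= sep_map[sep_peak]:
        return {"action": "pick", "pixel": pick_peak}
    v, u = sep_peak
    return {"action": "separate", "pixel": sep_peak,
            "pull_direction": pull_dir_map[v, u]}


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    h, w = 120, 160
    pick_map = rng.random((h, w))          # stand-in for PickNet's "pick" channel
    sep_map = rng.random((h, w))           # stand-in for PickNet's "separate" channel
    directions = rng.normal(size=(h, w, 2))
    directions /= np.linalg.norm(directions, axis=-1, keepdims=True)
    print(select_action(pick_map, sep_map, directions))
```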
A Closed-Loop Bin Picking System for Entangled Wire Harnesses using Bimanual and Dynamic Manipulation
Zhang, Xinyi, Domae, Yukiyasu, Wan, Weiwei, Harada, Kensuke
This paper addresses the challenge of industrial bin picking for entangled wire harnesses. Wire harnesses are essential in manufacturing but pose challenges for automation due to their complex geometries and propensity for entanglement. Our previous work tackled this issue by proposing a quasi-static pulling motion to separate the entangled wire harnesses; however, that approach still lacks effectiveness and generalization to various shapes and structures. In this paper, we deploy a dual-arm robot that can grasp, extract, and disentangle wire harnesses from dense clutter using dynamic manipulation. The robot can swing to dynamically discard the entangled objects and regrasp to adjust an undesirable grasp pose. To improve the robustness and accuracy of the system, we leverage a closed-loop framework that uses haptic feedback to detect entanglement in real time and flexibly adjust system parameters. Our bin picking system achieves an overall success rate of 91.2% in real-world experiments using two different types of long wire harnesses, demonstrating its effectiveness in handling various wire harnesses for industrial bin picking.
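As a simple illustration of the closed-loop idea, the sketch below monitors a force/torque signal during lifting and switches to a recovery action (dynamic swing or regrasp) when the measured load stays above a free-lift threshold, which is taken as a sign of entanglement. The sensor readings are simulated and the threshold rule is an assumption; the actual system's detector and parameters are not reproduced here.

```python
# Sketch: haptic-feedback loop that flags entanglement during lifting.
import random


def read_wrench_norm(entangled, rng):
    """Stand-in for an F/T sensor reading: entangled lifts drag extra load."""
    base = 4.0 + rng.gauss(0.0, 0.3)        # object + gripper weight [N]
    return base + (6.0 if entangled else 0.0)


def lift_and_monitor(entangled, free_lift_threshold=7.0, window=5, seed=1):
    rng = random.Random(seed)
    high_readings = 0
    for step in range(20):                   # monitor while the arm lifts
        f = read_wrench_norm(entangled, rng)
        high_readings = high_readings + 1 if f > free_lift_threshold else 0
        if high_readings >= window:          # sustained extra load -> likely entangled
            return "entangled: trigger swing / regrasp"
    return "free: transport to goal"


if __name__ == "__main__":
    print(lift_and_monitor(entangled=False))
    print(lift_and_monitor(entangled=True))
```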
Force Map: Learning to Predict Contact Force Distribution from Vision
Hanai, Ryo, Domae, Yukiyasu, Ramirez-Alpizar, Ixchel G., Leme, Bruno, Ogata, Tetsuya
When humans see a scene, they can roughly imagine the forces applied to objects based on their experience and use them to handle the objects properly. This paper considers transferring this "force-visualization" ability to robots. We hypothesize that a rough force distribution (named "force map") can be utilized for object manipulation strategies even if accurate force estimation is impossible. Based on this hypothesis, we propose a training method to predict the force map from vision. To investigate this hypothesis, we generated scenes where objects were stacked in bulk through simulation and trained a model to predict the contact force from a single image. We further applied domain randomization to make the trained model function on real images. The experimental results showed that the model trained using only synthetic images could predict approximate patterns representing the contact areas of the objects even for real images. Then, we designed a simple algorithm to plan a lifting direction using the predicted force distribution. We confirmed that using the predicted force distribution contributes to finding natural lifting directions for typical real-world scenes. Furthermore, the evaluation through simulations showed that the disturbance caused to surrounding objects was reduced by 26% (translation displacement) and by 39% (angular displacement) for scenes where objects were overlapping.
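To illustrate how a predicted force distribution could inform a lifting direction, the sketch below treats the force map as a weight image and lifts away from the side bearing the most predicted contact force: the in-plane offset from the object's pixel centroid to the force-weighted centroid, flipped and combined with an upward component. This heuristic and all values are assumptions for illustration, not the paper's planning algorithm.

```python
# Sketch: derive a lifting direction that avoids heavily loaded contact regions.
import numpy as np


def lifting_direction(force_map, object_mask, up_weight=1.0):
    """force_map: (H, W) predicted contact force; object_mask: (H, W) bool mask of the target."""
    ys, xs = np.nonzero(object_mask)
    obj_center = np.array([xs.mean(), ys.mean()])
    w = force_map * object_mask
    if w.sum() < 1e-9:                                 # no predicted contact: lift straight up
        return np.array([0.0, 0.0, 1.0])
    ys_w, xs_w = np.nonzero(w)
    force_center = np.average(np.stack([xs_w, ys_w], 1), axis=0, weights=w[ys_w, xs_w])
    away = obj_center - force_center                   # move away from the loaded side (image plane)
    direction = np.array([away[0], away[1], up_weight * np.hypot(*away) + 1e-6])
    return direction / np.linalg.norm(direction)


if __name__ == "__main__":
    force = np.zeros((100, 100))
    force[40:60, 60:80] = 1.0                          # load predicted on the object's right side
    mask = np.zeros((100, 100), dtype=bool)
    mask[30:70, 30:80] = True
    print(lifting_direction(force, mask))              # tilts left and up, away from the load
```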
Learning Efficient Policies for Picking Entangled Wire Harnesses: An Approach to Industrial Bin Picking
Zhang, Xinyi, Domae, Yukiyasu, Wan, Weiwei, Harada, Kensuke
Wire harnesses are essential connecting components in the manufacturing industry but are challenging to automate in industrial tasks such as bin picking. They are long and flexible and tend to get entangled when randomly placed in a bin, which makes it difficult for the robot to grasp a single one in dense clutter. Moreover, training or collecting data in simulation is challenging due to the difficulty of modeling the combination of deformable and rigid components in wire harnesses. In this work, instead of directly lifting wire harnesses, we propose to grasp and extract the target following a circle-like trajectory until it is untangled. We learn a policy from real-world data that can infer grasps and separation actions from visual observation. Our policy enables the robot to efficiently pick and separate entangled wire harnesses by maximizing success rates and reducing execution time. To evaluate our policy, we present a set of real-world experiments on picking wire harnesses. Our policy achieves an overall 84.6% success rate, compared with 49.2% for the baseline. We also evaluate the effectiveness of our policy under different clutter scenarios using unseen types of wire harnesses. The results suggest that our approach is feasible for handling wire harnesses in industrial bin picking.
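The circle-like extraction motion can be pictured as a sequence of end-effector waypoints along an arc that starts at the grasp point and curls up and away from the bin. The sketch below generates such waypoints in NumPy; the arc radius, sweep angle, plane, and number of waypoints are illustrative assumptions rather than the learned policy's actual parameters.

```python
# Sketch: waypoints for a circle-like extraction trajectory starting at the grasp point.
import numpy as np


def circular_extraction(grasp_xyz, radius=0.15, sweep_deg=120.0, n_waypoints=12):
    """Arc in the x-z plane: start at the grasp, then curl upward and away from the bin."""
    grasp = np.asarray(grasp_xyz, dtype=float)
    center = grasp + np.array([0.0, 0.0, radius])      # arc center directly above the grasp
    angles = np.deg2rad(np.linspace(0.0, sweep_deg, n_waypoints))
    # At angle 0 the waypoint coincides with the grasp point, then sweeps along the arc.
    offsets = np.stack([radius * np.sin(angles),
                        np.zeros_like(angles),
                        -radius * np.cos(angles)], axis=1)
    return center + offsets


if __name__ == "__main__":
    waypoints = circular_extraction([0.45, 0.0, 0.10])
    for p in waypoints[:4]:
        print(np.round(p, 3))
```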