Koike, Hideki
Modality-Driven Design for Multi-Step Dexterous Manipulation: Insights from Neuroscience
Wake, Naoki, Kanehira, Atsushi, Saito, Daichi, Takamatsu, Jun, Sasabuchi, Kazuhiro, Koike, Hideki, Ikeuchi, Katsushi
Multi-step dexterous manipulation is a fundamental skill in household scenarios, yet it remains an underexplored area in robotics. This paper proposes a modular approach in which each step of the manipulation process is addressed by a dedicated policy driven by its most effective input modality, rather than by a single end-to-end model. To demonstrate this, a dexterous robotic hand performs a manipulation task involving picking up and rotating a box. Guided by insights from neuroscience, the task is decomposed into three sub-skills: 1) reaching, 2) grasping and lifting, and 3) in-hand rotation, based on the dominant sensory modalities employed in the human brain. Each sub-skill is addressed with a distinct, practically motivated method: a classical controller, a Vision-Language-Action model, and a reinforcement learning policy with force feedback, respectively. We tested the pipeline on a real robot to demonstrate the feasibility of our approach. The key contribution of this study lies in presenting a neuroscience-inspired, modality-driven methodology for multi-step dexterous manipulation.
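As a rough illustration of this modality-driven decomposition, the sketch below shows one way such a sequential pipeline could be wired in Python. All names here (SubSkill, run_pipeline, the placeholder policies and termination checks) are hypothetical and are not taken from the paper; each sub-skill simply receives only the observation modalities it relies on.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

@dataclass
class SubSkill:
    name: str
    modalities: List[str]                    # e.g. ["vision"] or ["force"]
    policy: Callable[[Dict[str, Any]], Any]  # maps an observation subset to an action
    done: Callable[[Dict[str, Any]], bool]   # termination condition for this sub-skill

def run_pipeline(skills: List[SubSkill], observe, act, max_steps: int = 1000) -> None:
    """Run the sub-skills sequentially, feeding each only its dominant modalities."""
    for skill in skills:
        for _ in range(max_steps):
            obs = observe()                  # full observation dictionary
            if skill.done(obs):
                break
            obs_subset = {m: obs[m] for m in skill.modalities}
            act(skill.policy(obs_subset))

# Example wiring (the three callables stand in for the classical controller,
# the Vision-Language-Action model, and the force-feedback RL policy):
# skills = [
#     SubSkill("reach",  ["vision"],             reach_controller,   reach_done),
#     SubSkill("grasp",  ["vision", "language"], vla_policy,         grasp_done),
#     SubSkill("rotate", ["force"],              rotation_rl_policy, rotation_done),
# ]
# run_pipeline(skills, env.observe, env.apply_action)
```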
APriCoT: Action Primitives based on Contact-state Transition for In-Hand Tool Manipulation
Saito, Daichi, Kanehira, Atsushi, Sasabuchi, Kazuhiro, Wake, Naoki, Takamatsu, Jun, Koike, Hideki, Ikeuchi, Katsushi
In-hand tool manipulation is an operation that not only manipulates a tool within the hand (i.e., in-hand manipulation) but also achieves a grasp suitable for a task after the manipulation. This study aims to acquire such an in-hand tool manipulation skill through deep reinforcement learning. Learning the skill is difficult because the manipulation requires (A) exploring long-term contact-state changes to achieve the desired grasp and (B) highly varied motions depending on the contact-state transition. (A) leads to reward sparsity, since a reward is obtained only when the desired grasp is achieved, and (B) forces the RL agent to explore widely in the state-action space to learn highly varied actions, leading to sample inefficiency. To address these issues, this study proposes Action Primitives based on Contact-state Transition (APriCoT). APriCoT decomposes the manipulation into short-term action primitives by describing the operation as a contact-state transition composed of three action representations (detach, crossover, attach). Within each action primitive, the fingers are required to perform only short-term, similar actions. By training a policy for each primitive, we mitigate the issues arising from (A) and (B). This study focuses on a fundamental operation as an example of in-hand tool manipulation: rotating an elongated object grasped with a precision grasp by half a turn so as to recover the initial grasp. Experimental results demonstrated that, unlike existing methods, ours succeeded in both the rotation and the achievement of the desired grasp. The learned policies were also found to be robust to changes in object shape.
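The following sketch illustrates, under stated assumptions, how a planned contact-state transition could be executed as a sequence of short-term primitives, each handled by its own trained policy. The environment interface (set_subgoal, step) and the plan format are hypothetical; only the three primitive types come from the abstract.

```python
from typing import Any, Callable, Dict, List, Tuple

PRIMITIVE_TYPES = ("detach", "crossover", "attach")   # the three action representations

def execute_plan(plan: List[Tuple[str, Dict[str, Any]]],
                 policies: Dict[str, Callable],
                 env,
                 steps_per_primitive: int = 200) -> None:
    """Execute a contact-state transition as a sequence of short-term primitives,
    running the policy trained for each primitive until its sub-goal is reached."""
    for primitive, params in plan:
        assert primitive in PRIMITIVE_TYPES
        policy = policies[primitive]
        obs = env.set_subgoal(params)          # hypothetical: expose the primitive's sub-goal
        for _ in range(steps_per_primitive):
            obs, done = env.step(policy(obs, params))
            if done:                           # the primitive's contact sub-goal is reached
                break
```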
Constraint-aware Policy for Compliant Manipulation
Saito, Daichi, Sasabuchi, Kazuhiro, Wake, Naoki, Kanehira, Atsushi, Takamatsu, Jun, Koike, Hideki, Ikeuchi, Katsushi
Robot manipulation in a physically constrained environment requires compliant manipulation, a skill in which the hand motion is adjusted according to the force imposed by the environment. Recently, reinforcement learning (RL) has been applied to household operations involving compliant manipulation. However, previous RL methods have primarily focused on designing a policy for one specific operation, which limits their applicability and requires separate training for every new operation. We propose a constraint-aware policy that is applicable to various unseen manipulations by grouping several manipulations together based on the type of physical constraint involved. The type of physical constraint determines the characteristic direction of the imposed force; thus, a generalized policy can be trained with an environment and reward designed on the basis of this characteristic. This paper focuses on two types of physical constraints: prismatic and revolute joints. Experiments demonstrated that the same policy could successfully execute various compliant-manipulation operations, both in simulation and in reality. We believe this study is a first step toward realizing a generalized household robot.
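To make the idea of a reward shaped by the constraint-induced force direction concrete, here is a minimal sketch of one possible penalty term. The exact formulation is an assumption for illustration, not the paper's reward; it only captures the intuition that both prismatic (drawer-like) and revolute (door-like) constraints produce a reaction force the hand should not fight.

```python
import numpy as np

def constraint_penalty(reaction_force: np.ndarray,
                       hand_velocity: np.ndarray,
                       eps: float = 1e-8) -> float:
    """Penalty that discourages moving the hand against the reaction force
    imposed by the constraint (prismatic or revolute alike).

    reaction_force: 3-D force measured at the wrist/end-effector.
    hand_velocity:  3-D translational velocity of the hand.
    """
    f_norm = float(np.linalg.norm(reaction_force))
    v_norm = float(np.linalg.norm(hand_velocity))
    if f_norm < eps or v_norm < eps:
        return 0.0
    alignment = float(np.dot(hand_velocity / v_norm, reaction_force / f_norm))
    # alignment < 0 means the hand is pushing into the constraint; scale the
    # penalty by the force magnitude so larger violations cost more.
    return min(alignment, 0.0) * f_norm
```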
Text-driven object affordance for guiding grasp-type recognition in multimodal robot teaching
Wake, Naoki, Saito, Daichi, Sasabuchi, Kazuhiro, Koike, Hideki, Ikeuchi, Katsushi
This study investigates how text-driven object affordance, which provides prior knowledge about likely grasp types for each object, affects image-based grasp-type recognition in robot teaching. The researchers created labeled datasets of first-person hand images to examine the impact of object affordance on recognition performance. They evaluated scenarios with real and illusory objects, considering mixed-reality teaching conditions in which visual object information may be limited. The results demonstrate that object affordance improves image-based recognition by filtering out unlikely grasp types and emphasizing likely ones. The effect was more pronounced when there was a stronger bias toward specific grasp types for each object. These findings highlight the significance of object affordance in multimodal robot teaching, regardless of whether real objects are present in the images. Sample code is available at https://github.com/microsoft/arr-grasp-type-recognition.
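A minimal sketch of the filtering-and-reweighting idea described in the abstract is shown below: image-based grasp-type probabilities are multiplied by a per-object affordance prior and renormalized. The label set and function names are illustrative assumptions, and this is not taken from the linked repository.

```python
import numpy as np

GRASP_TYPES = ["power", "precision", "lateral", "hook"]   # illustrative label set

def apply_affordance(image_probs: np.ndarray,
                     affordance_prior: np.ndarray) -> np.ndarray:
    """Re-weight image-based grasp-type probabilities with a text-driven prior.

    image_probs:      softmax output of the image-based recognizer, shape (K,).
    affordance_prior: per-object prior over the same K grasp types (zeros filter
                      out grasp types the object does not afford).
    """
    fused = image_probs * affordance_prior
    total = fused.sum()
    if total == 0.0:                 # prior and recognizer fully disagree
        return image_probs           # fall back to the image-based estimate
    return fused / total

# e.g. an object whose text-driven affordance strongly favors power/precision grasps:
# prior = np.array([0.6, 0.35, 0.05, 0.0])
# fused = apply_affordance(recognizer_output, prior)
```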
Tracker: Model-based Reinforcement Learning for Tracking Control of Human Finger Attached with Thin McKibben Muscles
Saito, Daichi, Nagatomo, Eri, Pardomuan, Jefferson, Koike, Hideki
For a soft hand exoskeleton to support activities of daily living, the exoskeleton must control the finger joints precisely. The problem of controlling joints so that they follow a given trajectory is called the tracking control problem. In this study, we focus on the tracking control of a human finger with thin McKibben muscles attached. Achieving precise control with thin McKibben muscles faces two problems: the complex characteristics of the muscles, such as non-linearity, hysteresis, and real-world uncertainties, and the difficulty of obtaining a precise model of the muscles and human fingers. To address these problems, we adopted DreamerV2, a model-based reinforcement learning method; however, the target trajectory cannot be generated by the learned model. We therefore propose Tracker, an extension of DreamerV2 for the tracking control problem. In our experiments, Tracker achieved approximately 81% smaller error than a PID controller when controlling a two-link manipulator that imitates the part of the human index finger from the metacarpal bone to the proximal bone. Tracker also controlled the third joint of a human index finger with small error after approximately 60 minutes of training. In addition, after the thin McKibben muscles were detached and re-attached, fine-tuning the policy pre-trained on the user's finger recovered almost the same accuracy as before detachment in approximately 15 minutes, less than the time required for the initial training.
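As a hedged sketch of the tracking-control setup, one common way to extend a Dreamer-style agent for trajectory tracking is to append the current reference to the observation and define the reward as the negative tracking error; the snippet below illustrates that idea only, and the actual Tracker formulation may differ.

```python
import numpy as np

def make_tracking_observation(joint_angles: np.ndarray,
                              muscle_pressures: np.ndarray,
                              target_angles: np.ndarray) -> np.ndarray:
    """Concatenate proprioceptive state with the current reference so the learned
    world model and policy can condition on the target trajectory."""
    return np.concatenate([joint_angles, muscle_pressures, target_angles])

def tracking_reward(joint_angles: np.ndarray,
                    target_angles: np.ndarray) -> float:
    """Negative mean absolute tracking error (an illustrative choice)."""
    return -float(np.mean(np.abs(joint_angles - target_angles)))
```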
Masked Lip-Sync Prediction by Audio-Visual Contextual Exploitation in Transformers
Sun, Yasheng, Zhou, Hang, Wang, Kaisiyuan, Wu, Qianyi, Hong, Zhibin, Liu, Jingtuo, Ding, Errui, Wang, Jingdong, Liu, Ziwei, Koike, Hideki
Previous studies have explored generating accurately lip-synced talking faces for arbitrary targets given audio conditions. However, most of them deform or regenerate the whole facial area, leading to non-realistic results. In this work, we delve into the formulation of altering only the mouth shape of the target person. This requires masking a large percentage of the original image and seamlessly inpainting it with the aid of audio and reference frames. To this end, we propose the Audio-Visual Context-Aware Transformer (AV-CAT) framework, which produces accurate lip-sync with photo-realistic quality by predicting the masked mouth shapes. Our key insight is to thoroughly exploit the contextual information provided by the audio and visual modalities with delicately designed Transformers. Specifically, we propose a convolution-Transformer hybrid backbone and design an attention-based fusion strategy for filling in the masked parts, which uniformly attends to the textural information of the unmasked regions and the reference frame. The semantic audio information is then incorporated to enhance the self-attention computation. Additionally, a refinement network with audio injection improves both image and lip-sync quality. Extensive experiments validate that our model can generate high-fidelity lip-synced results for arbitrary subjects.
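The schematic sketch below shows one plausible form of the attention-based fusion described in the abstract, where masked-mouth queries attend to unmasked-region and reference-frame tokens and audio features are injected before self-attention. The module structure, dimensions, and token shapes are assumptions for illustration, not the released AV-CAT architecture.

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """Illustrative audio-visual fusion block (assumed structure)."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.audio_proj = nn.Linear(dim, dim)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, masked_tokens, context_tokens, audio_tokens):
        # masked_tokens:  (B, Nm, dim) queries for the masked mouth region
        # context_tokens: (B, Nc, dim) unmasked-region and reference-frame tokens
        # audio_tokens:   (B, Nm, dim) audio features aligned to the mouth tokens
        x, _ = self.cross_attn(masked_tokens, context_tokens, context_tokens)
        x = self.norm1(x + masked_tokens)
        # inject audio before self-attention so the inpainted lips follow the speech
        x = x + self.audio_proj(audio_tokens)
        y, _ = self.self_attn(x, x, x)
        return self.norm2(x + y)
```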