Mobile Manipulation Instruction Generation from Multiple Images with Automatic Metric Enhancement

Kei Katsumata, Motonari Kambara, Daichi Yashima, Ryosuke Korekata, Komei Sugiura

arXiv.org Artificial Intelligence 

Abstract-- We consider the problem of generating free-form mobile manipulation instructions based on a target object image and a receptacle image. Conventional image captioning models are unable to generate appropriate instructions because their architectures are typically optimized for a single image; hence, these methods are inappropriate for generating mobile manipulation instructions based on multiple images, and models are required to handle both images appropriately. In this study, we propose a model that handles both the target object and the receptacle to generate free-form instruction sentences for mobile manipulation tasks. Moreover, we introduce a novel training method that effectively incorporates the scores from both learning-based and n-gram-based automatic evaluation metrics as rewards. This method enables the model to learn the co-occurrence relationships between words and appropriate paraphrases.

Service robots are essential in a variety of contexts, such as elderly care facilities and daily support for people with disabilities. In particular, the integration of service robots in elderly care facilities significantly reduces the burden on caregivers and addresses the growing demand driven by the rise in the elderly population. We propose a model that generates mobile manipulation instructions using a target object image and a receptacle image.
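The abstract describes a training method that mixes a learning-based metric and an n-gram-based metric into a single reward signal. The paper's exact metrics and mixing rule are not specified in this excerpt, so the sketch below is only illustrative: it uses a simple smoothed sentence-level BLEU as the n-gram-based score, a token-overlap F1 as a stand-in for a learned metric (a real system would query a trained scoring model), and a weighted sum with a hypothetical weight `alpha`. The self-critical advantage at the end shows one common way such a reward is used in reinforcement-learning-style caption training.

```python
from collections import Counter
import math

def ngram_precision(hyp, ref, n):
    """Clipped n-gram precision between two token lists."""
    hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    if not hyp_ngrams:
        return 0.0
    overlap = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
    return overlap / sum(hyp_ngrams.values())

def bleu(hyp, ref, max_n=4):
    """Smoothed sentence-level BLEU with brevity penalty (the n-gram-based metric)."""
    precisions = []
    for n in range(1, max_n + 1):
        p = ngram_precision(hyp, ref, n)
        precisions.append(p if p > 0 else 1e-9)  # crude smoothing for zero counts
    log_avg = sum(math.log(p) for p in precisions) / max_n
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(log_avg)

def learned_metric(hyp, ref):
    """Placeholder for a learning-based metric: token-overlap F1 stands in
    for a trained quality estimator, which a real pipeline would call here."""
    hyp_set, ref_set = set(hyp), set(ref)
    if not hyp_set or not ref_set:
        return 0.0
    inter = len(hyp_set & ref_set)
    p, r = inter / len(hyp_set), inter / len(ref_set)
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

def combined_reward(hyp, ref, alpha=0.5):
    """Mix the n-gram and learning-based scores into one scalar reward;
    alpha is a hypothetical weighting, not a value from the paper."""
    return alpha * bleu(hyp, ref) + (1 - alpha) * learned_metric(hyp, ref)

# Self-critical-style advantage: sampled caption's reward minus a greedy baseline's.
ref = "put the bottle on the shelf".split()
sampled = "place the bottle onto the shelf".split()
greedy = "put bottle shelf".split()
advantage = combined_reward(sampled, ref) - combined_reward(greedy, ref)
```

Because the learned component rewards semantic overlap even when surface n-grams differ, a paraphrase such as "place ... onto" can still earn a substantial reward, which is consistent with the abstract's claim that the method helps the model learn appropriate paraphrases.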