Robotic Environmental State Recognition with Pre-Trained Vision-Language Models and Black-Box Optimization
Kawaharazuka, Kento, Obinata, Yoshiki, Kanazawa, Naoaki, Okada, Kei, Inaba, Masayuki
arXiv.org Artificial Intelligence
For example, the robot must recognize whether a door is open, a light is on, water is running, a fire is burning, and so on. Because the robot changes its behavior based on the recognition result, state recognition usually outputs a discrete value chosen from about two or three options. Until now, a separate method has been chosen for each state to be recognized, such as hand-programmed processing of images or point clouds [3, 4], creating an annotated dataset and training neural networks [5], or detecting the state by installing new sensors [6, 7]. However, these methods require us to manually program each state recognition process, to train neural networks one by one, or to install additional sensors. In addition, the number of programs and trained models grows with each state to be recognized, which complicates the management of source code and computing resources. To cope with these problems, a single program or model should be able to recognize multiple states. In this study, we propose a method to easily recognize various environmental states in a unified manner and through spoken language (Figure 1). To perform state recognition through spoken language, we use pre-trained large-scale vision-language models (VLMs) [8-12]. Currently, VLMs are being used for map generation [13, 14], scene understanding [15-17], and feature extraction for behav-
Sep-26-2024
- Country:
- Asia > Japan > Honshū
- Chūbu > Ishikawa Prefecture
- Kanazawa (0.05)
- Kantō > Tokyo Metropolis Prefecture
- Tokyo (0.14)
- Genre:
- Research Report > New Finding (0.50)
- Industry:
- Transportation > Air (0.42)
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning > Neural Networks (0.69)
- Natural Language (1.00)
- Robots (1.00)
- Vision (1.00)