GAgent: An Adaptive Rigid-Soft Gripping Agent with Vision Language Models for Complex Lighting Environments
Zhuowei Li, Miao Zhang, Xiaotian Lin, Meng Yin, Shuai Lu, Xueqian Wang
–arXiv.org Artificial Intelligence
In recent years, the use of unmanned aerial vehicles (UAVs) for grasping has emerged as a trending research direction [1, 2]. However, grasping scenes in the open world are highly complex, which calls for robotic grasping systems with advanced cognition and adaptable grasping capabilities. To achieve high-level cognitive abilities, reinforcement learning (RL) for embodied agents has been studied [3, 4]. In [3], scalable deep reinforcement learning is used to handle large amounts of off-policy image data for complex tasks such as grasping, whereas on-policy RL algorithms remain challenged by sample efficiency. Even so, RL-based embodiment struggles with generalization, sample efficiency, and deep reasoning, especially in dynamic and uncertain real-world environments. Recently, large multimodal models (LMMs), such as MiniGPT-4 [5] and LLaVA [6], have exhibited impressive performance in natural instruction following and visual cognition, and have therefore been integrated with the physical world in embodied agents. Unlike RL algorithms trained for specific tasks, LMM-based agents gain generalization capabilities [7, 8] through fine-tuning methods such as human demonstrations [9], a vision-language cross-modal connector [10], and an ever-growing skill library [11].
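The vision-language cross-modal connector mentioned above [10] is, in LLaVA-style designs, a small projection network that maps features from a frozen vision encoder into the language model's token-embedding space. Below is a minimal PyTorch sketch of such a projector; the class name, dimensions, and two-layer MLP layout follow LLaVA-1.5 conventions and are illustrative assumptions, not the cited paper's exact configuration.

```python
import torch
import torch.nn as nn

class CrossModalConnector(nn.Module):
    """Projects vision-encoder patch features into the LLM's token-embedding
    space, in the spirit of LLaVA-style projectors [6, 10]. The dimensions
    here are assumptions for illustration only."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Two-layer MLP projector (LLaVA-1.5-style); LLaVA-1.0 used a
        # single linear layer instead.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim) from a frozen
        # vision encoder such as a CLIP ViT. The projected tokens are
        # concatenated with text embeddings before the LLM forward pass.
        return self.proj(patch_features)

if __name__ == "__main__":
    connector = CrossModalConnector()
    feats = torch.randn(1, 576, 1024)   # e.g. 24x24 patches of a 336px image
    tokens = connector(feats)
    print(tokens.shape)                 # torch.Size([1, 576, 4096])
```

Keeping the vision encoder and LLM frozen while training only this connector is what makes such fine-tuning far cheaper than end-to-end RL training for a specific task.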
Mar-16-2024
- Genre:
  - Research Report (0.64)
- Industry:
  - Health & Medicine
    - Consumer Health (0.34)
    - Therapeutic Area (0.55)