GAgent: An Adaptive Rigid-Soft Gripping Agent with Vision Language Models for Complex Lighting Environments

Zhuowei Li, Miao Zhang, Xiaotian Lin, Meng Yin, Shuai Lu, Xueqian Wang

arXiv.org Artificial Intelligence 

In recent years, grasping with unmanned aerial vehicles (UAVs) has emerged as a trending research direction [1, 2]. However, grasping scenes in the open world are highly complex, which motivates robotic grasping systems with advanced cognitive abilities and adaptable grasping capabilities. To achieve high-level cognition, reinforcement learning (RL) for embodied agents has been studied [3, 4]. In [3], scalable deep reinforcement learning is used to handle large amounts of off-policy image data for complex tasks such as grasping; on-policy RL algorithms, by contrast, face even greater sample-efficiency challenges. Nevertheless, RL-based embodiment struggles with generalization, sample efficiency, and deep reasoning, especially in dynamic and uncertain real-world environments. Recently, large multimodal models (LMMs), such as MiniGPT-4 [5] and LLaVA [6], have exhibited impressive performance in natural-language instruction following and visual cognition. LMMs have therefore been integrated with the physical world in embodied agents. Unlike RL policies trained for specific tasks, LMM-based agents offer generalization capabilities [7, 8] through fine-tuning methods such as human demonstrations [9], vision-language cross-modal connectors [10], and ever-growing skill libraries [11].
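To make the cross-modal connector idea [10] concrete, the sketch below shows a minimal LLaVA-style projector: a small learned module that maps patch features from a frozen vision encoder into the language model's token-embedding space, so image patches can be fed to the LLM as ordinary "visual tokens". This is an illustrative assumption about the general technique, not the configuration used in this paper; the dimensions, the two-layer MLP design, and the class name CrossModalConnector are all hypothetical choices.

```python
import torch
import torch.nn as nn


class CrossModalConnector(nn.Module):
    """Illustrative vision-language connector (LLaVA-style projector)."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Two-layer MLP projector (as in LLaVA-1.5); a single linear layer
        # (original LLaVA) is an equally valid minimal choice.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim) from a frozen
        # vision encoder such as a CLIP ViT. Returns visual tokens of shape
        # (batch, num_patches, llm_dim) to be prepended to text embeddings.
        return self.proj(patch_features)


if __name__ == "__main__":
    connector = CrossModalConnector()
    feats = torch.randn(1, 576, 1024)   # e.g., 24x24 patches from a ViT
    visual_tokens = connector(feats)
    print(visual_tokens.shape)          # torch.Size([1, 576, 4096])
```

During fine-tuning, only this projector (and optionally the LLM) is updated while the vision encoder stays frozen, which is what makes the connector a lightweight path to grounding an LMM in the physical world.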
