AERMANI-VLM: Structured Prompting and Reasoning for Aerial Manipulation with Vision Language Models

Mishra, Sarthak, Yadav, Rishabh Dev, Das, Avirup, Gupta, Saksham, Pan, Wei, Roy, Spandan

Nov-4-2025–arXiv.org Artificial Intelligence

This reasoning-action loop continues until task completion, enabling the VLM to focus on semantic reasoning while delegating precise execution to robust controllers. The framework is evaluated in simulation and real-world experiments using a pretrained VLM, and comprehensive comparison and ablation studies are carried out to verify its performance. CLIPSeg [12] is used for prompt-based segmentation, maintaining a unified prompting pipeline from perception to reasoning.

A. Additional Related Works

Aerial manipulation has progressed from vision-guided approaches relying on onboard cameras and artificial visual cues [13], to fully markerless grasping systems using onboard perception [14], and more recently to end-effector-centric frameworks for versatile manipulation [15], yet all remain focused on execution rather than language-level reasoning. In parallel, VLAs [2]-[5] combine LLM-based planning [16], [17] with perceptual grounding from models such as CLIP [18], CLIPort [19], and LLaVA [20], but their end-to-end policies are data-intensive and prone to unsafe behaviors from ambiguous outputs or adversarial prompts. This motivates hybrid approaches where reasoning is decoupled from execution via modular skill primitives [21], [22]. For multirotors specifically, foundation model research has focused on mission planning [23], spatial reasoning [24], and direct control [25]. This work advances locomotion but does not extend to aerial manipulation, which requires exploration coupled with grasping and placement [26].

In summary, control-focused aerial manipulation, reasoning-focused VLAs, and navigation-focused UAV-VLN each address parts of the problem, but none unify perception, reasoning, and execution for aerial manipulation. Together, these limitations motivate AERMANI-VLM, which unifies open-vocabulary perception, structured reasoning, and safe skill execution for aerial manipulation.
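The reasoning-action loop and the decoupling of reasoning from execution described above can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and does not come from the paper: the `WorldState` fields, the `grasp` and `place` skills, the `stub_reasoner` stand-in for the VLM, and the `max_steps` limit are all hypothetical. The sketch only shows the general pattern of a reasoner choosing from a fixed skill library until it reports completion.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class WorldState:
    """Toy stand-in for the robot/environment state (hypothetical)."""
    holding: bool = False
    placed: bool = False
    log: List[str] = field(default_factory=list)


def grasp(state: WorldState) -> None:
    """Hypothetical skill primitive: pick up the target object."""
    state.holding = True
    state.log.append("grasp")


def place(state: WorldState) -> None:
    """Hypothetical skill primitive: put down a held object."""
    if state.holding:
        state.holding = False
        state.placed = True
    state.log.append("place")


# Fixed library of skills; the reasoner may only select from these names.
SKILLS: Dict[str, Callable[[WorldState], None]] = {"grasp": grasp, "place": place}


def stub_reasoner(state: WorldState) -> str:
    """Stand-in for the VLM: returns a skill name, or 'done' when finished."""
    if state.placed:
        return "done"
    return "place" if state.holding else "grasp"


def run_loop(
    reasoner: Callable[[WorldState], str],
    skills: Dict[str, Callable[[WorldState], None]],
    state: WorldState,
    max_steps: int = 10,
) -> bool:
    """Alternate reasoning and execution until 'done' or the step limit."""
    for _ in range(max_steps):
        action = reasoner(state)
        if action == "done":
            return True
        skill = skills.get(action)
        if skill is None:
            # Output outside the skill library is never executed.
            state.log.append(f"rejected:{action}")
            continue
        skill(state)
    return False


if __name__ == "__main__":
    state = WorldState()
    finished = run_loop(stub_reasoner, SKILLS, state)
    print(finished, state.log)
```

In this sketch the reasoner never issues low-level commands; it can only name a skill, and any name outside the library is logged and skipped rather than executed. That is one simple way to express the separation between semantic reasoning and execution, not a description of how AERMANI-VLM implements it.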
