Egocentric Instruction-oriented Affordance Prediction via Large Multimodal Model

Ji, Bokai, Gu, Jie, Ma, Xiaokang, Tang, Chu, Chen, Jingmin, Li, Guangxia

Aug-26-2025–arXiv.org Artificial Intelligence

Affordance is crucial for intelligent robots in the context of object manipulation. In this paper, we argue that affordance should be task-/instruction-dependent, which is overlooked by many previous works. That is, different instructions can lead to different manipulation regions and directions even for the same object. According to this observation, we present a new dataset comprising fifteen thousand object-instruction-affordance triplets. All scenes in the dataset are from an egocentric viewpoint, designed to approximate the perspective of a human-like robot. Furthermore, we investigate how to enable large multimodal models (LMMs) to serve as affordance predictors by implementing a ``search against verifiers'' pipeline. An LMM is asked to progressively predict affordances, with the output at each step being verified by itself during the iterative process, imitating a reasoning process. Experiments show that our method not only unlocks new instruction-oriented affordance prediction capabilities, but also achieves outstanding performance broadly.

large language model, machine learning, natural language, (22 more...)

arXiv.org Artificial Intelligence

Aug-26-2025

arXiv.org PDF

Add feedback

Genre:
- Research Report (0.50)
- Workflow (0.48)

Technology:
- Information Technology > Artificial Intelligence
  - Robots (1.00)
  - Natural Language > Large Language Model (0.97)
  - Machine Learning > Neural Networks
    - Deep Learning (0.96)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found