ASHiTA: Automatic Scene-grounded HIerarchical Task Analysis

Chang, Yun, Fermoselle, Leonor, Ta, Duy, Bucher, Bernadette, Carlone, Luca, Wang, Jiuguang

Apr-14-2025–arXiv.org Artificial Intelligence

While recent work in scene reconstruction and understanding has made strides in grounding natural language to physical 3D environments, it is still challenging to ground abstract, high-level instructions to a 3D scene. High-level instructions might not explicitly invoke semantic elements in the scene, and even the process of breaking a high-level task into a set of more concrete subtasks (a process called hierarchical task analysis) is environment-dependent. In this work, we propose ASHiTA, the first framework that generates a task hierarchy grounded to a 3D scene graph by breaking down high-level tasks into grounded subtasks. ASHiTA alternates LLM-assisted hierarchical task analysis, which generates the task breakdown, with task-driven 3D scene graph construction, which generates a suitable representation of the environment. Our experiments show that ASHiTA performs significantly better than LLM baselines in breaking down high-level tasks into environment-dependent subtasks and additionally achieves grounding performance comparable to state-of-the-art methods.
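The alternation described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration, not ASHiTA's actual implementation: the hypothetical `MOCK_LLM_BREAKDOWN` table stands in for the LLM, a flat set of object labels stands in for the 3D scene graph, and `task_analysis`, `build_scene`, `ground`, and `alternate` are invented names. The loop proposes subtasks, builds a scene representation containing only the observed objects those subtasks need, re-runs the breakdown against that representation to drop subtasks the environment cannot support, and stops once the breakdown no longer changes.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Toy stand-ins (hypothetical, for illustration only; not ASHiTA's code).
# Object labels "observed" in the environment.
OBSERVED_OBJECTS = ["mug", "sink", "sponge", "sofa", "tv", "dish_rack"]

# Mock "LLM": maps a high-level task to candidate subtasks, each paired with
# the object labels it would need. A real system would query a language model.
MOCK_LLM_BREAKDOWN = {
    "clean up after breakfast": [
        ("wash the mug", ["mug", "sink", "sponge"]),
        ("put the mug away", ["mug", "dish_rack"]),
        ("vacuum the rug", ["vacuum", "rug"]),
    ],
}


@dataclass
class Subtask:
    description: str
    required: list[str]
    grounded_to: list[str] = field(default_factory=list)


@dataclass
class TaskHierarchy:
    goal: str
    subtasks: list[Subtask] = field(default_factory=list)


def task_analysis(goal: str, scene: set[str] | None) -> TaskHierarchy:
    """Propose subtasks; once a scene exists, keep only those it can support."""
    proposals = MOCK_LLM_BREAKDOWN.get(goal, [])
    subtasks = [
        Subtask(desc, req)
        for desc, req in proposals
        if scene is None or all(obj in scene for obj in req)
    ]
    return TaskHierarchy(goal, subtasks)


def build_scene(hierarchy: TaskHierarchy, observations: list[str]) -> set[str]:
    """Task-driven construction: keep only observed objects some subtask needs."""
    needed = {obj for st in hierarchy.subtasks for obj in st.required}
    return {obj for obj in observations if obj in needed}


def ground(hierarchy: TaskHierarchy, scene: set[str]) -> None:
    """Attach each subtask to the scene elements it requires."""
    for st in hierarchy.subtasks:
        st.grounded_to = sorted(obj for obj in st.required if obj in scene)


def alternate(goal: str, observations: list[str], max_iters: int = 5):
    """Alternate task analysis and scene construction until the breakdown is stable."""
    hierarchy = task_analysis(goal, None)  # first pass: no scene yet
    scene: set[str] = set()
    for _ in range(max_iters):
        scene = build_scene(hierarchy, observations)
        refined = task_analysis(goal, scene)
        unchanged = [s.description for s in refined.subtasks] == [
            s.description for s in hierarchy.subtasks
        ]
        hierarchy = refined
        if unchanged:
            break
    ground(hierarchy, scene)
    return hierarchy, scene


if __name__ == "__main__":
    hierarchy, scene = alternate("clean up after breakfast", OBSERVED_OBJECTS)
    for st in hierarchy.subtasks:
        print(f"{st.description} -> {st.grounded_to}")
```

With these toy inputs, the "vacuum the rug" subtask is dropped after the first round because no vacuum or rug was observed, and the two remaining subtasks are grounded to the observed objects they require. The stopping rule (breakdown unchanged between rounds) is likewise an assumption chosen to keep the sketch simple.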

large language model, machine learning, natural language

Country:
- North America > United States (0.46)

Genre:
- Research Report (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Vision (1.00)
  - Robots (1.00)
  - Representation & Reasoning (1.00)
  - Natural Language > Large Language Model (1.00)
  - Machine Learning (1.00)