AITopics | humanvla

HumanVLA: Towards Vision-Language Directed Object Rearrangement by Physical Humanoid

Neural Information Processing Systems | Mar-18-2026, 23:10:12 GMT

Physical Human-Scene Interaction (HSI) plays a crucial role in numerous applications. However, existing HSI techniques are limited to specific object dynamics and privileged information, which prevents the development of more comprehensive applications. To address this limitation, we introduce HumanVLA for general object rearrangement directed by practical vision and language. A teacher-student framework is utilized to develop HumanVLA. A state-based teacher policy is trained first using goal-conditioned reinforcement learning and adversarial motion prior. Then, it is distilled into a vision-language-action model via behavior cloning. We propose several key insights to facilitate the large-scale learning process. To support general object rearrangement by physical humanoid, we introduce a novel Human-in-the-Room dataset encompassing various rearrangement tasks. Through extensive experiments and analysis, we demonstrate the effectiveness of our approach.

artificial intelligence, machine learning, proceedings, (5 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.79)

215aeb07b5996c969c0123c3c6ee8f54-Paper-Conference.pdf

Neural Information Processing Systems | Feb-9-2026, 08:54:42 GMT

dataset, humanvla, rearrangement, (17 more...)

Neural Information Processing Systems

Country:

North America > United States (0.04)
Asia > China > Shanghai > Shanghai (0.04)
Asia > Middle East > Jordan (0.04)
Asia > Japan > Honshū > Chūbu > Ishikawa Prefecture > Kanazawa (0.04)

Genre: Research Report > Experimental Study (0.93)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(2 more...)

HumanVLA: Towards Vision-Language Directed Object Rearrangement by Physical Humanoid

Neural Information Processing Systems | Oct-9-2025, 20:46:10 GMT

Physical Human-Scene Interaction (HSI) plays a crucial role in numerous applications.

dataset, humanvla, rearrangement, (17 more...)

Neural Information Processing Systems

Country:

North America > United States (0.04)
Asia > China > Shanghai > Shanghai (0.04)
Asia > Middle East > Jordan (0.04)
Asia > Japan > Honshū > Chūbu > Ishikawa Prefecture > Kanazawa (0.04)

Genre: Research Report > Experimental Study (0.93)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(2 more...)

HumanoidVerse: A Versatile Humanoid for Vision-Language Guided Multi-Object Rearrangement

Zhang, Haozhuo, Sun, Jingkai, Caprio, Michele, Tang, Jian, Zhang, Shanghang, Zhang, Qiang, Pan, Wei

arXiv.org Artificial Intelligence | Aug-26-2025

We introduce HumanoidVerse, a novel framework for vision-language guided humanoid control that enables a single physically simulated robot to perform long-horizon, multi-object rearrangement tasks across diverse scenes. Unlike prior methods that operate in fixed settings with single-object interactions, our approach supports consecutive manipulation of multiple objects, guided only by natural language instructions and egocentric camera RGB observations. HumanoidVerse is trained via a multi-stage curriculum using a dual-teacher distillation pipeline, enabling fluid transitions between sub-tasks without requiring environment resets. To support this, we construct a large-scale dataset comprising 350 multi-object tasks spanning four room layouts. Extensive experiments in the Isaac Gym simulator demonstrate that our method significantly outperforms prior state-of-the-art in both task success rate and spatial precision, and generalizes well to unseen environments and instructions. Our work represents a key step toward robust, general-purpose humanoid agents capable of executing complex, sequential tasks under real-world sensory constraints. The video visualization results can be found on the project page: https://haozhuo-zhang.github.io/HumanoidVerse-project-page/.

artificial intelligence, instruction, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2508.16943

Genre: Research Report (0.50)

Industry: Education (0.68)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

HumanVLA: Towards Vision-Language Directed Object Rearrangement by Physical Humanoid

Neural Information Processing Systems | May-26-2025, 18:33:36 GMT

Physical Human-Scene Interaction (HSI) plays a crucial role in numerous applications. However, existing HSI techniques are limited to specific object dynamics and privileged information, which prevents the development of more comprehensive applications. To address this limitation, we introduce HumanVLA for general object rearrangement directed by practical vision and language. A teacher-student framework is utilized to develop HumanVLA. A state-based teacher policy is trained first using goal-conditioned reinforcement learning and adversarial motion prior.

artificial intelligence, machine learning, reinforcement learning, (4 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.30)

HumanVLA: Towards Vision-Language Directed Object Rearrangement by Physical Humanoid

Xu, Xinyu, Zhang, Yizheng, Li, Yong-Lu, Han, Lei, Lu, Cewu

arXiv.org Artificial Intelligence | Jun-28-2024

Physical Human-Scene Interaction (HSI) plays a crucial role in numerous applications. However, existing HSI techniques are limited to specific object dynamics and privileged information, which prevents the development of more comprehensive applications. To address this limitation, we introduce HumanVLA for general object rearrangement directed by practical vision and language. A teacher-student framework is utilized to develop HumanVLA. A state-based teacher policy is trained first using goal-conditioned reinforcement learning and adversarial motion prior. Then, it is distilled into a vision-language-action model via behavior cloning. We propose several key insights to facilitate the large-scale learning process. To support general object rearrangement by physical humanoid, we introduce a novel Human-in-the-Room dataset encompassing various rearrangement tasks. Through extensive experiments and analysis, we demonstrate the effectiveness of the proposed approach.

humanoid, humanvla, rearrangement, (17 more...)

arXiv.org Artificial Intelligence

2406.19972

Country:

North America > United States > New York > New York County > New York City (0.04)
Asia > Middle East > Jordan (0.04)
Asia > Japan > Honshū > Chūbu > Ishikawa Prefecture > Kanazawa (0.04)
Asia > China > Shanghai > Shanghai (0.04)

Genre: Research Report (0.82)

Industry: Media > Film (0.34)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
