UniHM: Universal Human Motion Generation with Object Interactions in Indoor Scenes

Geng, Zichen, Hayder, Zeeshan, Liu, Wei, Mian, Ajmal

arXiv.org Artificial Intelligence 

Figure 1: Text-to-Motion sequences (left) and Text-to-HOI sequences (right) generated by our approach.

Abstract: Human motion synthesis in complex scenes presents a fundamental challenge, extending beyond conventional Text-to-Motion tasks by requiring the integration of diverse modalities such as static environments, movable objects, natural language prompts, and spatial waypoints. Existing language-conditioned motion models often struggle with scene-aware motion generation due to limitations in motion tokenization, which leads to information loss and fails to capture the continuous, context-dependent nature of 3D human movement. To address these issues, we propose UniHM, a unified motion language model that leverages diffusion-based generation for synthesizing scene-aware human motion. UniHM is the first framework to support both Text-to-Motion and Text-to-Human-Object Interaction (HOI) in complex 3D scenes. Our approach introduces three key contributions: (1) a mixed-motion representation that fuses continuous 6DoF motion with discrete local motion tokens to improve motion realism; (2) a novel Look-Up-Free Quantization VAE (LFQ-VAE) that surpasses traditional VQ-VAEs in both reconstruction accuracy and generative performance; and (3) an enriched version of the Lingo dataset augmented with HumanML3D annotations, providing stronger supervision for scene-specific motion learning. Experimental results demonstrate that UniHM achieves comparable performance on the OMOMO benchmark for text-to-HOI synthesis and yields competitive results on HumanML3D for general text-conditioned motion generation.

Human motion synthesis in complex scenes represents a challenging extension of the Text-to-Motion paradigm, with potential applications in virtual reality, robotics, and interactive environments where accurately synthesized human motion is critical to user experience.
While language models have demonstrated considerable success in generating realistic human motion sequences from text prompts, they struggle to achieve similar efficacy in scene-specific motion generation. Scene-based human motion synthesis requires not only an understanding of human motion but also an intricate integration of diverse modalities, such as static scene elements, movable objects, text prompts, and motion waypoints. These modalities add layers of complexity that go beyond standard Text-to-Motion tasks, demanding a cohesive synthesis of environmental context and dynamic interaction.
