scene context
- Europe > Germany > North Rhine-Westphalia (0.04)
- Asia > Japan > Honshū > Chūbu > Ishikawa Prefecture > Kanazawa (0.04)
Humans in Kitchens: A Dataset for Multi-Person Human Motion Forecasting with Scene Context
Tanke, Julian
Forecasting human motion of multiple persons is very challenging. It requires modeling the interactions between humans as well as the interactions with objects and the environment. For example, a person might want to make a coffee, but if the coffee machine is already occupied the person will have to wait. These complex relations between scene geometry and persons arise constantly in our daily lives, and models that wish to accurately forecast human behavior will have to take them into consideration. To facilitate research in this direction, we propose Humans in Kitchens, a large-scale multi-person human motion dataset with annotated 3D human poses, scene geometry and activities per person and frame. Our dataset consists of over 7.3 hours of recorded data with up to 16 persons present at the same time in four kitchen scenes, with more than 4M annotated human poses represented by a parametric 3D body model. In addition, dynamic scene geometry and objects such as chairs or cupboards are annotated per frame. As first benchmarks, we propose two protocols for short-term and long-term human motion forecasting.
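As a rough illustration of how such per-frame annotations and forecasting protocols might be organized in code, the sketch below defines a frame record and a helper that slices a recording into an observed window and a prediction target. All field names (FrameAnnotation, body_params, scene_objects) and the window layout are assumptions for illustration, not the dataset's actual schema or benchmark definition.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class FrameAnnotation:
    """One annotated frame: poses for everyone present, plus dynamic scene state."""
    frame_index: int
    person_ids: list[int]                 # up to 16 people may be present per frame
    body_params: np.ndarray               # (P, D) parametric 3D body model parameters
    activities: dict[int, str]            # per-person activity label
    scene_objects: dict[str, np.ndarray]  # e.g. {"chair_3": 4x4 pose matrix}

def make_forecasting_window(frames: list[FrameAnnotation],
                            t_obs: int, t_pred: int, start: int):
    """Split a recording into observed context frames and frames to forecast."""
    observed = frames[start:start + t_obs]
    target = frames[start + t_obs:start + t_obs + t_pred]
    return observed, target
```

A short-term and a long-term protocol would then differ only in the choice of t_obs and t_pred.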
SG-Nav: Online 3D Scene Graph Prompting for LLM-based Zero-shot Object Navigation
In this paper, we propose a new framework for zero-shot object navigation. Existing zero-shot object navigation methods prompt the LLM with the text of spatially close objects, which lacks enough scene context for in-depth reasoning. To better preserve the information of the environment and fully exploit the reasoning ability of the LLM, we propose to represent the observed scene with a 3D scene graph. The scene graph encodes the relationships between objects, groups and rooms in an LLM-friendly structure, for which we design a hierarchical chain-of-thought prompt that helps the LLM reason about the goal location according to scene context by traversing the nodes and edges. Moreover, benefiting from the scene graph representation, we further design a re-perception mechanism that empowers the object navigation framework with the ability to correct perception errors. We conduct extensive experiments on MP3D, HM3D and RoboTHOR environments, where SG-Nav surpasses previous state-of-the-art zero-shot methods by more than 10% SR on all benchmarks, while keeping the decision process explainable. To the best of our knowledge, SG-Nav is the first zero-shot method that achieves even higher performance than supervised object navigation methods on the challenging MP3D benchmark. Code for this project will be released in the final version.
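To make the idea of hierarchical scene-graph prompting concrete, here is a minimal sketch that builds a toy room/group/object graph and flattens it into a chain-of-thought style prompt for an LLM. The node names, edge relations, and prompt wording are illustrative assumptions, not SG-Nav's actual graph construction or prompts.

```python
# Toy scene graph: rooms contain groups, groups contain objects, plus object relations.
scene_graph = {
    "rooms": {
        "kitchen": {"groups": ["counter_area"]},
        "living_room": {"groups": ["sofa_area"]},
    },
    "groups": {
        "counter_area": {"objects": ["sink", "coffee_machine"]},
        "sofa_area": {"objects": ["sofa", "tv"]},
    },
    "relations": [("tv", "in_front_of", "sofa"), ("sink", "next_to", "coffee_machine")],
}

def build_prompt(graph: dict, goal: str) -> str:
    """Traverse rooms -> groups -> objects and describe the graph level by level."""
    lines = [f"Task: find the most likely location of a '{goal}'.",
             "Reason step by step over the scene graph:"]
    for room, rinfo in graph["rooms"].items():
        lines.append(f"Room '{room}' contains groups: {', '.join(rinfo['groups'])}.")
        for group in rinfo["groups"]:
            objs = graph["groups"][group]["objects"]
            lines.append(f"  Group '{group}' contains objects: {', '.join(objs)}.")
    for a, rel, b in graph["relations"]:
        lines.append(f"Relation: {a} is {rel} {b}.")
    lines.append("First choose the most promising room, then the group, then the object to navigate towards.")
    return "\n".join(lines)

print(build_prompt(scene_graph, "mug"))
```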
LAP: Fast LAtent Diffusion Planner with Fine-Grained Feature Distillation for Autonomous Driving
Zhang, Jinhao, Xia, Wenlong, Zhou, Zhexuan, Gong, Youmin, Mei, Jie
Diffusion models have demonstrated strong capabilities for modeling human-like driving behaviors in autonomous driving, but their iterative sampling process induces substantial latency, and operating directly on raw trajectory points forces the model to spend capacity on low-level kinematics rather than high-level multi-modal semantics. To address these limitations, we propose LAtent Planner (LAP), a framework that plans in a VAE-learned latent space that disentangles high-level intents from low-level kinematics, enabling our planner to capture rich, multi-modal driving strategies. We further introduce a fine-grained feature distillation mechanism to guide better interaction and fusion between the high-level semantic planning space and the vectorized scene context. Notably, LAP can produce high-quality plans in a single denoising step, substantially reducing computational overhead. Through extensive evaluations on the large-scale nuPlan benchmark, LAP achieves state-of-the-art closed-loop performance among learning-based planning methods, while demonstrating an inference speedup of up to 10x over previous SOTA approaches.
A central challenge is handling the inherent uncertainty and behavioral multimodality of real-world traffic, where multiple distinct yet equally plausible maneuvers may be available (Yang et al., 2023; Xiao et al., 2020). While early rule-based systems offered interpretability, their hand-crafted logic is brittle and fails to scale to the long tail of open-world scenarios (Fan et al., 2018; Chen et al., 2024). Consequently, the field has shifted towards data-driven Imitation Learning (IL), which excels at capturing nuanced, human-like behaviors from large-scale datasets (Le Mero et al., 2022; Teng et al., 2022). However, the standard IL objective is notoriously susceptible to mode-averaging, where the model collapses multiple valid expert trajectories into a single, physically infeasible path, fundamentally failing to represent the multi-modal nature of human decision-making (Strohbeck et al., 2020). To overcome this limitation, Denoising Diffusion Probabilistic Models (DDPMs) have emerged as a powerful tool for modeling complex, multi-modal distributions (Liao et al., 2025; Ho et al., 2020). However, existing approaches that apply diffusion directly to raw trajectory waypoints are both computationally inefficient and conceptually flawed. This mirrors the core challenge of early image synthesis: operating in a high-dimensional pixel space expends vast model capacity on low-level details over high-level semantics (Rombach et al., 2022).
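The following toy sketch shows the general shape of planning in a learned latent space with a single denoising step: a conditional denoiser maps noise plus scene context to a latent intent, and a frozen VAE-style decoder turns that latent into waypoints. The layer sizes, conditioning scheme, and one-step formulation here are assumptions for illustration, not LAP's actual architecture or training procedure.

```python
import torch
import torch.nn as nn

class LatentPlannerSketch(nn.Module):
    """Toy one-step latent planner: denoise a latent intent, then decode waypoints."""
    def __init__(self, latent_dim: int = 32, scene_dim: int = 128, horizon: int = 16):
        super().__init__()
        self.latent_dim, self.horizon = latent_dim, horizon
        # Conditional denoiser: (noisy latent, scene context) -> clean latent.
        self.denoiser = nn.Sequential(
            nn.Linear(latent_dim + scene_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # Stand-in for a pretrained VAE decoder mapping a latent to (x, y) waypoints.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, horizon * 2),
        )

    def forward(self, scene_context: torch.Tensor) -> torch.Tensor:
        noise = torch.randn(scene_context.shape[0], self.latent_dim)
        # Single denoising step: predict the clean latent directly from noise + context.
        z = self.denoiser(torch.cat([noise, scene_context], dim=-1))
        return self.decoder(z).view(-1, self.horizon, 2)

plan = LatentPlannerSketch()(torch.randn(4, 128))  # -> (4, 16, 2) waypoint plan
```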
- Asia > China > Heilongjiang Province > Harbin (0.04)
- Asia > China > Guangdong Province > Shenzhen (0.04)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Robots (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Rule-Based Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
MTR-VP: Towards End-to-End Trajectory Planning through Context-Driven Image Encoding and Multiple Trajectory Prediction
Keskar, Maitrayee, Trivedi, Mohan, Greer, Ross
We present a method for trajectory planning for autonomous driving, learning image-based context embeddings that align with motion prediction frameworks and planning-based intention input. Within our method, a ViT encoder takes raw images and past kinematic state as input and is trained to produce context embeddings, drawing inspiration from those generated by the recent MTR (Motion Transformer) encoder, effectively substituting map-based features with learned visual representations. MTR provides a strong foundation for multimodal trajectory prediction by localizing agent intent and refining motion iteratively via motion query pairs; we name our approach MTR-VP (Motion Transformer for Vision-based Planning), and instead of the learnable intention queries used in the MTR decoder, we use cross-attention between the intent embeddings and the context embeddings, which reflect a combination of information encoded from the driving scene and past vehicle states. We evaluate our method on the Waymo End-to-End Driving Dataset, which requires predicting the agent's future 5-second trajectory in bird's-eye-view coordinates using prior camera images, agent pose history, and routing goals. We analyze our architecture using ablation studies, removing the input images and the multiple-trajectory output. Our results suggest that the transformer-based methods used to combine visual features with kinematic features, such as past trajectory features, are not effective at fusing both modalities into useful scene context embeddings, even when intention embeddings are augmented with foundation-model representations of scene context from CLIP and DINOv2, but that predicting a distribution over multiple futures instead of a single future trajectory boosts planning performance.
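The decoder-side idea, an intent embedding cross-attending to learned context embeddings and emitting several candidate futures, can be sketched as below. The embedding sizes, head counts, number of modes, and output heads are illustrative assumptions rather than MTR-VP's actual configuration.

```python
import torch
import torch.nn as nn

class IntentCrossAttentionSketch(nn.Module):
    """Toy decoder: an intent query cross-attends to scene context embeddings,
    then predicts multiple future trajectories with per-mode scores."""
    def __init__(self, d_model: int = 256, num_heads: int = 8,
                 num_modes: int = 6, horizon: int = 10):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.traj_head = nn.Linear(d_model, num_modes * horizon * 2)  # (x, y) per step
        self.score_head = nn.Linear(d_model, num_modes)               # logits over modes
        self.num_modes, self.horizon = num_modes, horizon

    def forward(self, intent: torch.Tensor, context: torch.Tensor):
        # intent: (B, 1, d_model) routing/goal embedding; context: (B, N, d_model)
        fused, _ = self.cross_attn(query=intent, key=context, value=context)
        fused = fused.squeeze(1)
        trajs = self.traj_head(fused).view(-1, self.num_modes, self.horizon, 2)
        scores = self.score_head(fused)
        return trajs, scores

trajs, scores = IntentCrossAttentionSketch()(torch.randn(2, 1, 256), torch.randn(2, 64, 256))
```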
- North America > United States > California > San Diego County > San Diego (0.04)
- North America > United States > California > Merced County > Merced (0.04)
- Transportation > Ground > Road (0.51)
- Information Technology (0.37)
- Automobiles & Trucks (0.37)
MOGRAS: Human Motion with Grasping in 3D Scenes
Bhosikar, Kunal, Katageri, Siddharth, Madhavaram, Vivek, Han, Kai, Sharma, Charu
Generating realistic full-body motion interacting with objects is critical for applications in robotics, virtual reality, and human-computer interaction. While existing methods can generate full-body motion within 3D scenes, they often lack the fidelity for fine-grained tasks like object grasping. Conversely, methods that generate precise grasping motions typically ignore the surrounding 3D scene. This gap, generating full-body grasping motions that are physically plausible within a 3D scene, remains a significant challenge. To address this, we introduce MOGRAS (Human MOtion with GRAsping in 3D Scenes), a large-scale dataset that bridges this gap. MOGRAS provides pre-grasping full-body walking motions and final grasping poses within richly annotated 3D indoor scenes. We leverage MOGRAS to benchmark existing full-body grasping methods and demonstrate their limitations in scene-aware generation. Furthermore, we propose a simple yet effective method to adapt existing approaches to work seamlessly within 3D scenes. Through extensive quantitative and qualitative experiments, we validate the effectiveness of our dataset and highlight the significant improvements our proposed method achieves, paving the way for more realistic human-scene interactions.
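One way to picture what a sample in such a dataset might contain is the record sketch below, pairing an approach (walking) motion with a final full-body grasp pose inside an annotated scene. The field names and array shapes are assumptions for illustration, not the released dataset's schema.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MograsSampleSketch:
    """Illustrative layout for one sample: approach motion plus final grasp pose in a scene."""
    scene_id: str
    target_object: str        # object to be grasped
    walk_motion: np.ndarray   # (T, J, 3) joint positions while walking towards the object
    grasp_pose: np.ndarray    # (J, 3) final full-body grasping pose
    object_pose: np.ndarray   # 4x4 pose of the target object in the scene
```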
- North America > United States (0.04)
- Asia > South Korea > Jeollanam-do > Muan (0.04)
- Asia > India > Telangana > Hyderabad (0.04)
- Asia > China > Hong Kong (0.04)
VIRTUE: Visual-Interactive Text-Image Universal Embedder
Wang, Wei-Yao, Tateishi, Kazuya, Wu, Qiyu, Takahashi, Shusuke, Mitsufuji, Yuki
Multimodal representation learning models have demonstrated success across complex tasks, and the integration of vision-language models (VLMs) has further enabled embedding models with instruction-following capabilities. However, existing embedding models lack the visual-interactive capabilities that let users specify regions of interest (e.g., point, bounding box, mask), which have been explored in generative models to broaden their human-interactive applicability. Equipping embedding models with visual interactions would not only unlock new applications with localized grounding of user intent, which remains unexplored, but also enable the models to learn entity-level information within images to complement their global representations for conventional embedding tasks. In this paper, we propose a novel Visual-InteRactive Text-Image Universal Embedder (VIRTUE) that extends the capabilities of the segmentation model and the vision-language model to the realm of representation learning. In VIRTUE, the segmentation model can process visual prompts that pinpoint specific regions within an image, thereby enabling the embedder to handle complex and ambiguous scenarios more precisely. To evaluate the visual-interaction ability of VIRTUE, we introduce a large-scale Segmentation-and-Scene Caption Retrieval (SCaR) benchmark comprising 1M samples, which aims to retrieve the text caption by jointly considering the entity with a specific object and the image scene. VIRTUE consistently achieves state-of-the-art performance with significant improvements across 36 universal MMEB (3.1%-8.5%) and five visual-interactive SCaR (15.2%-20.3%) tasks.
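A minimal sketch of the downstream use of such embeddings is given below: an entity-level feature obtained from a user's visual prompt is fused with the global image feature, and captions are retrieved by cosine similarity. The simple additive fusion and the random stand-in features are assumptions for illustration, not VIRTUE's actual fusion design or encoders.

```python
import torch
import torch.nn.functional as F

def visual_interactive_embed(global_feat: torch.Tensor,
                             region_feat: torch.Tensor) -> torch.Tensor:
    """Fuse a global image embedding with an entity-level embedding from a
    visual prompt (point/box/mask); additive fusion is a toy assumption."""
    return F.normalize(global_feat + region_feat, dim=-1)

def retrieve_caption(query_embed: torch.Tensor, caption_embeds: torch.Tensor) -> int:
    """Return the index of the most similar caption (cosine similarity if normalized)."""
    sims = caption_embeds @ query_embed
    return int(sims.argmax())

# Usage with random stand-in features (in practice from a VLM and a segmentation model).
img_feat, region_feat = torch.randn(512), torch.randn(512)
caption_feats = F.normalize(torch.randn(100, 512), dim=-1)
best = retrieve_caption(visual_interactive_embed(img_feat, region_feat), caption_feats)
```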
- North America > United States > Virginia (0.05)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
Personalized Robotic Object Rearrangement from Scene Context
Ramachandruni, Kartik, Chernova, Sonia
Object rearrangement is a key task for household robots requiring personalization without explicit instructions, meaningful object placement in environments occupied with objects, and generalization to unseen objects and new environments. To facilitate research addressing these challenges, we introduce PARSEC, an object rearrangement benchmark for learning user organizational preferences from observed scene context to place objects in a partially arranged environment. PARSEC is built upon a novel dataset of 110K rearrangement examples crowdsourced from 72 users, featuring 93 object categories and 15 environments. To better align with real-world organizational habits, we propose ContextSortLM, an LLM-based personalized rearrangement model that handles flexible user preferences by explicitly accounting for objects with multiple valid placement locations when placing items in partially arranged environments. We evaluate ContextSortLM and existing personalized rearrangement approaches on the PARSEC benchmark and complement these findings with a crowdsourced evaluation of 108 online raters ranking model predictions based on alignment with user preferences. Our results indicate that personalized rearrangement models leveraging multiple scene context sources perform better than models relying on a single context source. Moreover, ContextSortLM outperforms other models in placing objects to replicate the target user's arrangement and ranks among the top two in all three environment categories, as rated by online evaluators. Importantly, our evaluation highlights challenges associated with modeling environment semantics across different environment categories and provides recommendations for future work.
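As a rough illustration of LLM-based placement from observed scene context, the sketch below serializes a partially arranged environment into a prompt and asks for a placement location, acknowledging that several locations may be acceptable. The prompt wording and the call_llm stub are hypothetical, not ContextSortLM's actual prompts or interface.

```python
def build_placement_prompt(scene: dict[str, list[str]], new_object: str) -> str:
    """Describe the partially arranged environment and ask where the new object belongs."""
    lines = ["You are organizing a household for a specific user.",
             "Current (partial) arrangement:"]
    for location, objects in scene.items():
        lines.append(f"- {location}: {', '.join(objects) if objects else 'empty'}")
    lines.append(f"Several locations may be acceptable. Where should '{new_object}' go?")
    lines.append("Answer with one location name and a one-sentence justification.")
    return "\n".join(lines)

def call_llm(prompt: str) -> str:
    """Stand-in for an LLM call; wire this to a real model in practice."""
    raise NotImplementedError

scene = {"top_shelf": ["cereal", "oats"], "drawer_1": ["forks", "spoons"], "counter": []}
prompt = build_placement_prompt(scene, "granola bars")
# placement = call_llm(prompt)
```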
- Information Technology > Artificial Intelligence > Robots (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.67)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (0.59)
- (3 more...)