scene type
Steerable Scene Generation with Post Training and Inference-Time Search
Pfaff, Nicholas, Dai, Hongkai, Zakharov, Sergey, Iwase, Shun, Tedrake, Russ
Training robots in simulation requires diverse 3D scenes that reflect the specific challenges of downstream tasks. However, scenes that satisfy strict task requirements, such as high-clutter environments with plausible spatial arrangement, are rare and costly to curate manually. Instead, we generate large-scale scene data using procedural models that approximate realistic environments for robotic manipulation, and adapt it to task-specific goals. We do this by training a unified diffusion-based generative model that predicts which objects to place from a fixed asset library, along with their SE(3) poses. This model serves as a flexible scene prior that can be adapted using reinforcement learning-based post training, conditional generation, or inference-time search, steering generation toward downstream objectives even when they differ from the original data distribution. Our method enables goal-directed scene synthesis that respects physical feasibility and scales across scene types. We introduce a novel MCTS-based inference-time search strategy for diffusion models, enforce feasibility via projection and simulation, and release a dataset of over 44 million SE(3) scenes spanning five diverse environments. Website with videos, code, data, and model weights: https://steerable-scene-generation.github.io/
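The MCTS-based inference-time search is only named in the abstract; below is a minimal sketch of how such a search over a stochastic reverse-diffusion sampler might look. The `denoise_step(scene, t)` (one stochastic reverse step) and `reward(scene)` (the downstream objective) callables are hypothetical stand-ins, and none of this is the paper's actual implementation.

```python
import math
import random

class Node:
    """A partially denoised scene at diffusion step t."""
    def __init__(self, scene, t, parent=None):
        self.scene, self.t = scene, t
        self.parent, self.children = parent, []
        self.visits, self.value = 0, 0.0

def ucb(node, c=1.4):
    # Unvisited children are explored first.
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits)

def mcts_sample(x_T, T, denoise_step, reward, n_iter=256, branch=4):
    root = Node(x_T, T)
    best_scene, best_r = None, float("-inf")
    for _ in range(n_iter):
        # 1) Select: descend by UCB until a leaf or a finished sample.
        node = root
        while node.children and node.t > 0:
            node = max(node.children, key=ucb)
        # 2) Expand: branch by drawing several stochastic next steps.
        if node.t > 0:
            for _ in range(branch):
                node.children.append(
                    Node(denoise_step(node.scene, node.t), node.t - 1, node))
            node = random.choice(node.children)
        # 3) Rollout: denoise the rest of the way, then score the scene.
        scene, t = node.scene, node.t
        while t > 0:
            scene, t = denoise_step(scene, t), t - 1
        r = reward(scene)
        if r > best_r:
            best_scene, best_r = scene, r
        # 4) Backpropagate the reward to the root.
        while node is not None:
            node.visits += 1
            node.value += r
            node = node.parent
    return best_scene
```

The visit-weighted tree biases compute toward denoising trajectories whose completed scenes score well under the task objective, without retraining the prior.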
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
- Asia > Japan > Honshū > Chūbu > Ishikawa Prefecture > Kanazawa (0.04)
MoK-RAG: Mixture of Knowledge Paths Enhanced Retrieval-Augmented Generation for Embodied AI Environments
Guo, Zhengsheng, Zheng, Linwei, Chen, Xinyang, Bai, Xuefeng, Chen, Kehai, Zhang, Min
While human cognition inherently retrieves information from diverse and specialized knowledge sources during decision-making, current Retrieval-Augmented Generation (RAG) systems typically retrieve from a single knowledge source, creating a cognitive-algorithmic discrepancy. To bridge this gap, we introduce MoK-RAG, a novel multi-source RAG framework that implements a mixture-of-knowledge-paths retrieval mechanism by functionally partitioning a large language model (LLM) corpus into distinct sections, enabling retrieval from multiple specialized knowledge paths. Applied to the generation of 3D simulated environments, our proposed MoK-RAG3D extends this paradigm by partitioning 3D assets into distinct sections organized in a hierarchical knowledge tree. Unlike previous methods, which rely solely on manual evaluation, we pioneer automated evaluation for 3D scenes. Both automatic and human evaluations in our experiments demonstrate that MoK-RAG3D can assist Embodied AI agents in generating diverse scenes.
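As a rough illustration of the mixture-of-knowledge-paths idea, the sketch below partitions a corpus by a functional section label and retrieves top-k results per path before merging. The section names, vector fields, and per-path k are illustrative assumptions, not the paper's design.

```python
import math
from collections import defaultdict

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)) + 1e-9)

def build_paths(corpus):
    """Group documents by their functional section (knowledge path),
    e.g. 'layout', 'assets', 'style' for 3D scene generation."""
    paths = defaultdict(list)
    for doc in corpus:
        paths[doc["section"]].append(doc)
    return paths

def mok_retrieve(paths, query_vec, k_per_path=3):
    """Take the top-k documents from every knowledge path, so each
    specialized source is guaranteed representation in the context."""
    context = []
    for section, docs in paths.items():
        ranked = sorted(docs, key=lambda d: cosine(query_vec, d["vec"]),
                        reverse=True)
        context.extend(ranked[:k_per_path])
    return context
```

Retrieving per path rather than from one pooled index is what keeps a dominant section from crowding the others out of the context.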
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- Asia > Thailand > Bangkok > Bangkok (0.05)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
DivScene: Benchmarking LVLMs for Object Navigation with Diverse Scenes and Objects
Wang, Zhaowei, Zhang, Hongming, Fang, Tianqing, Tian, Ye, Yang, Yue, Ma, Kaixin, Pan, Xiaoman, Song, Yangqiu, Yu, Dong
Object navigation in unknown environments is crucial for deploying embodied agents in real-world applications. While we have witnessed huge progress due to large-scale scene datasets, faster simulators, and stronger models, previous studies mainly focus on limited scene types and target objects. In this paper, we study a new task of navigating to diverse target objects in a large number of scene types. To benchmark the problem, we present a large-scale scene dataset, DivScene, which contains 4,614 scenes across 81 different types. With the dataset, we build an end-to-end embodied agent, NatVLM, by fine-tuning a Large Vision Language Model (LVLM) through imitation learning. The LVLM is trained to take previous observations from the environment and generate the next actions. We also introduce CoT explanation traces of the action prediction for better performance when tuning LVLMs. Our extensive experiments find that we can build a performant LVLM-based agent through imitation learning on the shortest paths constructed by a BFS planner without any human supervision. Our agent achieves a success rate that surpasses GPT-4o by over 20%. Meanwhile, we carry out various analyses showing the generalization ability of our agent. Our code and data are available at https://github.com/zhaowei-wang-nlp/DivScene.
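The expert trajectories come from a BFS shortest-path planner; a minimal grid-world version is sketched below. The grid layout and coordinate conventions are assumptions, and the paper's simulator states differ.

```python
from collections import deque

def bfs_shortest_path(grid, start, goal):
    """Return a list of (row, col) cells from start to goal on a 2D
    occupancy grid (0 = free, 1 = blocked), or None if unreachable.
    The `parent` map doubles as the visited set."""
    queue = deque([start])
    parent = {start: None}
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:      # walk back to start
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < len(grid) and 0 <= nc < len(grid[0])
                    and grid[nr][nc] == 0 and (nr, nc) not in parent):
                parent[(nr, nc)] = cell
                queue.append((nr, nc))
    return None
```

Each planned path can then be converted into (observation, action) pairs for imitation learning, which is why no human supervision is needed.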
- North America > United States > Pennsylvania (0.04)
- North America > United States > New York (0.04)
- Workflow (0.94)
- Research Report > New Finding (0.46)
- Retail (1.00)
- Health & Medicine (0.68)
- Consumer Products & Services (0.68)
- Leisure & Entertainment (0.67)
- Information Technology > Artificial Intelligence > Robots (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.90)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.87)
Why Can't I Dance in the Mall? Learning to Mitigate Scene Bias in Action Recognition
Choi, Jinwoo, Gao, Chen, Messou, Joseph C. E., Huang, Jia-Bin
Human activities often occur in specific scene contexts, e.g., playing basketball on a basketball court. A model trained on such data can therefore latch onto scene cues rather than the action itself, and the learned representation may not generalize well to new action classes or different tasks. In this paper, we propose to mitigate scene bias for video representation learning. Specifically, we augment the standard cross-entropy loss for action classification with 1) an adversarial loss for scene types and 2) a human mask confusion loss for videos in which the human actors are masked out. These two losses encourage representations that cannot predict the scene type and that cannot predict the action when the human is masked out, i.e., when there is no evidence for it.
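A minimal PyTorch sketch of such a combined objective follows, assuming a gradient-reversal layer for the adversarial scene term and an entropy-maximization form of the mask confusion term. Head names and loss weights are placeholders, not the paper's values.

```python
import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass, negated gradient on the backward
    pass, so the backbone unlearns whatever the scene head can use."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        return -grad

def debiased_loss(features, masked_features, action_head, scene_head,
                  action_y, scene_y, w_scene=0.5, w_mask=0.5):
    # 1) standard action classification loss
    ce = F.cross_entropy(action_head(features), action_y)
    # 2) adversarial scene loss: the scene head trains normally, but the
    #    shared features receive reversed gradients via GradReverse
    adv = F.cross_entropy(scene_head(GradReverse.apply(features)), scene_y)
    # 3) confusion loss: on human-masked clips, push action predictions
    #    toward uniform by maximizing prediction entropy
    logp = F.log_softmax(action_head(masked_features), dim=1)
    entropy = -(logp.exp() * logp).sum(dim=1).mean()
    return ce + w_scene * adv - w_mask * entropy  # minus: maximize entropy
```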
FRACTAL: An Ultra-Large-Scale Aerial Lidar Dataset for 3D Semantic Segmentation of Diverse Landscapes
Gaydon, Charles, Daab, Michel, Roche, Floryne
Mapping agencies are increasingly adopting Aerial Lidar Scanning (ALS) as a new tool to monitor territory and support public policies. Processing ALS data at scale requires efficient point classification methods that perform well over highly diverse territories. To evaluate them, researchers need large annotated Lidar datasets; however, current Lidar benchmark datasets have restricted scope and often cover a single urban area. To bridge this data gap, we present the FRench ALS Clouds from TArgeted Landscapes (FRACTAL) dataset: an ultra-large-scale aerial Lidar dataset made of 100,000 dense point clouds with high-quality labels for 7 semantic classes, spanning 250 km$^2$. FRACTAL is built upon France's nationwide open Lidar data. It achieves spatial and semantic diversity via a sampling scheme that explicitly concentrates rare classes and challenging landscapes from five French regions. It should support the development of 3D deep learning approaches for large-scale land monitoring. We describe the nature of the source data, the sampling workflow, the content of the resulting dataset, and provide an initial evaluation of segmentation performance using a performant 3D neural architecture.
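The abstract does not spell out the sampling scheme, but a rarity-aware weighting is one way such a scheme could concentrate rare classes. The sketch below is purely illustrative: the weighting rule, field names, and boost exponent are assumptions, not the authors' method.

```python
import random

def sample_patches(patches, class_freq, n, boost=1.0):
    """patches: dicts with a 'classes' set of semantic labels present.
    class_freq: overall frequency of each class in the source data.
    A patch is weighted by its rarest class, so patches containing rare
    classes are oversampled relative to their share of total area."""
    def weight(patch):
        return max(((1.0 / class_freq[c]) ** boost
                    for c in patch["classes"]), default=1e-9)
    weights = [weight(p) for p in patches]
    return random.choices(patches, weights=weights, k=n)
```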
- Europe > France (0.25)
- North America > United States > California (0.04)
Holodeck: Language Guided Generation of 3D Embodied AI Environments
Yang, Yue, Sun, Fan-Yun, Weihs, Luca, VanderBilt, Eli, Herrasti, Alvaro, Han, Winson, Wu, Jiajun, Haber, Nick, Krishna, Ranjay, Liu, Lingjie, Callison-Burch, Chris, Yatskar, Mark, Kembhavi, Aniruddha, Clark, Christopher
3D simulated environments play a critical role in Embodied AI, but their creation requires expertise and extensive manual effort, restricting their diversity and scope. To mitigate this limitation, we present Holodeck, a system that generates 3D environments matching a user-supplied prompt fully automatically. Holodeck can generate diverse scenes, e.g., arcades, spas, and museums, adjust designs to different styles, and capture the semantics of complex queries such as "apartment for a researcher with a cat" and "office of a professor who is a fan of Star Wars". Holodeck leverages a large language model (GPT-4) for common sense knowledge about what the scene might look like and uses a large collection of 3D assets from Objaverse to populate the scene with diverse objects. To address the challenge of positioning objects correctly, we prompt GPT-4 to generate spatial relational constraints between objects and then optimize the layout to satisfy those constraints. Our large-scale human evaluation shows that annotators prefer Holodeck over manually designed procedural baselines in residential scenes and that Holodeck can produce high-quality outputs for diverse scene types. We also demonstrate an exciting application of Holodeck in Embodied AI: training agents to navigate novel scenes like music rooms and daycares without human-constructed data, a significant step toward developing general-purpose embodied agents.
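A toy version of the constrain-then-optimize layout step might look like the following, with a made-up two-relation constraint vocabulary, a 2D room, and a random-search optimizer standing in for whatever Holodeck actually uses.

```python
import math
import random

def violation(pos, constraint, near=1.0, far=3.0):
    """Penalty for one (object_a, relation, object_b) constraint."""
    a, rel, b = constraint
    d = math.dist(pos[a], pos[b])
    if rel == "next_to":
        return max(0.0, d - near)   # penalize being farther than `near`
    if rel == "far_from":
        return max(0.0, far - d)    # penalize being closer than `far`
    return 0.0

def optimize_layout(objects, constraints, room=(10.0, 10.0), iters=5000):
    """Random-search layout: propose a new position for one object at a
    time, keep the move only if total violation does not increase."""
    pos = {o: (random.uniform(0, room[0]), random.uniform(0, room[1]))
           for o in objects}
    cost = sum(violation(pos, c) for c in constraints)
    for _ in range(iters):
        o = random.choice(objects)
        old = pos[o]
        pos[o] = (random.uniform(0, room[0]), random.uniform(0, room[1]))
        new_cost = sum(violation(pos, c) for c in constraints)
        if new_cost <= cost:
            cost = new_cost
        else:
            pos[o] = old             # revert worsening moves
    return pos, cost

# Constraints an LLM might emit for "a cozy reading corner", e.g.:
# optimize_layout(["armchair", "lamp", "door"],
#                 [("armchair", "next_to", "lamp"),
#                  ("armchair", "far_from", "door")])
```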
- Education (0.93)
- Leisure & Entertainment > Games > Computer Games (0.67)
Understanding Cross-modal Interactions in V&L Models that Generate Scene Descriptions
Cafagna, Michele, van Deemter, Kees, Gatt, Albert
Image captioning models tend to describe images in an object-centric way, emphasising visible objects. But image descriptions can also abstract away from objects and describe the type of scene depicted. In this paper, we explore the potential of a state-of-the-art Vision and Language model, VinVL, to caption images at the scene level, using (1) a novel dataset that pairs images with both object-centric and scene descriptions, and (2) an in-depth analysis of the effect of fine-tuning. We show (3) that a small amount of curated data suffices to generate scene descriptions without losing the capability to identify object-level concepts in the scene; the model acquires a more holistic view of the image than when object-centric descriptions are generated. We discuss the parallels between these results and insights from computational and cognitive science research on scene perception.
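A minimal sketch of assembling such paired fine-tuning data follows; the field names and the size of the curated scene-level subset are assumptions made for illustration.

```python
def build_finetune_set(object_caps, scene_caps, n_scene=500):
    """object_caps / scene_caps: dicts mapping image id -> caption.
    Every image keeps its object-centric caption; a small curated
    subset also contributes a scene-level description."""
    data = [{"image": i, "text": c, "level": "object"}
            for i, c in object_caps.items()]
    curated = list(scene_caps.items())[:n_scene]   # small curated subset
    data += [{"image": i, "text": c, "level": "scene"}
             for i, c in curated]
    return data
```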
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > Canada > British Columbia > Vancouver (0.04)
- Europe > Italy > Tuscany > Florence (0.04)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)