Object-Oriented Architecture
BEVDiffuser: Plug-and-Play Diffusion Model for BEV Denoising with Ground-Truth Guidance
Ye, Xin, Yaman, Burhaneddin, Cheng, Sheng, Tao, Feng, Mallik, Abhirup, Ren, Liu
Bird's-eye-view (BEV) representations play a crucial role in autonomous driving tasks. Despite recent advancements in BEV generation, inherent noise, stemming from sensor limitations and the learning process, remains largely unaddressed, resulting in suboptimal BEV representations that adversely impact the performance of downstream tasks. To address this, we propose BEVDiffuser, a novel diffusion model that effectively denoises BEV feature maps using the ground-truth object layout as guidance. BEVDiffuser can be operated in a plug-and-play manner during training time to enhance existing BEV models without requiring any architectural modifications. Extensive experiments on the challenging nuScenes dataset demonstrate BEVDiffuser's exceptional denoising and generation capabilities, which enable significant enhancement to existing BEV models, as evidenced by notable improvements of 12.3\% in mAP and 10.1\% in NDS achieved for 3D object detection without introducing additional computational complexity. Moreover, substantial improvements in long-tail object detection and under challenging weather and lighting conditions further validate BEVDiffuser's effectiveness in denoising and enhancing BEV representations.
GroundCap: A Visually Grounded Image Captioning Dataset
Oliveira, Daniel A. P., Teodoro, Lourenço, de Matos, David Martins
Current image captioning systems lack the ability to link descriptive text to specific visual elements, making their outputs difficult to verify. While recent approaches offer some grounding capabilities, they cannot track object identities across multiple references or ground both actions and objects simultaneously. We propose a novel ID-based grounding system that enables consistent object reference tracking and action-object linking, and present GroundCap, a dataset containing 52,016 images from 77 movies, with 344 human-annotated and 52,016 automatically generated captions. Each caption is grounded on detected objects (132 classes) and actions (51 classes) using a tag system that maintains object identity while linking actions to the corresponding objects. Our approach features persistent object IDs for reference tracking, explicit action-object linking, and segmentation of background elements through K-means clustering. Human evaluation demonstrates our approach's effectiveness in producing verifiable descriptions with coherent object references. Introduction One of the primary goals combining computer vision and natural language processing is to enable machines to understand and communicate about visual scenes. This objective encompasses numerous tasks, including recognizing objects, describing their attributes and relationships, and providing contextually relevant descriptions of scenes [1]. While significant progress has been made in image classification, object detection, and image captioning, a critical aspect of human visual communication remains under-explored: the ability to ground language to specific elements within an image. Consider a scenario where two people are discussing a crowded street scene. One might say, "Look at that car." to which the other might respond, "Which one?". The first person would likely point to the specific car they're referring to while simultaneously describing it with more detail.
SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation
Qi, Zekun, Zhang, Wenyao, Ding, Yufei, Dong, Runpei, Yu, Xinqiang, Li, Jingwen, Xu, Lingyun, Li, Baoyu, He, Xialin, Fan, Guofan, Zhang, Jiazhao, He, Jiawei, Gu, Jiayuan, Jin, Xin, Ma, Kaisheng, Zhang, Zhizheng, Wang, He, Yi, Li
Spatial intelligence is a critical component of embodied AI, promoting robots to understand and interact with their environments. While recent advances have enhanced the ability of VLMs to perceive object locations and positional relationships, they still lack the capability to precisely understand object orientations-a key requirement for tasks involving fine-grained manipulations. Addressing this limitation not only requires geometric reasoning but also an expressive and intuitive way to represent orientation. In this context, we propose that natural language offers a more flexible representation space than canonical frames, making it particularly suitable for instruction-following robotic systems. In this paper, we introduce the concept of semantic orientation, which defines object orientations using natural language in a reference-frame-free manner (e.g., the ''plug-in'' direction of a USB or the ''handle'' direction of a knife). To support this, we construct OrienText300K, a large-scale dataset of 3D models annotated with semantic orientations that link geometric understanding to functional semantics. By integrating semantic orientation into a VLM system, we enable robots to generate manipulation actions with both positional and orientational constraints. Extensive experiments in simulation and real world demonstrate that our approach significantly enhances robotic manipulation capabilities, e.g., 48.7% accuracy on Open6DOR and 74.9% accuracy on SIMPLER.
Learning models of object structure
Joseph Schlecht, Kobus Barnard
We present an approach for learning stochastic geometric models of object categories from single view images. We focus here on models expressible as a spatially contiguous assemblage of blocks. Model topologies are learned across groups of images, and one or more such topologies is linked to an object category (e.g.
Articulate AnyMesh: Open-Vocabulary 3D Articulated Objects Modeling
Qiu, Xiaowen, Yang, Jincheng, Wang, Yian, Chen, Zhehuan, Wang, Yufei, Wang, Tsun-Hsuan, Xian, Zhou, Gan, Chuang
3D articulated objects modeling has long been a challenging problem, since it requires to capture both accurate surface geometries and semantically meaningful and spatially precise structures, parts, and joints. Existing methods heavily depend on training data from a limited set of handcrafted articulated object categories (e.g., cabinets and drawers), which restricts their ability to model a wide range of articulated objects in an open-vocabulary context. To address these limitations, we propose Articulate Anymesh, an automated framework that is able to convert any rigid 3D mesh into its articulated counterpart in an open-vocabulary manner. Given a 3D mesh, our framework utilizes advanced Vision-Language Models and visual prompting techniques to extract semantic information, allowing for both the segmentation of object parts and the construction of functional joints. Our experiments show that Articulate Anymesh can generate large-scale, high-quality 3D articulated objects, including tools, toys, mechanical devices, and vehicles, significantly expanding the coverage of existing 3D articulated object datasets. Additionally, we show that these generated assets can facilitate the acquisition of new articulated object manipulation skills in simulation, which can then be transferred to a real robotic system. Our Github website is https://articulate-anymesh.github.io.
Generating Multi-Image Synthetic Data for Text-to-Image Customization
Kumari, Nupur, Yin, Xi, Zhu, Jun-Yan, Misra, Ishan, Azadi, Samaneh
Customization of text-to-image models enables users to insert custom concepts and generate the concepts in unseen settings. Existing methods either rely on costly test-time optimization or train encoders on single-image training datasets without multi-image supervision, leading to worse image quality. We propose a simple approach that addresses both limitations. We first leverage existing text-to-image models and 3D datasets to create a high-quality Synthetic Customization Dataset (SynCD) consisting of multiple images of the same object in different lighting, backgrounds, and poses. We then propose a new encoder architecture based on shared attention mechanisms that better incorporate fine-grained visual details from input images. Finally, we propose a new inference technique that mitigates overexposure issues during inference by normalizing the text and image guidance vectors. Through extensive experiments, we show that our model, trained on the synthetic dataset with the proposed encoder and inference algorithm, outperforms existing tuning-free methods on standard customization benchmarks.
Reviews: Unsupervised Keypoint Learning for Guiding Class-Conditional Video Prediction
Cons/Questions: • Even though the authors claim that the method is robust to moving background, what happens if the background contains motion from similar objects but in different directions? For example, consider a street scene where the background may contain several cars and/or pedestrians moving in all directions.
Review for NeurIPS paper: Object Goal Navigation using Goal-Oriented Semantic Exploration
Summary and Contributions: This paper presents an extension to recent work on Active Neural SLAM [1], where semantic information about object categories is explicitly incorporated into the model. The extensions in the model architecture provide explicit semantic information about the various objects of the scene in the generated 2D map, that allows an agent to navigate in its environment and find a specified goal object much efficiently compared to baselines. Some of these baselines use - and others do not - semantic information. The comparison was performed using Gibson [2] and Matterport3D (MP3D) [3], which include 3D reconstructions of real environments. Training was performed on 86 scenes and testing on 16.
Learning models of object structure
We present an approach for learning stochastic geometric models of object categories from single view images. We focus here on models expressible as a spatially contiguous assemblage of blocks. Model topologies are learned across groups of images, and one or more such topologies is linked to an object category (e.g. Fitting learned topologies to an image can be used to identify the object class, as well as detail its geometry. The latter goes beyond labeling objects, as it provides the geometric structure of particular instances.
Are These the Same Apple? Comparing Images Based on Object Intrinsics
The human visual system can effortlessly recognize an object under different extrinsic factors such as lighting, object poses, and background, yet current computer vision systems often struggle with these variations. An important step to understanding and improving artificial vision systems is to measure image similarity purely based on intrinsic object properties that define object identity. This problem has been studied in the computer vision literature as re-identification, though mostly restricted to specific object categories such as people and cars. We propose to extend it to general object categories, exploring an image similarity metric based on object intrinsics. To benchmark such measurements, we collect the Common paired objects Under differenT Extrinsics (CUTE) dataset of 18, 000 images of 180 objects under different extrinsic factors such as lighting, poses, and imaging conditions.