AITopics | Object-Oriented Architecture

Object-Oriented Architecture

Leveraging Semantic Attribute Binding for Free-Lunch Color Control in Diffusion Models

Laria, Héctor, Gomez-Villa, Alexandra, Qin, Jiang, Butt, Muhammad Atif, Raducanu, Bogdan, Vazquez-Corral, Javier, van de Weijer, Joost, Wang, Kai

arXiv.org Artificial Intelligence, Mar-12-2025

Recent advances in text-to-image (T2I) diffusion models have enabled remarkable control over various attributes, yet precise color specification remains a fundamental challenge. Existing approaches, such as ColorPeel, rely on model personalization, requiring additional optimization and limiting flexibility in specifying arbitrary colors. In this work, we introduce ColorWave, a novel training-free approach that achieves exact RGB-level color control in diffusion models without fine-tuning. By systematically analyzing the cross-attention mechanisms within IP-Adapter, we uncover an implicit binding between textual color descriptors and reference image features. Leveraging this insight, our method rewires these bindings to enforce precise color attribution while preserving the generative capabilities of pretrained models. Our approach maintains generation quality and diversity, outperforming prior methods in accuracy and applicability across diverse object categories. Through extensive evaluations, we demonstrate that ColorWave establishes a new paradigm for structured, color-consistent diffusion-based image synthesis.

color control, colorwave, diffusion model, (14 more...)

arXiv.org Artificial Intelligence

2503.09864

Country:

North America > United States > California (0.04)
Europe > Spain > Valencian Community > Valencia Province > Valencia (0.04)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
Asia > China > Heilongjiang Province > Harbin (0.04)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
(2 more...)

How to Move Your Dragon: Text-to-Motion Synthesis for Large-Vocabulary Objects

Lee, Wonkwang, Jeong, Jongwon, Moon, Taehong, Kim, Hyeon-Jong, Kim, Jaehyeon, Kim, Gunhee, Lee, Byeong-Uk

arXiv.org Artificial Intelligence, Mar-6-2025

Motion synthesis for diverse object categories holds great potential for 3D content creation but remains underexplored due to two key challenges: (1) the lack of comprehensive motion datasets that include a wide range of high-quality motions and annotations, and (2) the absence of methods capable of handling heterogeneous skeletal templates from diverse objects. To address these challenges, we make three contributions. First, we augment the Truebones Zoo dataset, a high-quality animal motion dataset covering over 70 species, by annotating it with detailed text descriptions, making it suitable for text-based motion synthesis. Second, we introduce rig augmentation techniques that generate diverse motion data while preserving consistent dynamics, enabling models to adapt to various skeletal configurations. Finally, we redesign existing motion diffusion models to dynamically adapt to arbitrary skeletal templates, enabling motion synthesis for a diverse range of objects with varying structures. Experiments show that our method learns to generate high-fidelity motions from textual descriptions for diverse and even unseen objects, setting a strong foundation for motion synthesis across diverse object categories and skeletal templates. Qualitative results are available at t2m4lvo.github.io

diffusion model, motion synthesis, synthesis, (16 more...)

arXiv.org Artificial Intelligence

2503.04257

Country:

Asia > South Korea > Seoul > Seoul (0.04)
Asia > Middle East > Saudi Arabia > Northern Borders Province > Arar (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.54)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

BEVDiffuser: Plug-and-Play Diffusion Model for BEV Denoising with Ground-Truth Guidance

Ye, Xin, Yaman, Burhaneddin, Cheng, Sheng, Tao, Feng, Mallik, Abhirup, Ren, Liu

arXiv.org Artificial Intelligence, Feb-26-2025

Bird's-eye-view (BEV) representations play a crucial role in autonomous driving tasks. Despite recent advancements in BEV generation, inherent noise, stemming from sensor limitations and the learning process, remains largely unaddressed, resulting in suboptimal BEV representations that adversely impact the performance of downstream tasks. To address this, we propose BEVDiffuser, a novel diffusion model that effectively denoises BEV feature maps using the ground-truth object layout as guidance. BEVDiffuser can be operated in a plug-and-play manner during training time to enhance existing BEV models without requiring any architectural modifications. Extensive experiments on the challenging nuScenes dataset demonstrate BEVDiffuser's exceptional denoising and generation capabilities, which enable significant enhancement to existing BEV models, as evidenced by notable improvements of 12.3% in mAP and 10.1% in NDS achieved for 3D object detection without introducing additional computational complexity. Moreover, substantial improvements in long-tail object detection and under challenging weather and lighting conditions further validate BEVDiffuser's effectiveness in denoising and enhancing BEV representations.

bev feature map, bev model, bevdiffuser, (13 more...)

arXiv.org Artificial Intelligence

2502.19694

Country: North America (0.04)

Genre: Research Report (1.00)

Industry: Transportation > Ground > Road (0.35)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.48)
Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (0.36)

GroundCap: A Visually Grounded Image Captioning Dataset

Oliveira, Daniel A. P., Teodoro, Lourenço, de Matos, David Martins

arXiv.org Artificial Intelligence, Feb-19-2025

Current image captioning systems lack the ability to link descriptive text to specific visual elements, making their outputs difficult to verify. While recent approaches offer some grounding capabilities, they cannot track object identities across multiple references or ground both actions and objects simultaneously. We propose a novel ID-based grounding system that enables consistent object reference tracking and action-object linking, and present GroundCap, a dataset containing 52,016 images from 77 movies, with 344 human-annotated and 52,016 automatically generated captions. Each caption is grounded on detected objects (132 classes) and actions (51 classes) using a tag system that maintains object identity while linking actions to the corresponding objects. Our approach features persistent object IDs for reference tracking, explicit action-object linking, and segmentation of background elements through K-means clustering. Human evaluation demonstrates our approach's effectiveness in producing verifiable descriptions with coherent object references.

Introduction

One of the primary goals of combining computer vision and natural language processing is to enable machines to understand and communicate about visual scenes. This objective encompasses numerous tasks, including recognizing objects, describing their attributes and relationships, and providing contextually relevant descriptions of scenes [1]. While significant progress has been made in image classification, object detection, and image captioning, a critical aspect of human visual communication remains under-explored: the ability to ground language to specific elements within an image. Consider a scenario where two people are discussing a crowded street scene. One might say, "Look at that car," to which the other might respond, "Which one?" The first person would likely point to the specific car they are referring to while describing it in more detail.
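The GroundCap abstract mentions segmenting background elements through K-means clustering but does not say which features are clustered or how many clusters are used. As a rough illustration only, the sketch below runs plain Lloyd's K-means on a handful of made-up RGB pixel values; the function name `kmeans`, the toy pixels, and the choice of `k=2` are all assumptions and not taken from the paper.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means over a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center
        # (squared Euclidean distance).
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centers

# Hypothetical background pixels as RGB tuples:
# three bluish "sky" pixels followed by three greenish "grass" pixels.
pixels = [(10, 20, 200), (12, 25, 210), (8, 18, 190),
          (20, 180, 30), (25, 170, 40), (15, 190, 35)]
labels, centers = kmeans(pixels, k=2)
```

With these toy values the two color groups end up in separate clusters; a real pipeline would cluster many more pixels, possibly with position or texture features added.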

caption, computer vision, groundcap, (15 more...)

arXiv.org Artificial Intelligence

2502.13898

Country:

Europe > Portugal > Lisbon > Lisbon (0.04)
North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
North America > United States > New Jersey (0.04)
(4 more...)

Genre: Research Report (0.64)

Industry:

Media (0.46)
Leisure & Entertainment (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.87)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.72)
(2 more...)

SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation

Qi, Zekun, Zhang, Wenyao, Ding, Yufei, Dong, Runpei, Yu, Xinqiang, Li, Jingwen, Xu, Lingyun, Li, Baoyu, He, Xialin, Fan, Guofan, Zhang, Jiazhao, He, Jiawei, Gu, Jiayuan, Jin, Xin, Ma, Kaisheng, Zhang, Zhizheng, Wang, He, Yi, Li

arXiv.org Artificial Intelligence, Feb-18-2025

Spatial intelligence is a critical component of embodied AI, enabling robots to understand and interact with their environments. While recent advances have enhanced the ability of VLMs to perceive object locations and positional relationships, they still lack the capability to precisely understand object orientations, a key requirement for tasks involving fine-grained manipulation. Addressing this limitation requires not only geometric reasoning but also an expressive and intuitive way to represent orientation. In this context, we propose that natural language offers a more flexible representation space than canonical frames, making it particularly suitable for instruction-following robotic systems. In this paper, we introduce the concept of semantic orientation, which defines object orientations using natural language in a reference-frame-free manner (e.g., the "plug-in" direction of a USB or the "handle" direction of a knife). To support this, we construct OrienText300K, a large-scale dataset of 3D models annotated with semantic orientations that link geometric understanding to functional semantics. By integrating semantic orientation into a VLM system, we enable robots to generate manipulation actions with both positional and orientational constraints. Extensive experiments in simulation and the real world demonstrate that our approach significantly enhances robotic manipulation capabilities, e.g., 48.7% accuracy on Open6DOR and 74.9% accuracy on SIMPLER.
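To make the idea of a semantic orientation concrete, the sketch below treats it as a 3D direction vector attached to a natural-language phrase and computes how far that direction is from a desired target direction. This is only an illustration under stated assumptions: the dictionary `semantic_orientations`, the example vectors, and the helper `angle_between` are hypothetical and do not reflect the SoFar implementation or the OrienText300K annotation format.

```python
import math

def angle_between(u, v):
    """Angle in degrees between two non-zero 3D direction vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_theta = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cos_theta))

# Hypothetical annotations for a knife: each phrase maps to a direction
# expressed in the object's own coordinates (assumed values).
semantic_orientations = {
    "handle": (1.0, 0.0, 0.0),
    "blade tip": (-1.0, 0.0, 0.0),
}

# An orientational constraint such as "point the handle along +y"
# reduces to the rotation angle between the two directions.
target_direction = (0.0, 1.0, 0.0)
rotation_needed = angle_between(semantic_orientations["handle"], target_direction)
opposite = angle_between(semantic_orientations["handle"],
                         semantic_orientations["blade tip"])
```

A manipulation planner would also need the rotation axis and the object's current pose; the angle alone is just the simplest quantity to check.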

large language model, machine learning, orientation, (23 more...)

arXiv.org Artificial Intelligence

2502.13143

Country:

North America > United States (1.00)
Europe (1.00)
Asia (0.67)

Genre: Research Report (1.00)

Industry:

Leisure & Entertainment (0.67)
Consumer Products & Services > Food, Beverage, Tobacco & Cannabis (0.45)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(5 more...)

Learning models of object structure

Joseph Schlecht, Kobus Barnard

Neural Information Processing Systems, Feb-11-2025, 18:04:08 GMT

We present an approach for learning stochastic geometric models of object categories from single view images. We focus here on models expressible as a spatially contiguous assemblage of blocks. Model topologies are learned across groups of images, and one or more such topologies is linked to an object category (e.g.

category, machine learning, object-oriented architecture, (20 more...)

Neural Information Processing Systems

Country:

North America > Canada > Ontario > Toronto (0.14)
North America > United States > Arizona (0.05)
Asia > Middle East > Jordan (0.04)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Sensing and Signal Processing > Image Processing (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (0.69)
(3 more...)

Articulate AnyMesh: Open-Vocabulary 3D Articulated Objects Modeling

Qiu, Xiaowen, Yang, Jincheng, Wang, Yian, Chen, Zhehuan, Wang, Yufei, Wang, Tsun-Hsuan, Xian, Zhou, Gan, Chuang

arXiv.org Artificial Intelligence, Feb-4-2025

3D articulated object modeling has long been a challenging problem, since it requires capturing both accurate surface geometries and semantically meaningful, spatially precise structures, parts, and joints. Existing methods depend heavily on training data from a limited set of handcrafted articulated object categories (e.g., cabinets and drawers), which restricts their ability to model a wide range of articulated objects in an open-vocabulary context. To address these limitations, we propose Articulate AnyMesh, an automated framework that can convert any rigid 3D mesh into its articulated counterpart in an open-vocabulary manner. Given a 3D mesh, our framework utilizes advanced Vision-Language Models and visual prompting techniques to extract semantic information, allowing for both the segmentation of object parts and the construction of functional joints. Our experiments show that Articulate AnyMesh can generate large-scale, high-quality 3D articulated objects, including tools, toys, mechanical devices, and vehicles, significantly expanding the coverage of existing 3D articulated object datasets. Additionally, we show that these generated assets can facilitate the acquisition of new articulated object manipulation skills in simulation, which can then be transferred to a real robotic system. Our project website is https://articulate-anymesh.github.io.

machine learning, natural language, object-oriented architecture, (17 more...)

arXiv.org Artificial Intelligence

2502.0259

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(2 more...)

Generating Multi-Image Synthetic Data for Text-to-Image Customization

Kumari, Nupur, Yin, Xi, Zhu, Jun-Yan, Misra, Ishan, Azadi, Samaneh

arXiv.org Artificial Intelligence, Feb-3-2025

Customization of text-to-image models enables users to insert custom concepts and generate the concepts in unseen settings. Existing methods either rely on costly test-time optimization or train encoders on single-image training datasets without multi-image supervision, leading to worse image quality. We propose a simple approach that addresses both limitations. We first leverage existing text-to-image models and 3D datasets to create a high-quality Synthetic Customization Dataset (SynCD) consisting of multiple images of the same object in different lighting, backgrounds, and poses. We then propose a new encoder architecture based on shared attention mechanisms that better incorporate fine-grained visual details from input images. Finally, we propose a new inference technique that mitigates overexposure issues during inference by normalizing the text and image guidance vectors. Through extensive experiments, we show that our model, trained on the synthetic dataset with the proposed encoder and inference algorithm, outperforms existing tuning-free methods on standard customization benchmarks.
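The abstract above says the inference technique normalizes the text and image guidance vectors but gives no formula. The sketch below shows one plausible reading, not the paper's method: a two-term classifier-free guidance update in which both guidance vectors are rescaled to a common norm before they are weighted and added. The function `combine_guidance`, the choice of the smaller norm as the common target, and the toy numbers are all assumptions.

```python
import math

def norm(v):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in v))

def combine_guidance(eps_uncond, eps_image, eps_full, image_scale, text_scale):
    """Two-term guidance with norm-equalized guidance vectors (assumed scheme).

    eps_uncond: noise prediction with no conditioning
    eps_image:  noise prediction conditioned on the reference image only
    eps_full:   noise prediction conditioned on image and text
    """
    g_image = [a - b for a, b in zip(eps_image, eps_uncond)]
    g_text = [a - b for a, b in zip(eps_full, eps_image)]
    # Assumption: bring both vectors to the smaller of the two norms so that
    # neither term dominates the update.
    target = min(norm(g_image), norm(g_text))

    def rescale(g):
        n = norm(g)
        return [x * target / n for x in g] if n > 0 else g

    g_image, g_text = rescale(g_image), rescale(g_text)
    return [u + image_scale * gi + text_scale * gt
            for u, gi, gt in zip(eps_uncond, g_image, g_text)]

# Toy two-dimensional "noise predictions" (made-up values).
guided = combine_guidance(
    eps_uncond=[0.0, 0.0],
    eps_image=[3.0, 4.0],   # image guidance vector has norm 5
    eps_full=[3.0, 5.0],    # text guidance vector has norm 1
    image_scale=2.0,
    text_scale=3.0,
)
```

In this toy case the image guidance vector is shrunk from norm 5 to norm 1 before scaling, which is the kind of damping that could limit overly strong guidance; the actual normalization used by the authors may differ.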

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2502.0172

Country:

North America > United States > Maine (0.04)
Asia > South Korea (0.04)
Asia > Japan > Honshū > Chūbu > Nagano Prefecture > Nagano (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)
(2 more...)

Reviews: Unsupervised Keypoint Learning for Guiding Class-Conditional Video Prediction

Neural Information Processing Systems, Jan-25-2025, 03:06:41 GMT

Cons/Questions: • Even though the authors claim that the method is robust to moving background, what happens if the background contains motion from similar objects but in different directions? For example, consider a street scene where the background may contain several cars and/or pedestrians moving in all directions.

background, guiding class-conditional video prediction, unsupervised keypoint learning, (7 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.37)

Review for NeurIPS paper: Object Goal Navigation using Goal-Oriented Semantic Exploration

Neural Information Processing Systems, Jan-23-2025, 00:59:09 GMT

Summary and Contributions: This paper presents an extension to recent work on Active Neural SLAM [1], in which semantic information about object categories is explicitly incorporated into the model. The extensions to the model architecture add explicit semantic information about the objects in the scene to the generated 2D map, which allows an agent to navigate its environment and find a specified goal object much more efficiently than the baselines. Some of these baselines use semantic information and others do not. The comparison was performed using Gibson [2] and Matterport3D (MP3D) [3], which include 3D reconstructions of real environments. Training was performed on 86 scenes and testing on 16.

goal-oriented semantic exploration, object goal navigation, semantic information, (10 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.60)
