Object-Oriented Architecture


Articulate your NeRF: Unsupervised articulated object modeling via conditional view synthesis

Neural Information Processing Systems

We propose a novel unsupervised method to learn the pose and part segmentation of articulated objects with rigid parts. Given two observations of an object in different articulation states, our method learns the geometry and appearance of the object parts with an implicit model fitted to the first observation, and distills the part segmentation and articulation from the second observation while rendering it. Additionally, to tackle the complexities of jointly optimizing part segmentation and articulation, we propose a voxel-grid-based initialization strategy and a decoupled optimization procedure. Compared to prior unsupervised work, our model obtains significantly better performance, generalizes to objects with multiple parts, and can be trained efficiently from few views of the latter observation.
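
A minimal sketch of the decoupled optimization idea, assuming a PyTorch setup: articulation parameters and a voxel-grid segmentation field are updated in alternating phases against a shared rendering loss. The loss here is a dummy stand-in, and all names and shapes are hypothetical rather than the authors' implementation.

```python
import torch

# Voxel-grid part logits (the initialization strategy) and an se(3)
# articulation parameter for the moving part -- hypothetical shapes.
seg_logits = torch.zeros(32, 32, 32, 2, requires_grad=True)
articulation = torch.zeros(6, requires_grad=True)

opt_seg = torch.optim.Adam([seg_logits], lr=1e-2)
opt_art = torch.optim.Adam([articulation], lr=1e-3)

def photometric_loss():
    # Placeholder: render the second articulation state from the model
    # fitted to the first observation and compare against the images.
    return (seg_logits.softmax(-1).mean() + (articulation ** 2).sum() - 0.5) ** 2

for step in range(100):
    # Phase 1: hold the segmentation fixed, fit the articulation.
    opt_art.zero_grad()
    photometric_loss().backward()
    opt_art.step()
    # Phase 2: hold the articulation fixed, refine the segmentation.
    opt_seg.zero_grad()
    photometric_loss().backward()
    opt_seg.step()
```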


Evaluating the Application of SOLID Principles in Modern AI Framework Architectures

arXiv.org Artificial Intelligence

This research evaluates the extent to which modern AI frameworks, specifically TensorFlow and scikit-learn, adhere to the SOLID design principles: Single Responsibility, Open/Closed, Liskov Substitution, Interface Segregation, and Dependency Inversion. Analyzing the frameworks' architectural documentation and design philosophies, this research investigates the architectural trade-offs made when balancing software engineering best practices with AI-specific needs. I examined each framework's documentation, source code, and architectural components to evaluate their adherence to these principles. The results show that both frameworks adopt certain aspects of SOLID design principles but make intentional trade-offs to address performance, scalability, and the experimental nature of AI development. TensorFlow focuses on performance and scalability, sometimes sacrificing strict adherence to principles such as Single Responsibility and Interface Segregation. scikit-learn's design philosophy aligns more closely with SOLID through consistent interfaces and composition, sticking closer to the guidelines but with occasional deviations for performance optimizations and scalability. This research found that applying SOLID principles in AI frameworks is context-dependent, as performance, scalability, and flexibility often require deviations from traditional software engineering principles. It contributes to understanding how domain-specific constraints influence architectural decisions in modern AI frameworks and how these frameworks strategically adapt their design choices to balance these competing requirements.
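
The consistent-interface point about scikit-learn can be made concrete with a small, self-contained example: every estimator exposes the same fit/predict contract, so implementations are substitutable (Liskov Substitution) and compose through Pipeline rather than inheritance. This is standard scikit-learn usage, shown here only to illustrate the abstract's claim.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)

# Either classifier satisfies the same estimator interface, so it can be
# swapped without touching the surrounding code -- substitutability in action.
for clf in (LogisticRegression(max_iter=1000), RandomForestClassifier(random_state=0)):
    model = Pipeline([("scale", StandardScaler()), ("clf", clf)])
    model.fit(X, y)
    print(type(clf).__name__, model.score(X, y))
```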


Object-Aware DINO (Oh-A-Dino): Enhancing Self-Supervised Representations for Multi-Object Instance Retrieval

arXiv.org Artificial Intelligence

Object-centric learning is fundamental to human vision and crucial for models requiring complex reasoning. Traditional approaches rely on slot-based bottlenecks to learn object properties explicitly, while recent self-supervised vision models like DINO have shown emergent object understanding. However, DINO representations primarily capture global scene features, often confounding individual object attributes. We investigate the effectiveness of DINO representations and slot-based methods for multi-object instance retrieval. Our findings reveal that DINO representations excel at capturing global object attributes such as object shape and size, but struggle with object-level details like colour, whereas slot-based representations struggle with both global and object-level understanding. To address this, we propose a method that combines global and local features by augmenting DINO representations with object-centric latent vectors from a Variational Autoencoder trained on segmented image patches extracted from the DINO features. This approach improves multi-object instance retrieval performance, bridging the gap between global scene understanding and fine-grained object representation without requiring full model retraining.
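
A hedged sketch of the combination described above: a global DINO feature for the image is concatenated with a per-object VAE latent computed on each segmented patch, yielding one retrieval vector per object. The callables (dino, vae, segment_objects) are hypothetical placeholders, not the paper's actual modules.

```python
import torch

def retrieval_vectors(image, dino, vae, segment_objects):
    """One retrieval vector per object: global DINO feature + object latent."""
    global_feat = dino(image)                 # (D,) scene-level DINO feature
    vectors = []
    for patch in segment_objects(image):      # object patches from DINO features
        mu, logvar = vae.encode(patch)        # object-centric VAE latent
        vectors.append(torch.cat([global_feat, mu]))
    return torch.stack(vectors)               # (num_objects, D + latent_dim)
```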


CHOrD: Generation of Collision-Free, House-Scale, and Organized Digital Twins for 3D Indoor Scenes with Controllable Floor Plans and Optimal Layouts

arXiv.org Artificial Intelligence

We introduce CHOrD, a novel framework for scalable synthesis of 3D indoor scenes, designed to create house-scale, collision-free, and hierarchically structured indoor digital twins. In contrast to existing methods that directly synthesize the scene layout as a scene graph or object list, CHOrD incorporates a 2D image-based intermediate layout representation, enabling effective prevention of collision artifacts by capturing them as out-of-distribution (OOD) scenarios during generation. Furthermore, unlike existing methods, CHOrD is capable of generating scene layouts that adhere to complex floor plans with multi-modal controls, enabling the creation of coherent, house-wide layouts robust to both geometric and semantic variations in room structures. Additionally, we propose a novel dataset with expanded coverage of household items and room configurations, as well as significantly improved data quality. CHOrD demonstrates state-of-the-art performance on both the 3D-FRONT and our proposed datasets, delivering photorealistic, spatially coherent indoor scene synthesis adaptable to arbitrary floor plan variations.
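
To see why a 2D image-based intermediate layout makes collisions easy to catch, consider a toy version: if each object's footprint is rasterized into a shared top-down grid, an overlap shows up directly as a cell claimed twice. Purely illustrative; this is not the CHOrD pipeline.

```python
import numpy as np

grid = np.zeros((64, 64), dtype=int)   # top-down occupancy counts

def place(grid, x, y, w, h):
    """Rasterize an axis-aligned object footprint into the grid."""
    grid[y:y + h, x:x + w] += 1

place(grid, 5, 5, 10, 8)    # hypothetical sofa footprint
place(grid, 12, 8, 6, 6)    # hypothetical table footprint, overlapping it

print("collision detected:", bool((grid > 1).any()))   # True: cells claimed twice
```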


Expelled! review – turning the tables on the private school class hierarchy

The Guardian

As with seemingly everything in the UK, it all comes back to the class system. Verity Amersham, a scholarship student at Miss Mulligatawney's School for Promising Girls, is accused of pushing the hockey captain out of a window, and the school's fearsome headmistress is determined to expel her despite the flimsiest evidence. When Verity protests her innocence, Miss Mulligatawney remains unpersuaded, spelling out her reasoning in plain terms: as a northerner with working-class parents, Verity simply isn't the "right sort". The injustice of it all is a potent driver, ensuring I set about my goal of preventing Verity's expulsion with determined zeal, much like Matilda defying the hateful Miss Trunchbull. As in developer Inkle's 2021 game Overboard!, you're given a time limit to work within and a handful of areas to move between, from the library to the sick room (AKA the "san", where the school's grumpy matron lurks). Each area has characters to talk to and objects to find, and each action moves the clock forward.


Leveraging Semantic Attribute Binding for Free-Lunch Color Control in Diffusion Models

arXiv.org Artificial Intelligence

Recent advances in text-to-image (T2I) diffusion models have enabled remarkable control over various attributes, yet precise color specification remains a fundamental challenge. Existing approaches, such as ColorPeel, rely on model personalization, requiring additional optimization and limiting flexibility in specifying arbitrary colors. In this work, we introduce ColorWave, a novel training-free approach that achieves exact RGB-level color control in diffusion models without fine-tuning. By systematically analyzing the cross-attention mechanisms within IP-Adapter, we uncover an implicit binding between textual color descriptors and reference image features. Leveraging this insight, our method rewires these bindings to enforce precise color attribution while preserving the generative capabilities of pretrained models. Our approach maintains generation quality and diversity, outperforming prior methods in accuracy and applicability across diverse object categories. Through extensive evaluations, we demonstrate that ColorWave establishes a new paradigm for structured, color-consistent diffusion-based image synthesis.
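
An illustrative sketch of what "rewiring a binding" in cross-attention could look like: the value vector tied to a colour token is replaced by a feature derived from the target RGB, so attention to that token injects the exact colour. Shapes and names here are hypothetical, not ColorWave's or IP-Adapter's code.

```python
import torch

def rewire_color_binding(values, color_token_idx, target_rgb, color_proj):
    # values: (num_tokens, d_model) cross-attention value matrix
    # target_rgb: (3,) tensor in [0, 1]; color_proj maps RGB -> d_model
    values = values.clone()
    values[color_token_idx] = color_proj(target_rgb)  # overwrite the binding
    return values

color_proj = torch.nn.Linear(3, 64)       # stand-in RGB-to-feature projection
values = torch.randn(10, 64)              # stand-in value matrix
patched = rewire_color_binding(values, 4, torch.tensor([1.0, 0.0, 0.0]), color_proj)
```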


How to Move Your Dragon: Text-to-Motion Synthesis for Large-Vocabulary Objects

arXiv.org Artificial Intelligence

Motion synthesis for diverse object categories holds great potential for 3D content creation but remains underexplored due to two key challenges: (1) the lack of comprehensive motion datasets that include a wide range of high-quality motions and annotations, and (2) the absence of methods capable of handling heterogeneous skeletal templates from diverse objects. To address these challenges, we contribute the following: First, we augment the Truebones Zoo dataset, a high-quality animal motion dataset covering over 70 species, by annotating it with detailed text descriptions, making it suitable for text-based motion synthesis. Second, we introduce rig augmentation techniques that generate diverse motion data while preserving consistent dynamics, enabling models to adapt to various skeletal configurations. Finally, we redesign existing motion diffusion models to dynamically adapt to arbitrary skeletal templates, enabling motion synthesis for a diverse range of objects with varying structures. Experiments show that our method learns to generate high-fidelity motions from textual descriptions for diverse and even unseen objects, setting a strong foundation for motion synthesis across diverse object categories and skeletal templates. Qualitative results are available at t2m4lvo.github.io
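
One common way to let a single model consume heterogeneous skeletal templates, sketched here as an assumption rather than the paper's actual conditioning: pad per-joint features to a fixed maximum joint count and carry an attention mask, so rigs of different sizes share one backbone.

```python
import torch

def pad_skeleton(joint_feats, max_joints):
    # joint_feats: (num_joints, feat_dim) for one rig
    n, d = joint_feats.shape
    padded = torch.zeros(max_joints, d)
    padded[:n] = joint_feats
    mask = torch.zeros(max_joints, dtype=torch.bool)
    mask[:n] = True                       # attend only to real joints
    return padded, mask

rig_a, mask_a = pad_skeleton(torch.randn(23, 16), max_joints=64)   # biped rig
rig_b, mask_b = pad_skeleton(torch.randn(51, 16), max_joints=64)   # dragon rig
```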


BEVDiffuser: Plug-and-Play Diffusion Model for BEV Denoising with Ground-Truth Guidance

arXiv.org Artificial Intelligence

Bird's-eye-view (BEV) representations play a crucial role in autonomous driving tasks. Despite recent advancements in BEV generation, inherent noise, stemming from sensor limitations and the learning process, remains largely unaddressed, resulting in suboptimal BEV representations that adversely impact the performance of downstream tasks. To address this, we propose BEVDiffuser, a novel diffusion model that effectively denoises BEV feature maps using the ground-truth object layout as guidance. BEVDiffuser can be operated in a plug-and-play manner at training time to enhance existing BEV models without requiring any architectural modifications. Extensive experiments on the challenging nuScenes dataset demonstrate BEVDiffuser's exceptional denoising and generation capabilities, which enable significant enhancement to existing BEV models, as evidenced by notable improvements of 12.3% in mAP and 10.1% in NDS achieved for 3D object detection without introducing additional computational complexity. Moreover, substantial improvements in long-tail object detection and under challenging weather and lighting conditions further validate BEVDiffuser's effectiveness in denoising and enhancing BEV representations.
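
A minimal sketch of the plug-and-play idea as described: at training time, the BEV feature map is passed through a layout-conditioned denoiser before the detection head, leaving the BEV backbone untouched. All modules below are stand-ins, not the BEVDiffuser implementation.

```python
import torch
import torch.nn as nn

bev_backbone = nn.Conv2d(3, 32, 3, padding=1)    # stand-in BEV encoder
denoiser = nn.Conv2d(32 + 1, 32, 3, padding=1)   # layout-conditioned denoiser
det_head = nn.Conv2d(32, 8, 1)                   # stand-in detection head

images = torch.randn(2, 3, 64, 64)
gt_layout = torch.randn(2, 1, 64, 64)            # rasterized ground-truth layout

bev = bev_backbone(images)
bev = denoiser(torch.cat([bev, gt_layout], dim=1))  # training-time denoising pass
preds = det_head(bev)                               # downstream head is unchanged
```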


GroundCap: A Visually Grounded Image Captioning Dataset

arXiv.org Artificial Intelligence

Current image captioning systems lack the ability to link descriptive text to specific visual elements, making their outputs difficult to verify. While recent approaches offer some grounding capabilities, they cannot track object identities across multiple references or ground both actions and objects simultaneously. We propose a novel ID-based grounding system that enables consistent object reference tracking and action-object linking, and present GroundCap, a dataset containing 52,016 images from 77 movies, with 344 human-annotated and 52,016 automatically generated captions. Each caption is grounded on detected objects (132 classes) and actions (51 classes) using a tag system that maintains object identity while linking actions to the corresponding objects. Our approach features persistent object IDs for reference tracking, explicit action-object linking, and segmentation of background elements through K-means clustering. Human evaluation demonstrates our approach's effectiveness in producing verifiable descriptions with coherent object references.

Introduction: One of the primary goals of combining computer vision and natural language processing is to enable machines to understand and communicate about visual scenes. This objective encompasses numerous tasks, including recognizing objects, describing their attributes and relationships, and providing contextually relevant descriptions of scenes [1]. While significant progress has been made in image classification, object detection, and image captioning, a critical aspect of human visual communication remains under-explored: the ability to ground language to specific elements within an image. Consider a scenario where two people are discussing a crowded street scene. One might say, "Look at that car," to which the other might respond, "Which one?" The first person would likely point to the specific car they're referring to while simultaneously describing it in more detail.
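
A hedged illustration of an ID-based grounding scheme like the one described: captions carry tags that bind noun phrases and actions to persistent object IDs, so an action can be traced back to the object it involves. The tag syntax below is invented for illustration and is not GroundCap's actual format.

```python
import re

# Hypothetical tagged caption: objects carry persistent IDs, actions reference them.
caption = ('<obj id=1 cls=person>A man</obj> <act ref=1>runs</act> '
           'toward <obj id=2 cls=car>a red car</obj>.')

objects = re.findall(r"<obj id=(\d+) cls=(\w+)>(.*?)</obj>", caption)
actions = re.findall(r"<act ref=(\d+)>(.*?)</act>", caption)
print(objects)  # [('1', 'person', 'A man'), ('2', 'car', 'a red car')]
print(actions)  # [('1', 'runs')] -- the action links back to object 1
```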


SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation

arXiv.org Artificial Intelligence

Spatial intelligence is a critical component of embodied AI, enabling robots to understand and interact with their environments. While recent advances have enhanced the ability of VLMs to perceive object locations and positional relationships, they still lack the capability to precisely understand object orientations, a key requirement for tasks involving fine-grained manipulation. Addressing this limitation requires not only geometric reasoning but also an expressive and intuitive way to represent orientation. In this context, we propose that natural language offers a more flexible representation space than canonical frames, making it particularly suitable for instruction-following robotic systems. In this paper, we introduce the concept of semantic orientation, which defines object orientations using natural language in a reference-frame-free manner (e.g., the "plug-in" direction of a USB or the "handle" direction of a knife). To support this, we construct OrienText300K, a large-scale dataset of 3D models annotated with semantic orientations that link geometric understanding to functional semantics. By integrating semantic orientation into a VLM system, we enable robots to generate manipulation actions with both positional and orientational constraints. Extensive experiments in simulation and the real world demonstrate that our approach significantly enhances robotic manipulation capabilities, e.g., 48.7% accuracy on Open6DOR and 74.9% accuracy on SIMPLER.
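
A sketch of the semantic-orientation idea under stated assumptions: a natural-language direction on an object (the "plug-in" direction of a USB stick, say) is stored as a unit vector in the object frame, and a manipulation constraint becomes an angular error between that direction, rotated into the world, and a task direction. The data and names are hypothetical, not OrienText300K entries.

```python
import numpy as np

# Hypothetical annotations: (object, language direction) -> unit vector in object frame.
semantic_orientations = {
    ("usb_stick", "plug-in direction"): np.array([0.0, 0.0, 1.0]),
    ("knife", "handle direction"):      np.array([1.0, 0.0, 0.0]),
}

def orientation_error(obj_rotation, key, task_dir):
    """Angle (radians) between the semantic direction in world frame and the task direction."""
    world_dir = obj_rotation @ semantic_orientations[key]
    return np.arccos(np.clip(world_dir @ task_dir, -1.0, 1.0))

R = np.eye(3)  # object currently unrotated
err = orientation_error(R, ("usb_stick", "plug-in direction"), np.array([0.0, 0.0, 1.0]))
print(np.degrees(err))  # 0.0 when the plug-in direction is aligned with the task
```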