AITopics | Object-Oriented Architecture

Collaborating Authors

Object-Oriented Architecture

News Overviews Instructional Materials AI-Alerts Classics

Urban 3D Change Detection Using LiDAR Sensor for HD Map Maintenance and Smart Mobility

Albagami, Hezam, Wang, Haitian, Wang, Xinyu, Ibrahim, Muhammad, Malakan, Zainy M., Alqamdi, Abdullah M., Alghamdi, Mohammed H., Mian, Ajmal

arXiv.org Artificial IntelligenceOct-27-2025

High-definition 3D city maps underpin smart transportation, digital twins, and autonomous driving, where object level change detection across bi temporal LiDAR enables HD map maintenance, construction monitoring, and reliable localization. Classical DSM differencing and image based methods are sensitive to small vertical bias, ground slope, and viewpoint mismatch and yield cellwise outputs without object identity. Point based neural models and voxel encodings demand large memory, assume near perfect pre alignment, degrade thin structures, and seldom enforce class consistent association, which leaves split or merge cases unresolved and ignores uncertainty. We propose an object centric, uncertainty aware pipeline for city scale LiDAR that aligns epochs with multi resolution NDT followed by point to plane ICP, normalizes height, and derives a per location level of detection from registration covariance and surface roughness to calibrate decisions and suppress spurious changes. Geometry only proxies seed cross epoch associations that are refined by semantic and instance segmentation and a class constrained bipartite assignment with augmented dummies to handle splits and merges while preserving per class counts. Tiled processing bounds memory without eroding narrow ground changes, and instance level decisions combine 3D overlap, normal direction displacement, and height and volume differences with a histogram distance, all gated by the local level of detection to remain stable under partial overlap and sampling variation. On 15 representative Subiaco blocks the method attains 95.2% accuracy, 90.4% mF1, and 82.6% mIoU, exceeding Triplet KPConv by 0.2 percentage points in accuracy, 0.2 in mF1, and 0.8 in mIoU, with the largest gain on Decreased where IoU reaches 74.8% and improves by 7.6 points.

detection, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2510.21112

Country:

North America > United States (1.00)
Oceania > Australia (0.96)
Asia > Middle East > Saudi Arabia (0.16)

Genre: Research Report (0.50)

Industry:

Government > Regional Government (1.00)
Information Technology (0.88)
Transportation > Ground > Road (0.34)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.48)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.46)
(2 more...)

Add feedback

[De|Re]constructing VLMs' Reasoning in Counting

Alghisi, Simone, Roccabruna, Gabriel, Rizzoli, Massimo, Mousavi, Seyed Mahed, Riccardi, Giuseppe

arXiv.org Artificial IntelligenceOct-23-2025

Vision-Language Models (VLMs) have recently gained attention due to their competitive performance on multiple downstream tasks, achieved by following user-input instructions. However, VLMs still exhibit several limitations in visual reasoning, such as difficulties in identifying relations (e.g., spatial, temporal, and among objects), understanding temporal sequences (e.g., frames), and counting objects. In this work, we go beyond score-level benchmark evaluations of VLMs by investigating the underlying causes of their failures and proposing a targeted approach to improve their reasoning capabilities. We study the reasoning skills of seven state-of-the-art VLMs in the counting task under controlled experimental conditions. Our experiments show that VLMs are highly sensitive to the number and type of objects, their spatial arrangement, and the co-occurrence of distractors. A layer-wise analysis reveals that errors are due to incorrect mapping of the last-layer representation into the output space. Our targeted training shows that fine-tuning just the output layer improves accuracy by up to 21%. We corroborate these findings by achieving consistent improvements on real-world datasets.

accuracy, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2510.19555

Country:

Europe (0.46)
North America (0.28)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.68)

Industry: Food & Agriculture > Agriculture (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.68)
(2 more...)

Add feedback

DIV-Nav: Open-Vocabulary Spatial Relationships for Multi-Object Navigation

Ortega-Peimbert, Jesús, Busch, Finn Lukas, Homberger, Timon, Yang, Quantao, Andersson, Olov

arXiv.org Artificial IntelligenceOct-21-2025

Abstract-- Advances in open-vocabulary semantic mapping and object navigation have enabled robots to perform an informed search of their environment for an arbitrary object. However, such zero-shot object navigation is typically designed for simple queries with an object name like "television" or "blue rug". Here, we consider more complex free-text queries with spatial relationships, such as "find the remote on the table" while still leveraging robustness of a semantic map. We present DIV-Nav, a real-time navigation system that efficiently addresses this problem through a series of relaxations: i) Decomposing natural language instructions with complex spatial constraints into simpler object-level queries on a semantic map, ii) computing the Intersection of individual semantic belief maps to identify regions where all objects co-exist, and iii) V alidating the discovered objects against the original, complex spatial constrains via a L VLM. We further investigate how to adapt the frontier exploration objectives of online semantic mapping to such spatial search queries to more effectively guide the search process. Robots operating in human environments must interpret natural language commands that go beyond simple object identification. While a command like "find a chair" requires handling simple object classes only, real-world search instructions often specify spatial relationships: "go to the chair next to the desk," "find the towel in the bathroom," or "get the book on the nightstand."

large language model, machine learning, natural language, (22 more...)

arXiv.org Artificial Intelligence

2510.16518

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.68)
(4 more...)

Add feedback

Data-Driven Analysis of Intersectional Bias in Image Classification: A Framework with Bias-Weighted Augmentation

Yesmin, Farjana

arXiv.org Artificial IntelligenceOct-21-2025

Machine learning models trained on imbalanced datasets often exhibit intersectional biases--systematic errors arising from the interaction of multiple attributes such as object class and environmental conditions. This paper presents a data-driven framework for analyzing and mitigating such biases in image classification. We introduce the Intersec-tional Fairness Evaluation Framework (IFEF), which combines quantitative fairness metrics with interpretability tools to systematically identify bias patterns in model predictions. Building on this analysis, we propose Bias-Weighted Augmentation (BWA), a novel data augmentation strategy that adapts transformation intensities based on subgroup distribution statistics. Experiments on the Open Images V7 dataset with five object classes demonstrate that BWA improves accuracy for underrep-resented class-environment intersections by up to 24 percentage points while reducing fairness metric disparities by 35%. Statistical analysis across multiple independent runs confirms the significance of improvements (p < 0.05). Our methodology provides a replicable approach for analyzing and addressing intersectional biases in image classification systems.

artificial intelligence, machine learning, object-oriented architecture, (16 more...)

arXiv.org Artificial Intelligence

2510.16072

Genre:

Research Report > Experimental Study (0.67)
Research Report > New Finding (0.49)

Industry: Health & Medicine > Diagnostic Medicine (0.46)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.94)
Information Technology > Artificial Intelligence > Vision > Image Understanding (0.82)
Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.69)

Add feedback

FG-CLIP 2: A Bilingual Fine-grained Vision-Language Alignment Model

Xie, Chunyu, Wang, Bin, Kong, Fanjing, Li, Jincheng, Liang, Dawei, Ao, Ji, Leng, Dawei, Yin, Yuhui

arXiv.org Artificial IntelligenceOct-20-2025

Fine-grained vision-language understanding requires precise alignment between visual content and linguistic descriptions, a capability that remains limited in current models, particularly in non-English settings. While models like CLIP perform well on global alignment, they often struggle to capture fine-grained details in object attributes, spatial relations, and linguistic expressions, with limited support for bilingual comprehension. To address these challenges, we introduce FG-CLIP 2, a bilingual vision-language model designed to advance fine-grained alignment for both English and Chinese. Our approach leverages rich fine-grained supervision, including region-text matching and long-caption modeling, alongside multiple discriminative objectives. We further introduce the Textual Intra-modal Contrastive (TIC) loss to better distinguish semantically similar captions. Trained on a carefully curated mixture of large-scale English and Chinese data, FG-CLIP 2 achieves powerful bilingual performance. To enable rigorous evaluation, we present a new benchmark for Chinese multimodal understanding, featuring long-caption retrieval and bounding box classification. Extensive experiments on 29 datasets across 8 tasks show that FG-CLIP 2 outperforms existing methods, achieving state-of-the-art results in both languages. We release the model, code, and benchmark to facilitate future research on bilingual fine-grained alignment.

artificial intelligence, natural language, object-oriented architecture, (19 more...)

arXiv.org Artificial Intelligence

2510.10921

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.66)
Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.48)

Add feedback

Data or Language Supervision: What Makes CLIP Better than DINO?

Liu, Yiming, Zhang, Yuhui, Ghosh, Dhruba, Schmidt, Ludwig, Yeung-Levy, Serena

arXiv.org Artificial IntelligenceOct-15-2025

CLIP outperforms self-supervised models like DINO as vision encoders for vision-language models (VLMs), but it remains unclear whether this advantage stems from CLIP's language supervision or its much larger training data. To disentangle these factors, we pre-train CLIP and DINO under controlled settings -- using the same architecture, dataset, and training configuration -- achieving similar ImageNet accuracy. Embedding analysis shows that CLIP captures high-level semantics (e.g., object categories, text), while DINO is more responsive to low-level features like colors and styles. When integrated into VLMs and evaluated on 20 VQA benchmarks, CLIP excels at text-intensive tasks, while DINO slightly outperforms on vision-centric ones. Variants of language supervision (e.g., sigmoid loss, pre-trained language encoders) yield limited gains. Our findings provide scientific insights into vision encoder design and its impact on VLM performance.

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2510.11835

Country: North America > United States > California (0.28)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.49)
Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.34)

Add feedback

Product-oriented Product-Process-Resource Asset Network and its Representation in AutomationML for Asset Administration Shell

Strakosova, Sara, Novak, Petr, Kadera, Petr

arXiv.org Artificial IntelligenceOct-15-2025

Abstract--Current products, especially in the automotive sector, pose complex technical systems having a multi-disciplinary mechatronic nature. Industrial standards supporting system engineering and production typically (i) address the production phase only, but do not cover the complete product life cycle, and (ii) focus on production processes and resources rather than the products themselves. The presented approach is motivated by incorporating the impacts of the end-of-life phase of the product life cycle into the engineering phase. This paper proposes a modeling approach coming up from the Product-Process-Resource (PPR) modeling paradigm. It combines requirements on (i) respecting the product structure as a basis for the model, and (ii) incorporates repairing, remanufacturing, or upcycling within cyber-physical production systems. The proposed model called PoPAN should accompany the product during the entire life cycle as a digital shadow encapsulated within the Asset Administration Shell of a product. T o facilitate the adoption of the proposed paradigm, the paper also proposes serialization of the model in the AutomationML data format. The model is demonstrated on a use-case for disassembling electric vehicle batteries to support their remanufacturing for stationary battery applications.

artificial intelligence, asset administration shell, object-oriented architecture, (17 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/ETFA61755.2024.10710680

2510.00933

Country: Europe > Czechia (0.15)

Genre: Research Report (0.83)

Industry:

Transportation > Ground > Road (1.00)
Transportation > Electric Vehicle (1.00)
Automobiles & Trucks (1.00)
Energy > Energy Storage (0.88)

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.46)

Add feedback

InternScenes: A Large-scale Simulatable Indoor Scene Dataset with Realistic Layouts

Zhong, Weipeng, Cao, Peizhou, Jin, Yichen, Luo, Li, Cai, Wenzhe, Lin, Jingli, Wang, Hanqing, Lyu, Zhaoyang, Wang, Tai, Dai, Bo, Xu, Xudong, Pang, Jiangmiao

arXiv.org Artificial IntelligenceOct-15-2025

The advancement of Embodied AI heavily relies on large-scale, simulatable 3D scene datasets characterized by scene diversity and realistic layouts. However, existing datasets typically suffer from limitations in data scale or diversity, sanitized layouts lacking small items, and severe object collisions. To address these shortcomings, we introduce \textbf{InternScenes}, a novel large-scale simulatable indoor scene dataset comprising approximately 40,000 diverse scenes by integrating three disparate scene sources, real-world scans, procedurally generated scenes, and designer-created scenes, including 1.96M 3D objects and covering 15 common scene types and 288 object classes. We particularly preserve massive small items in the scenes, resulting in realistic and complex layouts with an average of 41.5 objects per region. Our comprehensive data processing pipeline ensures simulatability by creating real-to-sim replicas for real-world scans, enhances interactivity by incorporating interactive objects into these scenes, and resolves object collisions by physical simulations. We demonstrate the value of InternScenes with two benchmark applications: scene layout generation and point-goal navigation. Both show the new challenges posed by the complex and realistic layouts. More importantly, InternScenes paves the way for scaling up the model training for both tasks, making the generation and navigation in such complex scenes possible. We commit to open-sourcing the data, models, and benchmarks to benefit the whole community.

machine learning, natural language, object-oriented architecture, (20 more...)

arXiv.org Artificial Intelligence

2509.10813

Genre: Research Report (1.00)

Industry: Information Technology (0.34)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.67)

Add feedback

GRIP: A Unified Framework for Grid-Based Relay and Co-Occurrence-Aware Planning in Dynamic Environments

Alanazi, Ahmed, Ho, Duy, Lee, Yugyung

arXiv.org Artificial IntelligenceOct-14-2025

Robots navigating dynamic, cluttered, and semantically complex environments must integrate perception, symbolic reasoning, and spatial planning to generalize across diverse layouts and object categories. Existing methods often rely on static priors or limited memory, constraining adaptability under partial observability and semantic ambiguity. We present GRIP, Grid-based Relay with Intermediate Planning, a unified, modular framework with three scalable variants: GRIP-L (Lightweight), optimized for symbolic navigation via semantic occupancy grids; GRIP-F (Full), supporting multi-hop anchor chaining and LLM-based introspection; and GRIP-R (Real-World), enabling physical robot deployment under perceptual uncertainty. GRIP integrates dynamic 2D grid construction, open-vocabulary object grounding, co-occurrence-aware symbolic planning, and hybrid policy execution using behavioral cloning, D* search, and grid-conditioned control. Empirical results on AI2-THOR and RoboTHOR benchmarks show that GRIP achieves up to 9.6% higher success rates and over $2\times$ improvement in path efficiency (SPL and SAE) on long-horizon tasks. Qualitative analyses reveal interpretable symbolic plans in ambiguous scenes. Real-world deployment on a Jetbot further validates GRIP's generalization under sensor noise and environmental variation. These results position GRIP as a robust, scalable, and explainable framework bridging simulation and real-world navigation.

large language model, machine learning, natural language, (23 more...)

arXiv.org Artificial Intelligence

2510.10865

Country: North America > United States (0.46)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (1.00)
(3 more...)

Add feedback

From Programs to Poses: Factored Real-World Scene Generation via Learned Program Libraries

Hsu, Joy, Jin, Emily, Wu, Jiajun, Mitra, Niloy J.

arXiv.org Artificial IntelligenceOct-14-2025

Real-world scenes, such as those in ScanNet, are difficult to capture, with highly limited data available. Generating realistic scenes with varied object poses remains an open and challenging task. In this work, we propose FactoredScenes, a framework that synthesizes realistic 3D scenes by leveraging the underlying structure of rooms while learning the variation of object poses from lived-in scenes. We introduce a factored representation that decomposes scenes into hierarchically organized concepts of room programs and object poses. To encode structure, FactoredScenes learns a library of functions capturing reusable layout patterns from which scenes are drawn, then uses large language models to generate high-level programs, regularized by the learned library. To represent scene variations, FactoredScenes learns a program-conditioned model to hierarchically predict object poses, and retrieves and places 3D objects in a scene. We show that FactoredScenes generates realistic, real-world rooms that are difficult to distinguish from real ScanNet scenes.

machine learning, natural language, object-oriented architecture, (20 more...)

arXiv.org Artificial Intelligence

2510.10292

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)

Add feedback