AITopics | Object-Oriented Architecture

Object-Oriented Architecture

$NavA^3$: Understanding Any Instruction, Navigating Anywhere, Finding Anything

Zhang, Lingfeng, Hao, Xiaoshuai, Tang, Yingbo, Fu, Haoxiang, Zheng, Xinyu, Wang, Pengwei, Wang, Zhongyuan, Ding, Wenbo, Zhang, Shanghang

arXiv.org Artificial Intelligence, Aug-7-2025

Embodied navigation is a fundamental capability of embodied intelligence, enabling robots to move and interact within physical environments. However, existing navigation tasks primarily focus on predefined object navigation or instruction following, which significantly differs from human needs in real-world scenarios involving complex, open-ended scenes. To bridge this gap, we introduce a challenging long-horizon navigation task that requires understanding high-level human instructions and performing spatial-aware object navigation in real-world environments. Existing embodied navigation methods struggle with such tasks due to their limitations in comprehending high-level human instructions and localizing objects with an open vocabulary. In this paper, we propose $NavA^3$, a hierarchical framework divided into two stages: global and local policies. In the global policy, we leverage the reasoning capabilities of Reasoning-VLM to parse high-level human instructions and integrate them with global 3D scene views. This allows us to reason and navigate to regions most likely to contain the goal object. In the local policy, we have collected a dataset of 1.0 million samples of spatial-aware object affordances to train the NaviAfford model (PointingVLM), which provides robust open-vocabulary object localization and spatial awareness for precise goal identification and navigation in complex environments. Extensive experiments demonstrate that $NavA^3$ achieves SOTA results in navigation performance and can successfully complete long-horizon navigation tasks across different robot embodiments in real-world settings, paving the way for universal embodied navigation. The dataset and code will be made available. Project website: https://NavigationA3.github.io/.

large language model, machine learning, natural language, (19 more...)

2508.04598

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
(4 more...)

Think Before You Segment: An Object-aware Reasoning Agent for Referring Audio-Visual Segmentation

Zhou, Jinxing, Zhou, Yanghao, Han, Mingfei, Wang, Tong, Chang, Xiaojun, Cholakkal, Hisham, Anwer, Rao Muhammad

arXiv.org Artificial Intelligence, Aug-7-2025

Referring Audio-Visual Segmentation (Ref-AVS) aims to segment target objects in audible videos based on given reference expressions. Prior works typically rely on learning latent embeddings via multimodal fusion to prompt a tunable SAM/SAM2 decoder for segmentation, which requires strong pixel-level supervision and lacks interpretability. From a novel perspective of explicit reference understanding, we propose TGS-Agent, which decomposes the task into a Think-Ground-Segment process, mimicking the human reasoning procedure by first identifying the referred object through multimodal analysis, followed by coarse-grained grounding and precise segmentation. To this end, we first propose Ref-Thinker, a multimodal language model capable of reasoning over textual, visual, and auditory cues. We construct an instruction-tuning dataset with explicit object-aware think-answer chains for Ref-Thinker fine-tuning. The object description inferred by Ref-Thinker is used as an explicit prompt for Grounding-DINO and SAM2, which perform grounding and segmentation without relying on pixel-level supervision. Additionally, we introduce R$^2$-AVSBench, a new benchmark with linguistically diverse and reasoning-intensive references for better evaluating model generalization. Our approach achieves state-of-the-art results on both standard Ref-AVSBench and proposed R$^2$-AVSBench. Code will be available at https://github.com/jasongief/TGS-Agent.

large language model, machine learning, natural language, (20 more...)

2508.04418

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
(2 more...)

Decouple before Align: Visual Disentanglement Enhances Prompt Tuning

Zhang, Fei, Zhou, Tianfei, Yao, Jiangchao, Zhang, Ya, Tsang, Ivor W., Wang, Yanfeng

arXiv.org Artificial Intelligence, Aug-4-2025

Prompt tuning (PT), as an emerging resource-efficient fine-tuning paradigm, has showcased remarkable effectiveness in improving the task-specific transferability of vision-language models. This paper delves into a previously overlooked information asymmetry issue in PT, where the visual modality mostly conveys more context than the object-oriented textual modality. Correspondingly, coarsely aligning these two modalities could result in biased attention, driving the model to merely focus on the context area. To address this, we propose DAPT, an effective PT framework based on an intuitive decouple-before-align concept. First, we propose to explicitly decouple the visual modality into foreground and background representations by exploiting coarse-and-fine visual segmenting cues, and then both of these decoupled patterns are aligned with the original foreground texts and the hand-crafted background classes, thereby symmetrically strengthening the modal alignment. To further enhance the visual concentration, we propose a visual pull-push regularization tailored for the foreground-background patterns, directing the original visual representation towards unbiased attention on the region-of-interest object. We demonstrate the power of architecture-free DAPT through few-shot learning, base-to-novel generalization, and data-efficient learning, all of which yield superior performance across prevailing benchmarks. Our code will be released at https://github.com/Ferenas/DAPT.

large language model, machine learning, recognition, (14 more...)

2508.00395

Country: Asia > China (0.48)

Genre: Research Report (1.00)

Industry: Education (0.67)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
(3 more...)

Towards a unified framework for programming paradigms: A systematic review of classification formalisms and methodological foundations

Vandeloise, Mikel

arXiv.org Artificial Intelligence, Aug-4-2025

The rise of multi-paradigm languages challenges traditional classification methods, leading to practical software engineering issues like interoperability defects. This systematic literature review (SLR) maps the formal foundations of programming paradigms. Our objective is twofold: (1) to assess the state of the art of classification formalisms and their limitations, and (2) to identify the conceptual primitives and mathematical frameworks for a more powerful, reconstructive approach. Based on a synthesis of 74 primary studies, we find that existing taxonomies lack conceptual granularity and a unified formal basis, and struggle with hybrid languages. In response, our analysis reveals a strong convergence toward a compositional reconstruction of paradigms. This approach identifies a minimal set of orthogonal, atomic primitives and leverages mathematical frameworks, predominantly Type theory, Category theory and Unifying Theories of Programming (UTP), to formally guarantee their compositional properties. We conclude that the literature reflects a significant intellectual shift away from classification towards these promising formal, reconstructive frameworks. This review provides a map of this evolution and proposes a research agenda for their unification.

logic & formal reasoning, paradigm, programming language, (22 more...)

2508.00534

Country:

Europe > United Kingdom > England (0.46)
North America > United States (0.46)

Genre:

Research Report > New Finding (1.00)
Overview (1.00)

Technology:

Information Technology > Software > Programming Languages (1.00)
Information Technology > Software Engineering (1.00)
Information Technology > Information Management (1.00)
(2 more...)

All in One: Visual-Description-Guided Unified Point Cloud Segmentation

Han, Zongyan, Boudjoghra, Mohamed El Amine, Dong, Jiahua, Wang, Jinhong, Anwer, Rao Muhammad

arXiv.org Artificial Intelligence, Jul-28-2025

Unified segmentation of 3D point clouds is crucial for scene understanding, but is hindered by its sparse structure, limited annotations, and the challenge of distinguishing fine-grained object classes in complex environments. Existing methods often struggle to capture rich semantic and contextual information due to limited supervision and a lack of diverse multimodal cues, leading to suboptimal differentiation of classes and instances. To address these challenges, we propose VDG-Uni3DSeg, a novel framework that integrates pre-trained vision-language models (e.g., CLIP) and large language models (LLMs) to enhance 3D segmentation. By leveraging LLM-generated textual descriptions and reference images from the internet, our method incorporates rich multimodal cues, facilitating fine-grained class and instance separation. We further design a Semantic-Visual Contrastive Loss to align point features with multimodal queries and a Spatial Enhanced Module to model scene-wide relationships efficiently. Operating within a closed-set paradigm that utilizes multimodal knowledge generated offline, VDG-Uni3DSeg achieves state-of-the-art results in semantic, instance, and panoptic segmentation, offering a scalable and practical solution for 3D understanding. Our code is available at https://github.com/Hanzy1996/VDG-Uni3DSeg.

large language model, machine learning, segmentation, (13 more...)

2507.05211

Country: Asia > Middle East > UAE (0.28)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.88)

Adaptive Articulated Object Manipulation On The Fly with Foundation Model Reasoning and Part Grounding

Zhang, Xiaojie, Wang, Yuanfei, Wu, Ruihai, Xu, Kunqi, Li, Yu, Xiang, Liuyu, Dong, Hao, He, Zhaofeng

arXiv.org Artificial Intelligence, Jul-25-2025

Articulated objects pose diverse manipulation challenges for robots. Since their internal structures are not directly observable, robots must adaptively explore and refine actions to generate successful manipulation trajectories. While existing works have attempted cross-category generalization in adaptive articulated object manipulation, two major challenges persist: (1) the geometric diversity of real-world articulated objects complicates visual perception and understanding, and (2) variations in object functions and mechanisms hinder the development of a unified adaptive manipulation strategy. To address these challenges, we propose AdaRPG, a novel framework that leverages foundation models to extract object parts, which exhibit greater local geometric similarity than entire objects, thereby enhancing visual affordance generalization for functional primitive skills. To support this, we construct a part-level affordance annotation dataset to train the affordance model. Additionally, AdaRPG utilizes the common knowledge embedded in foundation models to reason about complex mechanisms and generate high-level control codes that invoke primitive skill functions based on part affordance inference. Simulation and real-world experiments demonstrate AdaRPG's strong generalization ability across novel articulated object categories.

large language model, machine learning, natural language, (18 more...)

2507.18276

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.48)
Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.36)

IndoorBEV: Joint Detection and Footprint Completion of Objects via Mask-based Prediction in Indoor Scenarios for Bird's-Eye View Perception

Li, Haichuan, Tian, Changda, Trahanias, Panos, Westerlund, Tomi

arXiv.org Artificial Intelligence, Jul-24-2025

The deployment of autonomous robots in indoor environments necessitates precise and real-time perception of their surroundings to ensure safe and efficient navigation. Lidar sensors have emerged as a pivotal technology in this domain, offering high-resolution 3D point cloud data that is robust to varying lighting conditions and capable of capturing intricate spatial details. However, transforming this unstructured point cloud data into actionable representations for tasks such as object detection, segmentation, and navigation remains a formidable challenge, particularly given the complexity and clutter often found indoors. Recent advancements have sought to address these challenges. For instance, the MakeWay system [1] introduces object-aware costmaps derived from lidar data to enhance proactive indoor navigation. Similarly, the LV-DOT framework [2] leverages a fusion of lidar and visual data to improve dynamic obstacle detection and tracking in indoor settings. These approaches underscore the potential of integrating machine learning techniques with lidar data to enhance indoor perception. Bird's-Eye View (BEV) representations naturally handle occlusions and are directly amenable to downstream robotic tasks like navigation and planning because they provide a top-down, spatially consistent view of the environment.

artificial intelligence, machine learning, object-oriented architecture, (19 more...)

2507.17445

Country: Europe (0.46)

Genre: Research Report (0.50)

Industry: Energy (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Discovering and using Spelke segments

Venkatesh, Rahul, Kotar, Klemen, Chen, Lilian Naing, Kim, Seungwoo, Wheeler, Luca Thomas, Watrous, Jared, Xu, Ashley, Ancone, Gia, Lee, Wanhee, Chen, Honglin, Bear, Daniel, Stojanov, Stefan, Yamins, Daniel

arXiv.org Artificial Intelligence, Jul-23-2025

Segments in computer vision are often defined by semantic considerations and are highly dependent on category-specific conventions. In contrast, developmental psychology suggests that humans perceive the world in terms of Spelke objects--groupings of physical things that reliably move together when acted on by physical forces. Spelke objects thus operate on category-agnostic causal motion relationships which potentially better support tasks like manipulation and planning. In this paper, we first benchmark the Spelke object concept, introducing the SpelkeBench dataset that contains a wide variety of well-defined Spelke segments in natural images. Next, to extract Spelke segments from images algorithmically, we build SpelkeNet, a class of visual world models trained to predict distributions over future motions. SpelkeNet supports estimation of two key concepts for Spelke object discovery: (1) the motion affordance map, identifying regions likely to move under a poke, and (2) the expected-displacement map, capturing how the rest of the scene will move. These concepts are used for "statistical counterfactual probing", where diverse "virtual pokes" are applied on regions of high motion-affordance, and the resultant expected displacement maps are used to define Spelke segments as statistical aggregates of correlated motion statistics. We find that SpelkeNet outperforms supervised baselines like SegmentAnything (SAM) on SpelkeBench. Finally, we show that the Spelke concept is practically useful for downstream applications, yielding superior performance on the 3DEditBench benchmark for physical object manipulation when used in a variety of off-the-shelf object manipulation models.
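The probing idea in the abstract above can be illustrated with a toy sketch. It is not SpelkeNet: the learned world model is replaced by a hand-written stand-in, and the names `GROUPS`, `expected_displacement`, and `probe_segment`, the 1D eight-pixel scene, and the threshold are all assumptions made for illustration. The sketch applies several virtual pokes at one pixel, normalises each expected-displacement map by the motion at the poked pixel, averages the maps, and keeps the pixels that move together with the poke.

```python
# Toy stand-in for a learned world model (hypothetical; not SpelkeNet).
# The scene is a 1D strip of 8 pixels; GROUPS records which pixels move together.
GROUPS = [0, 0, 0, 1, 1, 1, 1, 2]


def expected_displacement(poke_pixel, poke_strength):
    """Return the expected motion magnitude of every pixel for one virtual poke."""
    target = GROUPS[poke_pixel]
    # Pixels in the poked group move with the poke; the rest barely move.
    return [poke_strength if g == target else 0.05 * poke_strength for g in GROUPS]


def probe_segment(seed_pixel, strengths=(0.5, 1.0, 2.0), threshold=0.5):
    """Aggregate several virtual pokes at seed_pixel into one segment (pixel indices)."""
    totals = [0.0] * len(GROUPS)
    for strength in strengths:
        disp = expected_displacement(seed_pixel, strength)
        # Normalise by the motion at the poked pixel so that pokes of
        # different strength contribute comparably to the aggregate.
        ref = disp[seed_pixel]
        for i, d in enumerate(disp):
            totals[i] += d / ref
    mean = [t / len(strengths) for t in totals]
    # Keep the pixels whose average relative motion is high.
    return [i for i, m in enumerate(mean) if m >= threshold]


print(probe_segment(4))
```

In this toy scene, poking pixel 4 recovers the group of pixels that the stand-in model moves with it; a real system would obtain the displacement maps from a trained predictor rather than from a lookup table.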

large language model, machine learning, natural language, (21 more...)

2507.16038

Genre: Research Report (0.65)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.46)
Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.46)

Dissociating model architectures from inference computations

Sajid, Noor, Medrano, Johan

arXiv.org Artificial Intelligence, Jul-22-2025

Parr et al. (2025) examine how auto-regressive and deep temporal models differ in their treatment of non-Markovian sequence modelling. Building on this, we highlight the need for dissociating model architectures, i.e., how the predictive distribution factorises, from the computations invoked at inference. We demonstrate that deep temporal computations are mimicked by autoregressive models by structuring context access during iterative inference. Using a transformer trained on next-token prediction, we show that inducing hierarchical temporal factorisation during iterative inference maintains predictive capacity while instantiating fewer computations. This emphasises that processes for constructing and refining predictions are not necessarily bound to their underlying model architectures.
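As a reading aid for "how the predictive distribution factorises", the standard autoregressive factorisation of a sequence $x_{1:T}$ is written out below. The second line is only an illustrative way to express structured context access: the index set $c(t)$ is notation assumed here, not taken from the paper.

```latex
% Standard autoregressive factorisation: each token conditions on the full past.
p(x_{1:T}) = \prod_{t=1}^{T} p\left(x_t \mid x_{1:t-1}\right)

% Illustrative restricted-context variant (assumed notation): at step t the
% model only accesses a subset c(t) of the past.
p_c(x_{1:T}) = \prod_{t=1}^{T} p\left(x_t \mid x_{c(t)}\right),
\qquad c(t) \subseteq \{1, \dots, t-1\}
```

When $c(t) = \{1, \dots, t-1\}$ for every $t$, the second expression reduces to the first.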

artificial intelligence, model architecture, object-oriented architecture, (18 more...)

doi: 10.1080/17588928.2025.2532604

2507.15776

Country: North America > United States > Massachusetts > Middlesex County > Cambridge (0.15)

Genre: Research Report (0.40)

Industry: Health & Medicine > Therapeutic Area > Neurology (0.99)

Technology:

Information Technology > Artificial Intelligence > Cognitive Science (0.99)
Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.83)

Tree-SLAM: semantic object SLAM for efficient mapping of individual trees in orchards

Rapado-Rincon, David, Kootstra, Gert

arXiv.org Artificial Intelligence, Jul-17-2025

Accurate mapping of individual trees is an important component for precision agriculture in orchards, as it allows autonomous robots to perform tasks like targeted operations or individual tree monitoring. However, creating these maps is challenging because GPS signals are often unreliable under dense tree canopies. Furthermore, standard Simultaneous Localization and Mapping (SLAM) approaches struggle in orchards because the repetitive appearance of trees can confuse the system, leading to mapping errors. To address this, we introduce Tree-SLAM, a semantic SLAM approach tailored for creating maps of individual trees in orchards. Utilizing RGB-D images, our method detects tree trunks with an instance segmentation model, estimates their location and re-identifies them using a cascade-graph-based data association algorithm. These re-identified trunks serve as landmarks in a factor graph framework that integrates noisy GPS signals, odometry, and trunk observations. The system produces maps of individual trees with a geo-localization error as low as 18 cm, which is less than 20% of the planting distance. The proposed method was validated on diverse datasets from apple and pear orchards across different seasons, demonstrating high mapping accuracy and robustness in scenarios with unreliable GPS signals.

Keywords: semantic SLAM, agricultural robotics, multi-object tracking, factor graph

1. Introduction

A significant decline in available agricultural labor presents a challenge for sustaining agricultural production, potentially leading to food losses [1, 2]. Automation and robotics are emerging as key technologies to address these issues, offering the potential to enhance productivity, by compensating for labor scarcity and optimizing farm management through data-driven insights [3, 4]. This is particularly relevant in high-value crops such as those found in orchards, where precise operations have the potential to improve efficiency and reduce labor needs.

For autonomous robots to perform tasks effectively in orchards, such as targeted spraying or individual tree monitoring, they require a detailed map of the environment and the ability to determine their position within it.
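The factor-graph fusion of GPS, odometry, and trunk observations mentioned in the abstract above can be sketched in one dimension. This is a toy under stated assumptions, not the Tree-SLAM implementation: every variable is a scalar position along a single row, all measurements are linear, the weights are made up, and the solver is plain Gauss-Seidel sweeps rather than a SLAM library. The function name `solve_1d_factor_graph` and the example measurements are hypothetical.

```python
def solve_1d_factor_graph(n_vars, unary, binary, sweeps=2000):
    """Minimise sum w*(x_i - v)^2 + sum w*(x_j - x_i - d)^2 by Gauss-Seidel sweeps.

    unary:  list of (i, v, w)    -> absolute measurement of variable i (e.g. GPS)
    binary: list of (i, j, d, w) -> relative measurement x_j - x_i = d
                                    (e.g. odometry or a trunk observation)
    """
    x = [0.0] * n_vars
    for _ in range(sweeps):
        for k in range(n_vars):
            num, den = 0.0, 0.0
            for i, v, w in unary:
                if i == k:
                    num += w * v
                    den += w
            for i, j, d, w in binary:
                if j == k:      # this factor says x_k should equal x_i + d
                    num += w * (x[i] + d)
                    den += w
                elif i == k:    # this factor says x_k should equal x_j - d
                    num += w * (x[j] - d)
                    den += w
            if den > 0.0:
                # Weighted average of all targets is the exact minimiser for x_k
                # when the other variables are held fixed.
                x[k] = num / den
    return x


# Variables 0..2 are robot poses along a row; variable 3 is one trunk landmark.
gps = [(0, 0.0, 1.0), (1, 1.0, 1.0), (2, 2.0, 1.0)]    # weakly weighted absolute fixes
odometry = [(0, 1, 1.0, 10.0), (1, 2, 1.0, 10.0)]      # strongly weighted relative motion
trunk_obs = [(0, 3, 1.5, 10.0), (2, 3, -0.5, 10.0)]    # the same trunk seen from two poses
estimate = solve_1d_factor_graph(4, gps, odometry + trunk_obs)
print([round(v, 3) for v in estimate])
```

Here the measurements are mutually consistent, so the estimate settles on the poses and trunk position that satisfy all factors at once. With noisy GPS values, the heavier odometry and trunk factors would pull the poses toward evenly spaced positions while the GPS factors anchor the overall offset.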

detection, machine learning, object-oriented architecture, (19 more...)

2507.12093

Genre: Research Report > New Finding (0.68)

Industry: Food & Agriculture > Agriculture (1.00)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.50)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)
