Object-Oriented Architecture
Decouple before Align: Visual Disentanglement Enhances Prompt Tuning
Zhang, Fei, Zhou, Tianfei, Yao, Jiangchao, Zhang, Ya, Tsang, Ivor W., Wang, Yanfeng
Prompt tuning (PT), as an emerging resource-efficient fine-tuning paradigm, has showcased remarkable effectiveness in improving the task-specific transferability of vision-language models. This paper delves into a previously overlooked information asymmetry issue in PT, where the visual modality typically conveys more context than the object-oriented textual modality. Correspondingly, coarsely aligning these two modalities could result in biased attention, driving the model to focus merely on the context area. To address this, we propose DAPT, an effective PT framework based on an intuitive decouple-before-align concept. First, we explicitly decouple the visual modality into foreground and background representations by exploiting coarse-and-fine visual segmenting cues, and then align these decoupled patterns with the original foreground texts and the hand-crafted background classes, respectively, thereby symmetrically strengthening the modal alignment. To further enhance the visual concentration, we propose a visual pull-push regularization tailored for the foreground-background patterns, directing the original visual representation towards unbiased attention on the region-of-interest object. We demonstrate the power of architecture-free DAPT through few-shot learning, base-to-novel generalization, and data-efficient learning, all of which yield superior performance across prevailing benchmarks. Our code will be released at https://github.com/Ferenas/DAPT.
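The abstract does not give the form of the pull-push regularization, so the following is only a minimal sketch of the general idea, assuming a triplet-style margin loss as a stand-in: the original visual feature is pulled towards the foreground feature and pushed away from the background feature. The function names, the cosine distance, and the `margin` value are illustrative assumptions, not taken from DAPT.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def pull_push_loss(visual, foreground, background, margin=0.2):
    """Hypothetical triplet-style regularizer (an assumed stand-in, not DAPT's loss).

    The loss is zero once the visual feature is closer to the foreground
    feature than to the background feature by at least `margin`.
    """
    pull = cosine_distance(visual, foreground)  # should become small
    push = cosine_distance(visual, background)  # should become large
    return max(0.0, pull - push + margin)


# A feature aligned with the foreground incurs no penalty,
# while one aligned with the background is penalized.
focused = pull_push_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
distracted = pull_push_loss([0.0, 1.0], [1.0, 0.0], [0.0, 1.0])
```

A margin-based form is one common way to express "pull towards one target, push from another"; the paper may well use a different formulation.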
Towards a unified framework for programming paradigms: A systematic review of classification formalisms and methodological foundations
The rise of multi-paradigm languages challenges traditional classification methods, leading to practical software engineering issues like interoperability defects. This systematic literature review (SLR) maps the formal foundations of programming paradigms. Our objective is twofold: (1) to assess the state of the art of classification formalisms and their limitations, and (2) to identify the conceptual primitives and mathematical frameworks for a more powerful, reconstructive approach. Based on a synthesis of 74 primary studies, we find that existing taxonomies lack conceptual granularity, a unified formal basis, and struggle with hybrid languages. In response, our analysis reveals a strong convergence toward a compositional reconstruction of paradigms. This approach identifies a minimal set of orthogonal, atomic primitives and leverages mathematical frameworks, predominantly Type theory, Category theory and Unifying Theories of Programming (UTP), to formally guarantee their compositional properties. We conclude that the literature reflects a significant intellectual shift away from classification towards these promising formal, reconstructive frameworks. This review provides a map of this evolution and proposes a research agenda for their unification.
All in One: Visual-Description-Guided Unified Point Cloud Segmentation
Han, Zongyan, Boudjoghra, Mohamed El Amine, Dong, Jiahua, Wang, Jinhong, Anwer, Rao Muhammad
Unified segmentation of 3D point clouds is crucial for scene understanding, but is hindered by its sparse structure, limited annotations, and the challenge of distinguishing fine-grained object classes in complex environments. Existing methods often struggle to capture rich semantic and contextual information due to limited supervision and a lack of diverse multimodal cues, leading to suboptimal differentiation of classes and instances. To address these challenges, we propose VDG-Uni3DSeg, a novel framework that integrates pre-trained vision-language models (e.g., CLIP) and large language models (LLMs) to enhance 3D segmentation. By leveraging LLM-generated textual descriptions and reference images from the internet, our method incorporates rich multimodal cues, facilitating fine-grained class and instance separation. We further design a Semantic-Visual Contrastive Loss to align point features with multimodal queries and a Spatial Enhanced Module to model scene-wide relationships efficiently. Operating within a closed-set paradigm that utilizes multimodal knowledge generated offline, VDG-Uni3DSeg achieves state-of-the-art results in semantic, instance, and panoptic segmentation, offering a scalable and practical solution for 3D understanding. Our code is available at https://github.com/Hanzy1996/VDG-Uni3DSeg.
Adaptive Articulated Object Manipulation On The Fly with Foundation Model Reasoning and Part Grounding
Zhang, Xiaojie, Wang, Yuanfei, Wu, Ruihai, Xu, Kunqi, Li, Yu, Xiang, Liuyu, Dong, Hao, He, Zhaofeng
Articulated objects pose diverse manipulation challenges for robots. Since their internal structures are not directly observable, robots must adaptively explore and refine actions to generate successful manipulation trajectories. While existing works have attempted cross-category generalization in adaptive articulated object manipulation, two major challenges persist: (1) the geometric diversity of real-world articulated objects complicates visual perception and understanding, and (2) variations in object functions and mechanisms hinder the development of a unified adaptive manipulation strategy. To address these challenges, we propose AdaRPG, a novel framework that leverages foundation models to extract object parts, which exhibit greater local geometric similarity than entire objects, thereby enhancing visual affordance generalization for functional primitive skills. To support this, we construct a part-level affordance annotation dataset to train the affordance model. Additionally, AdaRPG utilizes the common knowledge embedded in foundation models to reason about complex mechanisms and generate high-level control codes that invoke primitive skill functions based on part affordance inference. Simulation and real-world experiments demonstrate AdaRPG's strong generalization ability across novel articulated object categories.
IndoorBEV: Joint Detection and Footprint Completion of Objects via Mask-based Prediction in Indoor Scenarios for Bird's-Eye View Perception
Li, Haichuan, Tian, Changda, Trahanias, Panos, Westerlund, Tomi
The deployment of autonomous robots in indoor environments necessitates precise and real-time perception of their surroundings to ensure safe and efficient navigation. Lidar sensors have emerged as a pivotal technology in this domain, offering high-resolution 3D point cloud data that is robust to varying lighting conditions and capable of capturing intricate spatial details. However, transforming this unstructured point cloud data into actionable representations for tasks such as object detection, segmentation, and navigation remains a formidable challenge, particularly given the complexity and clutter often found indoors. Recent advancements have sought to address these challenges. For instance, the MakeWay [1] system introduces object-aware costmaps derived from lidar data to enhance proactive indoor navigation. Similarly, the LV-DOT framework [2] leverages a fusion of lidar and visual data to improve dynamic obstacle detection and tracking in indoor settings. These approaches underscore the potential of integrating machine learning techniques with lidar data to enhance indoor perception. Bird's-Eye View (BEV) representations naturally handle occlusions and, because they offer a top-down, spatially consistent view of the environment, are directly amenable to downstream robotic tasks like navigation and planning.
Discovering and using Spelke segments
Venkatesh, Rahul, Kotar, Klemen, Chen, Lilian Naing, Kim, Seungwoo, Wheeler, Luca Thomas, Watrous, Jared, Xu, Ashley, Ancone, Gia, Lee, Wanhee, Chen, Honglin, Bear, Daniel, Stojanov, Stefan, Yamins, Daniel
Segments in computer vision are often defined by semantic considerations and are highly dependent on category-specific conventions. In contrast, developmental psychology suggests that humans perceive the world in terms of Spelke objects--groupings of physical things that reliably move together when acted on by physical forces. Spelke objects thus operate on category-agnostic causal motion relationships which potentially better support tasks like manipulation and planning. In this paper, we first benchmark the Spelke object concept, introducing the SpelkeBench dataset that contains a wide variety of well-defined Spelke segments in natural images. Next, to extract Spelke segments from images algorithmically, we build SpelkeNet, a class of visual world models trained to predict distributions over future motions. SpelkeNet supports estimation of two key concepts for Spelke object discovery: (1) the motion affordance map, identifying regions likely to move under a poke, and (2) the expected-displacement map, capturing how the rest of the scene will move. These concepts are used for "statistical counterfactual probing", where diverse "virtual pokes" are applied on regions of high motion-affordance, and the resultant expected displacement maps are used to define Spelke segments as statistical aggregates of correlated motion statistics. We find that SpelkeNet outperforms supervised baselines like SegmentAnything (SAM) on SpelkeBench. Finally, we show that the Spelke concept is practically useful for downstream applications, yielding superior performance on the 3DEditBench benchmark for physical object manipulation when used in a variety of off-the-shelf object manipulation models.
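The idea of defining a segment as a statistical aggregate of correlated motion can be illustrated with a toy sketch. It assumes the expected-displacement maps for several virtual pokes are already available as per-pixel `(dx, dy)` vectors; the function name, the projection-based score, and the threshold are hypothetical simplifications, not SpelkeNet's actual procedure.

```python
def spelke_segment(displacement_maps, poke_index, threshold=0.5):
    """Toy aggregate of correlated motion (an assumed simplification).

    displacement_maps: one map per virtual poke; each map is a list of
        (dx, dy) expected displacements, one entry per pixel.
    poke_index: index of the poked pixel within each map.
    Returns a per-pixel boolean mask: True where the pixel moves, on
    average, along with the poked pixel.
    """
    n_pixels = len(displacement_maps[0])
    scores = [0.0] * n_pixels
    for flow in displacement_maps:
        px, py = flow[poke_index]
        norm_sq = px * px + py * py
        if norm_sq == 0:
            continue  # this poke produced no motion at the poked pixel
        for i, (dx, dy) in enumerate(flow):
            # Projection of the pixel's motion onto the poked pixel's motion.
            scores[i] += (dx * px + dy * py) / norm_sq
    scores = [s / len(displacement_maps) for s in scores]
    return [s >= threshold for s in scores]


# Two virtual pokes on a four-pixel "image": pixels 0 and 1 move together,
# pixel 2 stays still, and pixel 3 moves slightly in an unrelated direction.
maps = [
    [(1.0, 0.0), (1.0, 0.0), (0.0, 0.0), (0.0, 0.0)],
    [(0.0, 2.0), (0.0, 2.0), (0.0, 0.0), (0.1, 0.0)],
]
mask = spelke_segment(maps, poke_index=0)
```

Averaging over several pokes in different directions is what keeps pixel 3 out of the mask here: its motion does not consistently follow the poked pixel.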
Dissociating model architectures from inference computations
Parr et al. (2025) examine how auto-regressive and deep temporal models differ in their treatment of non-Markovian sequence modelling. Building on this, we highlight the need for dissociating model architectures, i.e., how the predictive distribution factorises, from the computations invoked at inference. We demonstrate that deep temporal computations can be mimicked by autoregressive models by structuring context access during iterative inference. Using a transformer trained on next-token prediction, we show that inducing hierarchical temporal factorisation during iterative inference maintains predictive capacity while instantiating fewer computations. This emphasises that processes for constructing and refining predictions are not necessarily bound to their underlying model architectures.
Tree-SLAM: semantic object SLAM for efficient mapping of individual trees in orchards
Rapado-Rincon, David, Kootstra, Gert
Accurate mapping of individual trees is an important component for precision agriculture in orchards, as it allows autonomous robots to perform tasks like targeted operations or individual tree monitoring. However, creating these maps is challenging because GPS signals are often unreliable under dense tree canopies. Furthermore, standard Simultaneous Localization and Mapping (SLAM) approaches struggle in orchards because the repetitive appearance of trees can confuse the system, leading to mapping errors. To address this, we introduce Tree-SLAM, a semantic SLAM approach tailored for creating maps of individual trees in orchards. Utilizing RGB-D images, our method detects tree trunks with an instance segmentation model, estimates their location and re-identifies them using a cascade-graph-based data association algorithm. These re-identified trunks serve as landmarks in a factor graph framework that integrates noisy GPS signals, odometry, and trunk observations. The system produces maps of individual trees with a geo-localization error as low as 18 cm, which is less than 20% of the planting distance. The proposed method was validated on diverse datasets from apple and pear orchards across different seasons, demonstrating high mapping accuracy and robustness in scenarios with unreliable GPS signals.
Keywords: semantic SLAM, agricultural robotics, multi-object tracking, factor graph
1. Introduction
A significant decline in available agricultural labor presents a challenge for sustaining agricultural production, potentially leading to food losses [1, 2]. Automation and robotics are emerging as key technologies to address these issues, offering the potential to enhance productivity, by compensating for labor scarcity and optimizing farm management through data-driven insights [3, 4]. This is particularly relevant in high-value crops such as those found in orchards, where precise operations have the potential to improve efficiency and reduce labor needs.
For autonomous robots to perform tasks effectively in orchards, such as targeted spraying or individual tree monitoring, they require a detailed map of the environment and the ability to determine their position within it.
SegVec3D: A Method for Vector Embedding of 3D Objects Oriented Towards Robot manipulation
However, due to their inherent sparsity, disorder, and lack of structure, instance-level semantic understanding of point clouds remains challenging - particularly under conditions of limited supervision and cross-modal semantic ambiguity. To address these issues, we propose SegVec3D, a novel framework integrating attention mechanisms, embedding learning, and cross-modal alignment techniques for 3D point cloud instance segmentation. The proposed approach first builds a hierarchical instance feature extractor based on spatial adjacency and attention computation, enhancing the model's ability to capture fine-grained geometric structures. It then introduces a high-dimensional embedding space, enabling unsupervised instance segmentation through a contrastive-learning-based clustering mechanism. Furthermore, a shared cross-modal semantic space is constructed to align 3D data with natural language descriptions, allowing zero-shot understanding and retrieval of 3D objects given text queries. The model is ultimately deployed and validated in realistic scenarios, demonstrating strong generalizability and engineering feasibility. While recent methods like Mask3D [40] and ULIP [10][11] have advanced 3D segmentation and vision-language pre-training respectively, our approach uniquely integrates these domains by enabling instance segmentation with minimal labeling and directly aligning point clouds with language. Experimental evaluations confirm that the proposed method achieves high semantic discriminability, robust multi-modal alignment, and practical deployability. It supports weakly-supervised or unsupervised 3D instance understanding, providing a promising foundation for future multi-modal cognitive robotic systems.
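Retrieval in a shared cross-modal space commonly reduces to a nearest-neighbour search between a text embedding and the stored object embeddings. The sketch below illustrates that step with cosine similarity, assuming the embeddings have already been produced by some text and 3D encoders; the function names and the toy vectors are hypothetical and are not drawn from SegVec3D.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve(text_embedding, object_embeddings):
    """Return the name of the object whose embedding best matches the query.

    object_embeddings: dict mapping an object name to its embedding in the
    (assumed) shared text/3D space.
    """
    return max(
        object_embeddings,
        key=lambda name: cosine_similarity(text_embedding, object_embeddings[name]),
    )


# Toy 3-dimensional embeddings, made up purely for illustration.
objects = {
    "mug": [0.9, 0.1, 0.0],
    "chair": [0.0, 1.0, 0.2],
    "drill": [0.1, 0.0, 1.0],
}
query = [0.8, 0.2, 0.1]  # stands in for the embedding of a text query
best_match = retrieve(query, objects)
```

No labels for the queried class are needed at retrieval time, which is what makes this style of lookup zero-shot once the two modalities share a space.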
ObjectRL: An Object-Oriented Reinforcement Learning Codebase
Baykal, Gulcin, Akgül, Abdullah, Haussmann, Manuel, Tasdighi, Bahareh, Werge, Nicklas, Wu, Yi-Shan, Kandemir, Melih
ObjectRL is an open-source Python codebase for deep reinforcement learning (RL), designed for research-oriented prototyping with minimal programming effort. Unlike existing codebases, ObjectRL is built on Object-Oriented Programming (OOP) principles, providing a clear structure that simplifies the implementation, modification, and evaluation of new algorithms. ObjectRL lowers the entry barrier for deep RL research by organizing best practices into explicit, clearly separated components, making them easier to understand and adapt. Each algorithmic component is a class with attributes that describe key RL concepts and methods that intuitively reflect their interactions. The class hierarchy closely follows common ontological relationships, enabling data encapsulation, inheritance, and polymorphism, which are core features of OOP. We demonstrate the efficiency of ObjectRL's design through representative use cases that highlight its flexibility and suitability for rapid prototyping. The documentation and source code are available at https://objectrl.readthedocs.io and https://github.com/adinlab/objectrl.
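The OOP features named in the abstract (encapsulation, inheritance, polymorphism) can be illustrated with a small, generic sketch of an agent hierarchy. The class and method names below are hypothetical and do not reflect ObjectRL's actual API; tabular Q-learning is used only because it fits in a few lines.

```python
from abc import ABC, abstractmethod
import random


class Agent(ABC):
    """Hypothetical base class defining the interface all agents share."""

    def __init__(self, n_actions, seed=0):
        self.n_actions = n_actions
        self._rng = random.Random(seed)  # encapsulated source of randomness

    @abstractmethod
    def select_action(self, state):
        """Return an action index for the given state."""

    def learn(self, state, action, reward, next_state):
        """Default behaviour: no learning. Subclasses override as needed."""


class RandomAgent(Agent):
    """Inherits everything and only supplies an action rule."""

    def select_action(self, state):
        return self._rng.randrange(self.n_actions)


class QLearningAgent(Agent):
    """Tabular Q-learning with epsilon-greedy exploration."""

    def __init__(self, n_actions, lr=0.5, gamma=0.9, epsilon=0.1, seed=0):
        super().__init__(n_actions, seed)
        self.lr, self.gamma, self.epsilon = lr, gamma, epsilon
        self.q = {}  # maps (state, action) to an estimated value

    def _value(self, state, action):
        return self.q.get((state, action), 0.0)

    def select_action(self, state):
        if self._rng.random() < self.epsilon:
            return self._rng.randrange(self.n_actions)
        return max(range(self.n_actions), key=lambda a: self._value(state, a))

    def learn(self, state, action, reward, next_state):
        best_next = max(self._value(next_state, a) for a in range(self.n_actions))
        target = reward + self.gamma * best_next
        old = self._value(state, action)
        self.q[(state, action)] = old + self.lr * (target - old)


# Polymorphism: the same loop drives any Agent subclass.
agents = [RandomAgent(2), QLearningAgent(2, epsilon=0.0)]
for agent in agents:
    action = agent.select_action(0)
    agent.learn(0, action, 1.0, 1)
```

Because training code depends only on the `Agent` interface, swapping in a new algorithm means adding a subclass rather than editing the loop, which is the kind of prototyping benefit the abstract attributes to an object-oriented design.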