Goto

Collaborating Authors

 Object-Oriented Architecture


Enabling Generic Robot Skill Implementation Using Object Oriented Programming

arXiv.org Artificial Intelligence

Developing robotic algorithms and integrating a robotic subsystem into a larger system can be a difficult task. Particularly in small and medium-sized enterprises (SMEs) where robotics expertise is lacking, implementing, maintaining and developing robotic systems can be a challenge. As a result, many companies rely on external expertise through system integrators, which, in some cases, can lead to vendor lock-in and external dependency. In the academic research on intelligent manufacturing systems, robots play a critical role in the design of robust autonomous systems. Similar challenges are faced by researchers who want to use robotic systems as a component in a larger smart system, without having to deal with the complexity and vastness of the robot interfaces in detail. In this paper, we propose a software framework that reduces the effort required to deploy a working robotic system. The focus is solely on providing a concept for simplifying the different interfaces of a modern robot system and using an abstraction layer for different manufacturers and models. The Python programming language is used to implement a prototype of the concept. The target system is a bin-picking cell containing a Yaskawa Motoman GP4.


Steerable Scene Generation with Post Training and Inference-Time Search

arXiv.org Artificial Intelligence

Training robots in simulation requires diverse 3D scenes that reflect the specific challenges of downstream tasks. However, scenes that satisfy strict task requirements, such as high-clutter environments with plausible spatial arrangement, are rare and costly to curate manually. Instead, we generate large-scale scene data using procedural models that approximate realistic environments for robotic manipulation, and adapt it to task-specific goals. We do this by training a unified diffusion-based generative model that predicts which objects to place from a fixed asset library, along with their SE(3) poses. This model serves as a flexible scene prior that can be adapted using reinforcement learning-based post training, conditional generation, or inference-time search, steering generation toward downstream objectives even when they differ from the original data distribution. Our method enables goal-directed scene synthesis that respects physical feasibility and scales across scene types. We introduce a novel MCTS-based inference-time search strategy for diffusion models, enforce feasibility via projection and simulation, and release a dataset of over 44 million SE(3) scenes spanning five diverse environments. Website with videos, code, data, and model weights: https://steerable-scene-generation.github.io/


Few-Shot Pattern Detection via Template Matching and Regression

arXiv.org Artificial Intelligence

W e address the problem of few-shot pattern detection, which aims to detect all instances of a given pattern, typically represented by a few exemplars, from an input image. Although similar problems have been studied in few-shot object counting and detection (FSCD), previous methods and their benchmarks have narrowed patterns of interest to object categories and often fail to localize non-object patterns. In this work, we propose a simple yet effective detector based on template matching and regression, dubbed TMR. While previous FSCD methods typically represent target exemplars as spatially collapsed prototypes and lose structural information, we revisit classic template matching and regression. It effectively preserves and leverages the spatial layout of exemplars through a minimalistic structure with a small number of learnable convolutional or projection layers on top of a frozen backbone. W e also introduce a new dataset, dubbed RPINE, which covers a wider range of patterns than existing object-centric datasets. Our method outperforms the state-of-the-art methods on the three benchmarks, RPINE, FSCD-147, and FSCD-LVIS, and demonstrates strong generalization in cross-dataset evaluation.


An ablation study over different model architectures (Table (a)) shows that the chosen

Neural Information Processing Systems

FB15k's lack of hierarchy offers no advantage to hyperbolic embeddings, but its large number MuRP does not also set out to include MTL, but we hope to address this in future work. We will include all recommendations, e.g. However, we agree that it is important to compare models across a range of dimensionalities. Note that for MuRP with biases replaced by (transformed) norms, performance reduces (e.g. Multi-relational transforms and Justification for architecture: See "Architecture ablation study".



The 9th AI City Challenge

arXiv.org Artificial Intelligence

The ninth AI City Challenge continues to advance real-world applications of computer vision and AI in transportation, industrial automation, and public safety. The 2025 edition featured four tracks and saw a 17% increase in participation, with 245 teams from 15 countries registered on the evaluation server. Public release of challenge datasets led to over 30,000 downloads to date. Track 1 focused on multi-class 3D multi-camera tracking, involving people, humanoids, autonomous mobile robots, and forklifts, using detailed calibration and 3D bounding box annotations. Track 2 tackled video question answering in traffic safety, with multi-camera incident understanding enriched by 3D gaze labels. Track 3 addressed fine-grained spatial reasoning in dynamic warehouse environments, requiring AI systems to interpret RGB-D inputs and answer spatial questions that combine perception, geometry, and language. Both Track 1 and Track 3 datasets were generated in NVIDIA Omniverse. Track 4 emphasized efficient road object detection from fisheye cameras, supporting lightweight, real-time deployment on edge devices. The evaluation framework enforced submission limits and used a partially held-out test set to ensure fair benchmarking. Final rankings were revealed after the competition concluded, fostering reproducibility and mitigating overfitting. Several teams achieved top-tier results, setting new benchmarks in multiple tasks.



Supplementary Materials of Decoupling Features in Hierarchical Propagation for Video Object Segmentation

Neural Information Processing Systems

The optimization strategies and related hyper-parameters are also the same as AOT. The loss function is a 0.5:0.5 combination of BCE loss [ Such a process is necessary to keep enough long-term information and avoid facing out of memory when inferring long videos. The longest video in VOT 2020 contains 1,500 frames. We compare our DeAOT with more VOS methods in Table 2 and 1. VOS cases, including similar objects, occlusion, fast motion, motion blur, etc. A.4 Border Impact and Limitations The proposed DeAOT framework significantly improves VOS's performance, robustness, and robustness. As to limitations, the scenarios with multiple similar objects and severe occlusions are still very challenging for DeAOT and other VOS solutions.


Appendix of GLIPv2: Unifying Localization and Vision-Language Understanding

Neural Information Processing Systems

In Section 1, we provide more visualizations of our model's predictions on various localization and VL understanding tasks. In Section 2, we describe all our evaluated tasks and their dataset in detail. In Section 8, we give out a comparison for the model's inference speed. It has about 900k bounding box annotations for 80 object categories, with about 7.3 We predict use any-box protocol specified in MDETR. L VIS uses the same images as in COCO, re-annotated with more object categories.