clutter
Out-of-Distribution Radar Detection with Complex VAEs: Theory, Whitening, and ANMF Fusion
Rouzoumka, Yadang Alexis, Pinsolle, Jean, Terreaux, Eugénie, Morisseau, Christèle, Ovarlez, Jean-Philippe, Ren, Chengfang
We investigate the detection of weak complex-valued signals immersed in non-Gaussian, range-varying interference, with emphasis on maritime radar scenarios. The proposed methodology exploits a Complex-valued Variational AutoEncoder (CVAE) trained exclusively on clutter-plus-noise to perform Out-Of-Distribution detection. By operating directly on in-phase / quadrature samples, the CVAE preserves phase and Doppler structure and is assessed in two configurations: (i) using unprocessed range profiles and (ii) after local whitening, where per-range covariance estimates are obtained from neighboring profiles. Using extensive simulations together with real sea-clutter data from the CSIR maritime dataset, we benchmark performance against classical and adaptive detectors (MF, NMF, AMF-SCM, ANMF-SCM, ANMF-Tyler). In both configurations, the CVAE yields a higher detection probability Pd at matched false-alarm rate Pfa, with the most notable improvements observed under whitening. We further integrate the CVAE with the ANMF through a weighted log-p fusion rule at the decision level, attaining enhanced robustness in strongly non-Gaussian clutter and enabling empirically calibrated Pfa control under H0. Overall, the results demonstrate that statistical normalization combined with complex-valued generative modeling substantively improves detection in realistic sea-clutter conditions, and that the fused CVAE-ANMF scheme constitutes a competitive alternative to established model-based detectors.
- Europe > France (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Italy > Lazio > Rome (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Data Science (0.93)
- Information Technology > Artificial Intelligence > Natural Language (0.93)
Invariance Co-training for Robot Visual Generalization
Yang, Jonathan, Finn, Chelsea, Sadigh, Dorsa
Abstract-- Reasoning from diverse observations is a fundamental capability for generalist robot policies to operate in a wide range of environments. Despite recent advancements, many large-scale robotic policies still remain sensitive to key sources of observational variation--such as changes in camera perspective, lighting, and the presence of distractor objects. We posit that the limited generalizability of these models arises from the substantial diversity required to robustly cover these quasistatic axes, coupled with the current scarcity of large-scale robotic datasets that exhibit rich variation across them. In this work, we propose to systematically examine what robots need to generalize across these challenging axes by introducing two key auxiliary tasks--state similarity and invariance to observational perturbations--applied to both demonstration data and static visual data. We then show that via these auxiliary tasks, leveraging both more-expensive robotic demonstration data and less-expensive, visually rich synthetic images generated from non-physics-based simulation (e.g., Unreal Engine) can lead to substantial increases in generalization to unseen camera viewpoints, lighting configurations, and distractor conditions. Our results demonstrate that co-training on this diverse data improves performance by 18% over existing generative augmentation methods. Robotic foundation models have shown impressive progress in generalizing to everyday scenarios by leveraging large-scale datasets spanning multiple embodiments, environments, and tasks [1], [2]. However, despite their breadth, the resulting models often remain brittle in real-world settings--failing to handle unseen spatial configurations of objects or adapt to drastic visual changes such as lighting and viewpoint shifts. We hypothesize that the brittleness of current robotic policies stems from insufficient coverage of key observational factors during training. For example, many large-scale datasets provide only one or two third-person perspectives per scene, limiting robustness to viewpoint shifts.
CycliST: A Video Language Model Benchmark for Reasoning on Cyclical State Transitions
Kohaut, Simon, Ochs, Daniel, Zhang, Shun, Flade, Benedict, Eggert, Julian, Kersting, Kristian, Dhami, Devendra Singh
We present CycliST, a novel benchmark dataset designed to evaluate Video Language Models (VLM) on their ability for textual reasoning over cyclical state transitions. CycliST captures fundamental aspects of real-world processes by generating synthetic, richly structured video sequences featuring periodic patterns in object motion and visual attributes. CycliST employs a tiered evaluation system that progressively increases difficulty through variations in the number of cyclic objects, scene clutter, and lighting conditions, challenging state-of-the-art models on their spatio-temporal cognition. We conduct extensive experiments with current state-of-the-art VLMs, both open-source and proprietary, and reveal their limitations in generalizing to cyclical dynamics such as linear and orbital motion, as well as time-dependent changes in visual attributes like color and scale. Our results demonstrate that present-day VLMs struggle to reliably detect and exploit cyclic patterns, lack a notion of temporal understanding, and are unable to extract quantitative insights from scenes, such as the number of objects in motion, highlighting a significant technical gap that needs to be addressed. More specifically, we find no single model consistently leads in performance: neither size nor architecture correlates strongly with outcomes, and no model succeeds equally well across all tasks. By providing a targeted challenge and a comprehensive evaluation framework, CycliST paves the way for visual reasoning models that surpass the state-of-the-art in understanding periodic patterns.
- Europe > Germany > Hesse > Darmstadt Region > Darmstadt (0.04)
- Europe > Netherlands > North Brabant > Eindhoven (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- (5 more...)
Gentle Object Retraction in Dense Clutter Using Multimodal Force Sensing and Imitation Learning
Brouwer, Dane, Citron, Joshua, Nolte, Heather, Bohg, Jeannette, Cutkosky, Mark
Dense collections of movable objects are common in everyday spaces-from cabinets in a home to shelves in a warehouse. Safely retracting objects from such collections is difficult for robots, yet people do it frequently, leveraging learned experience in tandem with vision and non-prehensile tactile sensing on the sides and backs of their hands and arms. We investigate the role of contact force sensing for training robots to gently reach into constrained clutter and extract objects. The available sensing modalities are (1) "eye-in-hand" vision, (2) proprioception, (3) non-prehensile triaxial tactile sensing, (4) contact wrenches estimated from joint torques, and (5) a measure of object acquisition obtained by monitoring the vacuum line of a suction cup. We use imitation learning to train policies from a set of demonstrations on randomly generated scenes, then conduct an ablation study of wrench and tactile information. We evaluate each policy's performance across 40 unseen environment configurations. Policies employing any force sensing show fewer excessive force failures, an increased overall success rate, and faster completion times. The best performance is achieved using both tactile and wrench information, producing an 80% improvement above the baseline without force information.
Improving Robotic Manipulation Robustness via NICE Scene Surgery
Pakdamansavoji, Sajjad, Pourkeshavarz, Mozhgan, Sigal, Adam, Li, Zhiyuan, Yang, Rui Heng, Rasouli, Amir
Learning robust visuomotor policies for robotic manipulation remains a challenge in real-world settings, where visual distractors can significantly degrade performance and safety. In this work, we propose an effective and scalable framework, Naturalistic Inpainting for Context Enhancement (NICE). Our method minimizes out-of-distribution (OOD) gap in imitation learning by increasing visual diversity through construction of new experiences using existing demonstrations. By utilizing image generative frameworks and large language models, NICE performs three editing operations, object replacement, restyling, and removal of distracting (non-target) objects. These changes preserve spatial relationships without obstructing target objects and maintain action-label consistency. Unlike previous approaches, NICE requires no additional robot data collection, simulator access, or custom model training, making it readily applicable to existing robotic datasets. Using real-world scenes, we showcase the capability of our framework in producing photo-realistic scene enhancement. For downstream tasks, we use NICE data to finetune a vision-language model (VLM) for spatial affordance prediction and a vision-language-action (VLA) policy for object manipulation. Our evaluations show that NICE successfully minimizes OOD gaps, resulting in over 20% improvement in accuracy for affordance prediction in highly cluttered scenes. For manipulation tasks, success rate increases on average by 11% when testing in environments populated with distractors in different quantities. Furthermore, we show that our method improves visual robustness, lowering target confusion by 6%, and enhances safety by reducing collision rate by 7%.
Distracted Robot: How Visual Clutter Undermine Robotic Manipulation
Rasouli, Amir, Alban, Montgomery, Pakdamansavoji, Sajjad, Li, Zhiyuan, Zhang, Zhanguang, Wu, Aaron, Zhao, Xuan
In this work, we propose an evaluation protocol for examining the performance of robotic manipulation policies in cluttered scenes. Contrary to prior works, we approach evaluation from a psychophysical perspective, therefore we use a unified measure of clutter that accounts for environmental factors as well as the distractors quantity, characteristics, and arrangement. Using this measure, we systematically construct evaluation scenarios in both hyper-realistic simulation and real-world and conduct extensive experimentation on manipulation policies, in particular vision-language-action (VLA) models. Our experiments highlight the significant impact of scene clutter, lowering the performance of the policies, by as much as 34% and show that despite achieving similar average performance across the tasks, different VLA policies have unique vulnerabilities and a relatively low agreement on success scenarios. We further show that our clutter measure is an effective indicator of performance degradation and analyze the impact of distractors in terms of their quantity and occluding influence. At the end, we show that finetuning on enhanced data, although effective, does not equally remedy all negative impacts of clutter on performance.
Latent-space metrics for Complex-Valued VAE out-of-distribution detection under radar clutter
Rouzoumka, Y. A., Terreaux, E., Morisseau, C., Ovarlez, J. -P., Ren, C.
We therefore pursue a data-driven alternative based on complex-valued V AEs and latent-space OOD scores. In recent years, data-driven approaches have emerged to alleviate the need for precise clutter modeling. Among them, V AEs [4] have demonstrated promising capabilities for anomaly and OOD detection in diverse applications, including radar detection [5], speech enhancement [6], medical imaging [7], industrial monitoring [8], and acoustic signal analysis [9]. These models learn a latent representation of the training data and use reconstruction or probabilistic criteria to detect deviations. Despite their effectiveness, most V AE-based detectors operate in the real domain and often treat complex-valued radar data by separating real and imaginary components into distinct channels. Recent advances in Complex-V alued Neural Networks (CVNNs) have shown the benefits of directly modeling complex-valued signals [10, 11].
- Europe > France (0.05)
- North America > United States (0.04)
- Asia > South Korea > Seoul > Seoul (0.04)
- (2 more...)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
- Europe > Switzerland > Zürich > Zürich (0.14)
- Asia > Singapore (0.04)
- North America > United States > California > Los Angeles County > Long Beach (0.04)
- North America > United States > California > Santa Barbara County > Santa Barbara (0.04)
- Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
ClutterNav: Gradient-Guided Search for Efficient 3D Clutter Removal with Learned Costmaps
Ravie, Navin Sriram, M, Keerthi Vasan, Sebastian, Bijo
Dense clutter removal for target object retrieval presents a challenging problem, especially when targets are embedded deep within densely-packed configurations. It requires foresight to minimize overall changes to the clutter configuration while accessing target objects, avoiding stack destabilization and reducing the number of object removals required. Rule-based planners when applied to this problem, rely on rigid heuristics, leading to high computational overhead. End-to-end reinforcement learning approaches struggle with interpretability and generalizability over different conditions. To address these issues, we present ClutterNav, a novel decision-making framework that can identify the next best object to be removed so as to access a target object in a given clutter, while minimising stack disturbances. ClutterNav formulates the problem as a continuous reinforcement learning task, where each object removal dynamically updates the understanding of the scene. A removability critic, trained from demonstrations, estimates the cost of removing any given object based on geometric and spatial features. This learned cost is complemented by integrated gradients that assess how the presence or removal of surrounding objects influences the accessibility of the target. By dynamically prioritizing actions that balance immediate removability against long-term target exposure, ClutterNav achieves near human-like strategic sequencing, without predefined heuristics. The proposed approach is validated extensively in simulation and over real-world experiments. The results demonstrate real-time, occlusion-aware decision-making in partially observable environments.