Goto

Collaborating Authors

 detection and segmentation


MUSE: Model-based Uncertainty-aware Similarity Estimation for zero-shot 2D Object Detection and Segmentation

arXiv.org Artificial Intelligence

In this work, we introduce MUSE (Model-based Uncertainty-aware Similarity Estimation), a training-free framework designed for model-based zero-shot 2D object detection and segmentation. MUSE leverages 2D multi-view templates rendered from 3D unseen objects and 2D object proposals extracted from input query images. In the embedding stage, it integrates class and patch embeddings, where the patch embeddings are normalized using generalized mean pooling (GeM) to capture both global and local representations efficiently. During the matching stage, MUSE employs a joint similarity metric that combines absolute and relative similarity scores, enhancing the robustness of matching under challenging scenarios. Finally, the similarity score is refined through an uncertainty-aware object prior that adjusts for proposal reliability. Without any additional training or fine-tuning, MUSE achieves state-of-the-art performance on the BOP Challenge 2025, ranking first across the Classic Core, H3, and Industrial tracks. These results demonstrate that MUSE offers a powerful and generalizable framework for zero-shot 2D object detection and segmentation.


MRI Brain Tumor Detection with Computer Vision

arXiv.org Artificial Intelligence

This study explores the application of deep learning techniques in the automated detection and segmentation of brain tumors from MRI scans. We employ several machine learning models, including basic logistic regression, Convolutional Neural Networks (CNNs), and Residual Networks (ResNet) to classify brain tumors effectively. Additionally, we investigate the use of U-Net for semantic segmentation and EfficientDet for anchor-based object detection to enhance the localization and identification of tumors. Our results demonstrate promising improvements in the accuracy and efficiency of brain tumor diagnostics, underscoring the potential of deep learning in medical imaging and its significance in improving clinical outcomes.


Real-Time Threaded Houbara Detection and Segmentation for Wildlife Conservation using Mobile Platforms

arXiv.org Artificial Intelligence

Real-time animal detection and segmentation in natural environments are vital for wildlife conservation, enabling non-invasive monitoring through remote camera streams. However, these tasks remain challenging due to limited computational resources and the cryptic appearance of many species. We propose a mobile-optimized two-stage deep learning framework that integrates a Threading Detection Model (TDM) to parallelize YOLOv10-based detection and MobileSAM-based segmentation. Unlike prior YOLO+SAM pipelines, our approach improves real-time performance by reducing latency through threading. YOLOv10 handles detection while MobileSAM performs lightweight segmentation, both executed concurrently for efficient resource use. On the cryptic Houbara Bustard, a conservation-priority species, our model achieves mAP50 of 0.9627, mAP75 of 0.7731, mAP95 of 0.7178, and a MobileSAM mIoU of 0.7421. YOLOv10 operates at 43.7 ms per frame, confirming real-time readiness. We introduce a curated Houbara dataset of 40,000 annotated images to support model training and evaluation across diverse conditions. The code and dataset used in this study are publicly available on GitHub at https://github.com/LyesSaadSaoud/mobile-houbara-detseg. For interactive demos and additional resources, visit https://lyessaadsaoud.github.io/LyesSaadSaoud-Threaded-YOLO-SAM-Houbara.


Weakly Supervised Intracranial Aneurysm Detection and Segmentation in MR angiography via Multi-task UNet with Vesselness Prior

arXiv.org Artificial Intelligence

Intracranial aneurysms (IAs) are abnormal dilations of cerebral blood vessels that, if ruptured, can lead to life-threatening consequences. However, their small size and soft contrast in radiological scans often make it difficult to perform accurate and efficient detection and morphological analyses, which are critical in the clinical care of the disorder . Furthermore, the lack of large public datasets with voxel-wise expert annotations pose challenges for developing deep learning algorithms to address the issues. Therefore, we proposed a novel weakly supervised 3D multi-task UNet that integrates vesselness priors to jointly perform aneurysm detection and segmentation in time-of-flight MR angiography (TOF-MRA). Specifically, to robustly guide IA detection and segmentation, we employ the popular Frangi's vesselness filter to derive soft cerebrovascular priors for both network input and an attention block to conduct segmentation from the decoder and detection from an auxiliary branch. W e train our model on the Lausanne dataset with coarse ground truth segmentation, and evaluate it on the test set with refined labels from the same database. T o further assess our model's generalizability, we also validate it externally on the ADAM dataset. Our results demonstrate the superior performance of the proposed technique over the SOTA techniques for aneurysm segmentation (Dice = 0.614, 95%HD =1.38mm) and detection (false positive rate = 1.47, sensitivity = 92.9%).


A Novel Unified Architecture for Low-Shot Counting by Detection and Segmentation

Neural Information Processing Systems

Low-shot object counters estimate the number of objects in an image using few or no annotated exemplars. Objects are localized by matching them to prototypes, which are constructed by unsupervised image-wide object appearance aggregation.Due to potentially diverse object appearances, the existing approaches often lead to overgeneralization and false positive detections.Furthermore, the best-performing methods train object localization by a surrogate loss, that predicts a unit Gaussian at each object center. This loss is sensitive to annotation error, hyperparameters and does not directly optimize the detection task, leading to suboptimal counts.We introduce GeCo, a novel low-shot counter that achieves accurate object detection, segmentation, and count estimation in a unified architecture.GeCo robustly generalizes the prototypes across objects appearances through a novel dense object query formulation. In addition, a novel counting loss is proposed, that directly optimizes the detection task and avoids the issues of the standard surrogate loss. GeCo surpasses the leading few-shot detection-based counters by \sim 25\% in the total count MAE, achieves superior detection accuracy and sets a new solid state-of-the-art result across all low-shot counting setups.


Extremely Simple Multimodal Outlier Synthesis for Out-of-Distribution Detection and Segmentation

arXiv.org Artificial Intelligence

Out-of-distribution (OOD) detection and segmentation are crucial for deploying machine learning models in safety-critical applications such as autonomous driving and robot-assisted surgery. While prior research has primarily focused on unimodal image data, real-world applications are inherently multimodal, requiring the integration of multiple modalities for improved OOD detection. A key challenge is the lack of supervision signals from unknown data, leading to overconfident predictions on OOD samples. To address this challenge, we propose Feature Mixing, an extremely simple and fast method for multimodal outlier synthesis with theoretical support, which can be further optimized to help the model better distinguish between in-distribution (ID) and OOD data. Feature Mixing is modality-agnostic and applicable to various modality combinations. Additionally, we introduce CARLA-OOD, a novel multimodal dataset for OOD segmentation, featuring synthetic OOD objects across diverse scenes and weather conditions. Extensive experiments on SemanticKITTI, nuScenes, CARLA-OOD datasets, and the MultiOOD benchmark demonstrate that Feature Mixing achieves state-of-the-art performance with a $10 \times$ to $370 \times$ speedup. Our source code and dataset will be available at https://github.com/mona4399/FeatureMixing.


PolypSegTrack: Unified Foundation Model for Colonoscopy Video Analysis

arXiv.org Artificial Intelligence

Early detection, accurate segmentation, classification and tracking of polyps during colonoscopy are critical for preventing colorectal cancer. Many existing deep-learning-based methods for analyzing colonoscopic videos either require task-specific fine-tuning, lack tracking capabilities, or rely on domain-specific pre-training. In this paper, we introduce PolypSegTrack, a novel foundation model that jointly addresses polyp detection, segmentation, classification and unsupervised tracking in colonoscopic videos. Our approach leverages a novel conditional mask loss, enabling flexible training across datasets with either pixel-level segmentation masks or bounding box annotations, allowing us to bypass task-specific fine-tuning. Our unsupervised tracking module reliably associates polyp instances across frames using object queries, without relying on any heuristics. We leverage a robust vision foundation model backbone that is pre-trained unsupervisedly on natural images, thereby removing the need for domain-specific pre-training. Extensive experiments on multiple polyp benchmarks demonstrate that our method significantly outperforms existing state-of-the-art approaches in detection, segmentation, classification, and tracking.


ColonNet: A Hybrid Of DenseNet121 And U-NET Model For Detection And Segmentation Of GI Bleeding

arXiv.org Artificial Intelligence

This study presents an integrated deep learning model for automatic detection and classification of Gastrointestinal bleeding in the frames extracted from Wireless Capsule Endoscopy (WCE) videos. The dataset has been released as part of Auto-WCBleedGen Challenge Version V2 hosted by the MISAHUB team. Our model attained the highest performance among 75 teams that took part in this competition. It aims to efficiently utilizes CNN based model i.e. DenseNet and UNet to detect and segment bleeding and non-bleeding areas in the real-world complex dataset. The model achieves an impressive overall accuracy of 80% which would surely help a skilled doctor to carry out further diagnostics.


Deep Learning for Surgical Instrument Recognition and Segmentation in Robotic-Assisted Surgeries: A Systematic Review

arXiv.org Artificial Intelligence

Applying deep learning (DL) for annotating surgical instruments in robot-assisted minimally invasive surgeries (MIS) represents a significant advancement in surgical technology. This systematic review examines 48 studies that and advanced DL methods and architectures. These sophisticated DL models have shown notable improvements in the precision and efficiency of detecting and segmenting surgical tools. The enhanced capabilities of these models support various clinical applications, including real-time intraoperative guidance, comprehensive postoperative evaluations, and objective assessments of surgical skills. By accurately identifying and segmenting surgical instruments in video data, DL models provide detailed feedback to surgeons, thereby improving surgical outcomes and reducing complication risks. Furthermore, the application of DL in surgical education is transformative. The review underscores the significant impact of DL on improving the accuracy of skill assessments and the overall quality of surgical training programs. However, implementing DL in surgical tool detection and segmentation faces challenges, such as the need for large, accurately annotated datasets to train these models effectively. The manual annotation process is labor-intensive and time-consuming, posing a significant bottleneck. Future research should focus on automating the detection and segmentation process and enhancing the robustness of DL models against environmental variations. Expanding the application of DL models across various surgical specialties will be essential to fully realize this technology's potential. Integrating DL with other emerging technologies, such as augmented reality (AR), also offers promising opportunities to further enhance the precision and efficacy of surgical procedures.


Raspberry PhenoSet: A Phenology-based Dataset for Automated Growth Detection and Yield Estimation

arXiv.org Artificial Intelligence

The future of the agriculture industry is intertwined with automation. Accurate fruit detection, yield estimation, and harvest time estimation are crucial for optimizing agricultural practices. These tasks can be carried out by robots to reduce labour costs and improve the efficiency of the process. To do so, deep learning models should be trained to perform knowledge-based tasks, which outlines the importance of contributing valuable data to the literature. In this paper, we introduce Raspberry PhenoSet, a phenology-based dataset designed for detecting and segmenting raspberry fruit across seven developmental stages. To the best of our knowledge, Raspberry PhenoSet is the first fruit dataset to integrate biology-based classification with fruit detection tasks, offering valuable insights for yield estimation and precise harvest timing. This dataset contains 1,853 high-resolution images, the highest quality in the literature, captured under controlled artificial lighting in a vertical farm. The dataset has a total of 6,907 instances of mask annotations, manually labelled to reflect the seven phenology stages. We have also benchmarked Raspberry PhenoSet using several state-of-the-art deep learning models, including YOLOv8, YOLOv10, RT-DETR, and Mask R-CNN, to provide a comprehensive evaluation of their performance on the dataset. Our results highlight the challenges of distinguishing subtle phenology stages and underscore the potential of Raspberry PhenoSet for both deep learning model development and practical robotic applications in agriculture, particularly in yield prediction and supply chain management. The dataset and the trained models are publicly available for future studies.