
Learning from Rich Semantics and Coarse Locations for Long-tailed Object Detection

Neural Information Processing Systems

A simple and effective way to improve long-tailed object detection (LTOD) is to use extra data to increase the number of training samples for tail classes. However, collecting bounding box annotations, especially for rare categories, is costly and tedious. Therefore, previous studies resort to datasets with image-level labels to enrich the samples for rare classes by exploring image-level semantics (as shown in Figure 1 (a)). While appealing, directly learning from such data to benefit detection is challenging, since the data lack the bounding box annotations that are essential for object detection.
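The abstract describes supervising a detector with image-level labels alone. One common way to do this (not necessarily the paper's method) is a multiple-instance-learning loss: per-proposal class scores are max-pooled to an image-level prediction and scored against the image-level labels. A minimal sketch, with all names hypothetical:

```python
import numpy as np

def image_level_loss(proposal_scores, image_labels):
    """MIL-style weak supervision: aggregate per-proposal class scores
    to an image-level prediction, then score it against image-level labels.

    proposal_scores: (num_proposals, num_classes) sigmoid scores in [0, 1]
    image_labels:    (num_classes,) binary image-level labels
    """
    # A class is "present" if at least one proposal fires on it:
    # max-pool over proposals gives the image-level score per class.
    image_scores = proposal_scores.max(axis=0)
    eps = 1e-7
    image_scores = np.clip(image_scores, eps, 1 - eps)
    # Binary cross-entropy against the image-level labels.
    bce = -(image_labels * np.log(image_scores)
            + (1 - image_labels) * np.log(1 - image_scores))
    return bce.mean()

# Two proposals, three classes; the image is labelled with class 0 only.
scores = np.array([[0.9, 0.1, 0.2],
                   [0.3, 0.2, 0.1]])
labels = np.array([1.0, 0.0, 0.0])
loss = image_level_loss(scores, labels)
```

This only supervises class presence, not localization, which is exactly the gap the abstract points out.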



Open-Vocabulary Object Detection via Language Hierarchy

Neural Information Processing Systems

Recent studies on generalizable object detection have attracted increasing attention, drawing additional weak supervision from large-scale datasets with image-level labels. However, weakly-supervised detection learning often suffers from image-to-box label mismatch, i.e., image-level labels do not convey precise object information. We design Language Hierarchical Self-training (LHST), which introduces language hierarchy into weakly-supervised detector training for learning more generalizable detectors. LHST expands the image-level labels with language hierarchy and enables co-regularization between the expanded labels and self-training. Specifically, the expanded labels regularize self-training by providing richer supervision and mitigating the image-to-box label mismatch, while self-training allows assessing and selecting the expanded labels according to their predicted reliability. In addition, we design language hierarchical prompt generation, which introduces language hierarchy into prompt generation and helps bridge the vocabulary gaps between training and testing. Extensive experiments show that the proposed techniques achieve superior generalization performance consistently across 14 widely studied object detection datasets.
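The co-regularization described above pairs two mechanisms: expanding labels along a hierarchy, and filtering self-training predictions by reliability. A toy sketch of that interplay, assuming a hypothetical WordNet-like hierarchy and a simple confidence threshold (the paper's actual reliability measure may differ):

```python
# Toy hierarchy: each label maps to its ancestors (hypothetical example).
HIERARCHY = {
    "terrier": ["dog", "animal"],
    "dog": ["animal"],
    "animal": [],
}

def expand_labels(image_labels):
    """Expand image-level labels with all ancestors in the hierarchy,
    so a prediction at any level of the hierarchy can be supervised."""
    expanded = set(image_labels)
    for lbl in image_labels:
        expanded.update(HIERARCHY.get(lbl, []))
    return expanded

def select_pseudo_labels(predictions, expanded, threshold=0.5):
    """Keep self-training predictions that are both reliable (score
    above threshold) and consistent with the expanded labels."""
    return [(cls, score) for cls, score in predictions
            if cls in expanded and score >= threshold]

expanded = expand_labels({"terrier"})
preds = [("dog", 0.8), ("cat", 0.9), ("dog", 0.3)]
kept = select_pseudo_labels(preds, expanded)
```

The expanded labels rule out inconsistent classes ("cat") even at high confidence, while the threshold drops unreliable but consistent ones.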



Through the Looking Glass: A Dual Perspective on Weakly-Supervised Few-Shot Segmentation

Ma, Jiaqi, Xie, Guo-Sen, Zhao, Fang, Li, Zechao

arXiv.org Artificial Intelligence

Meta-learning typically samples homogeneous support-query pairs, characterized by the same categories and similar attributes, and extracts useful inductive biases through identical network architectures. However, this identical network design results in over-homogenized semantics. To address this, we propose a novel homologous-but-heterogeneous network. By treating support-query pairs as dual perspectives, we introduce heterogeneous visual aggregation (HA) modules to enhance complementarity while preserving semantic commonality. To further reduce semantic noise and amplify the uniqueness of heterogeneous semantics, we design a heterogeneous transfer (HT) module. Finally, we propose heterogeneous CLIP (HC) textual information to enhance the generalization capability of multimodal models. In the weakly-supervised few-shot semantic segmentation (WFSS) task, with only 1/24 of the parameters of existing state-of-the-art models, TLG achieves a 13.2% improvement on Pascal-5^i and a 9.7% improvement on COCO-20^i. To the best of our knowledge, TLG is also the first weakly supervised (image-level) model that outperforms fully supervised (pixel-level) models under the same backbone architectures. The code is available at https://github.com/jarch-ma/TLG.
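The abstract contrasts identical versus heterogeneous processing of support and query. A minimal illustration of the general idea, with both branch designs and the fusion entirely hypothetical (the paper's HA module is certainly more elaborate): give support and query different aggregation operators, then compare the resulting views in a shared space.

```python
import numpy as np

rng = np.random.default_rng(0)

def support_branch(feats):
    # Average pooling: a coarse, prototype-like summary of the support set.
    return feats.mean(axis=0)

def query_branch(feats):
    # Max pooling: emphasises the most discriminative query locations.
    return feats.max(axis=0)

def heterogeneous_aggregate(support_feats, query_feats):
    """Fuse the two heterogeneous views while keeping their shared
    semantics: cosine similarity between the branch outputs."""
    s = support_branch(support_feats)
    q = query_branch(query_feats)
    return float(s @ q / (np.linalg.norm(s) * np.linalg.norm(q)))

sup = rng.normal(size=(16, 8))   # 16 support feature vectors, dim 8
qry = rng.normal(size=(16, 8))   # 16 query feature vectors, dim 8
sim = heterogeneous_aggregate(sup, qry)
```

The point of the sketch is only the asymmetry: the two perspectives pass through different operators before being compared, rather than an identical pipeline.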




FMaMIL: Frequency-Driven Mamba Multi-Instance Learning for Weakly Supervised Lesion Segmentation in Medical Images

Cheng, Hangbei, Dong, Xiaorong, Liu, Xueyu, Zhang, Jianan, Ma, Xuetao, Wei, Mingqiang, Wang, Liansheng, Chen, Junxin, Wu, Yongfei

arXiv.org Artificial Intelligence

Accurate lesion segmentation in histopathology images is essential for diagnostic interpretation and quantitative analysis, yet it remains challenging due to the limited availability of costly pixel-level annotations. To address this, we propose FMaMIL, a novel two-stage framework for weakly supervised lesion segmentation based solely on image-level labels. In the first stage, a lightweight Mamba-based encoder is introduced to capture long-range dependencies across image patches under the MIL paradigm. To enhance spatial sensitivity and structural awareness, we design a learnable frequency-domain encoding module that supplements spatial-domain features with spectrum-based information. Class activation maps (CAMs) generated in this stage are used to guide segmentation training. In the second stage, we refine the initial pseudo labels via CAM-guided soft-label supervision and a self-correction mechanism, enabling robust training even under label noise. Extensive experiments on both public and private histopathology datasets demonstrate that FMaMIL outperforms state-of-the-art weakly supervised methods without relying on pixel-level annotations, validating its effectiveness and potential for digital pathology applications.
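The second stage above refines CAM-derived pseudo labels with soft supervision and self-correction. A minimal sketch of that general pattern (thresholds, the masking scheme, and the disagreement test are all assumptions, not the paper's actual formulation): keep confident CAM regions as soft targets, and drop labels the current model strongly contradicts.

```python
import numpy as np

def cam_to_soft_labels(cam, lo=0.3, hi=0.7):
    """Turn a class activation map into soft pseudo labels: confident
    foreground (>= hi) and background (<= lo) keep their CAM value as
    a soft target; ambiguous pixels in between are masked out (NaN)."""
    return np.where((cam >= hi) | (cam <= lo), cam, np.nan)

def self_correct(soft, predictions, margin=0.4):
    """Drop pseudo labels that the current model strongly disagrees
    with -- a simple stand-in for a self-correction mechanism."""
    disagree = np.abs(predictions - soft) > margin
    corrected = soft.copy()
    corrected[disagree] = np.nan
    return corrected

cam = np.array([0.9, 0.5, 0.1, 0.8])
soft = cam_to_soft_labels(cam)            # pixel 1 is ambiguous -> masked
preds = np.array([0.85, 0.4, 0.6, 0.75])
labels = self_correct(soft, preds)        # pixel 2 contradicted -> masked
```

Training would then only back-propagate through the non-masked pixels, which is how noisy pseudo labels are tolerated.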



Completely Weakly Supervised Class-Incremental Learning for Semantic Segmentation

Kim, David Minkwan, Lee, Soeun, Kang, Byeongkeun

arXiv.org Artificial Intelligence

This work addresses the task of completely weakly supervised class-incremental learning for semantic segmentation, learning to segment both base and additional novel classes using only image-level labels. While class-incremental semantic segmentation (CISS) is crucial for handling diverse and newly emerging objects in the real world, traditional CISS methods require expensive pixel-level annotations for training. To overcome this limitation, partially weakly-supervised approaches have recently been proposed; to the best of our knowledge, however, this is the first work to introduce a completely weakly-supervised method for CISS. To achieve this, we propose generating robust pseudo-labels by combining pseudo-labels from a localizer and a sequence of foundation models based on their uncertainty. Moreover, to mitigate catastrophic forgetting, we introduce an exemplar-guided data augmentation method that generates diverse images containing both previous and novel classes. Finally, we conduct experiments in three common experimental settings (15-5 VOC, 10-10 VOC, and COCO-to-VOC) and in two scenarios (disjoint and overlap). The experimental results demonstrate that our completely weakly supervised method outperforms even partially weakly supervised methods in the 15-5 VOC and 10-10 VOC settings, while achieving competitive accuracy in the COCO-to-VOC setting.
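The pseudo-label combination described above fuses a localizer's predictions with foundation-model predictions based on uncertainty. One simple instantiation of that idea (an illustration, not the paper's actual rule) is to take, per pixel, the label from whichever source is more confident:

```python
import numpy as np

def combine_pseudo_labels(localizer_probs, foundation_probs):
    """Per-pixel, keep the label from whichever source is less
    uncertain (higher max class probability) -- a minimal
    uncertainty-based fusion of two pseudo-label sources.

    Both inputs: (num_pixels, num_classes) class probability maps.
    """
    loc_conf = localizer_probs.max(axis=1)
    fnd_conf = foundation_probs.max(axis=1)
    use_loc = loc_conf >= fnd_conf
    return np.where(use_loc,
                    localizer_probs.argmax(axis=1),
                    foundation_probs.argmax(axis=1))

# Two pixels, two classes: the localizer is confident on pixel 0,
# the foundation model on pixel 1.
loc = np.array([[0.9, 0.1], [0.4, 0.6]])
fnd = np.array([[0.6, 0.4], [0.1, 0.9]])
labels = combine_pseudo_labels(loc, fnd)
```

Confidence of the most likely class is only one possible uncertainty proxy; entropy of the distribution would be a natural alternative under the same structure.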