Gao, Junyu
Scale Efficient Training for Large Datasets
Zhou, Qing, Gao, Junyu, Wang, Qi
The rapid growth of dataset scales has been a key driver in advancing deep learning research. However, as dataset scale increases, the training process becomes increasingly inefficient due to the presence of low-value samples, including excessive redundant samples, overly challenging samples, and inefficient easy samples that contribute little to model improvement. To address this challenge, we propose Scale Efficient Training (SeTa) for large datasets, a dynamic sample pruning approach that losslessly reduces training time. To remove low-value samples, SeTa first performs random pruning to eliminate redundant samples, then clusters the remaining samples according to their learning difficulty, measured by loss. Building upon this clustering, a sliding window strategy progressively removes both overly challenging and inefficient easy clusters following an easy-to-hard curriculum. We conduct extensive experiments on large-scale synthetic datasets, including ToCa, SS1M, and ST+MJ, each containing over 3 million samples. SeTa reduces training costs by up to 50% while maintaining or improving performance, with minimal degradation even at 70% cost reduction. Furthermore, experiments on real datasets of various scales, across diverse backbones (CNNs, Transformers, and Mambas) and tasks (instruction tuning, multi-view stereo, geo-localization, composed image retrieval, and referring image segmentation), demonstrate the effectiveness and universality of our approach. Code is available at https://github.com/mrazhou/SeTa.
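To make the procedure concrete, here is a minimal Python sketch of the three stages the abstract describes (random pruning, loss-based clustering, and an easy-to-hard sliding window). The function name, keep ratio, cluster count, and window parameters are illustrative assumptions, not the released SeTa implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def seta_select(losses, keep_ratio=0.7, n_clusters=10, window=4, step=0):
    """Sketch of SeTa-style dynamic sample pruning (parameters are assumed).

    losses: per-sample training losses from the previous epoch.
    Returns indices of the samples to keep for the next epoch.
    """
    losses = np.asarray(losses, dtype=np.float64)
    n = len(losses)

    # 1) Random pruning removes redundant samples.
    kept = np.random.permutation(n)[: int(n * keep_ratio)]

    # 2) Cluster the survivors by learning difficulty (loss).
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(
        losses[kept].reshape(-1, 1)
    )
    # Order clusters from easy (low mean loss) to hard (high mean loss).
    order = np.argsort(
        [losses[kept][labels == c].mean() for c in range(n_clusters)]
    )

    # 3) Sliding window over the easy-to-hard ordering: as training
    #    proceeds (step grows), inefficient easy clusters are dropped
    #    while overly challenging ones stay beyond the window's far edge.
    start = min(step, n_clusters - window)
    active = order[start : start + window]
    return kept[np.isin(labels, active)]
```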
From Captions to Rewards (CAREVL): Leveraging Large Language Model Experts for Enhanced Reward Modeling in Large Vision-Language Models
Dai, Muzhi, Sun, Jiashuo, Zhao, Zhiyuan, Liu, Shixuan, Li, Rui, Gao, Junyu, Li, Xuelong
Aligning large vision-language models (LVLMs) with human preferences is challenging due to the scarcity of fine-grained, high-quality multimodal preference data that requires no human annotation. Existing methods relying on direct distillation often struggle with low-confidence data, leading to suboptimal performance. To address this, we propose CAREVL, a novel method for preference reward modeling that reliably uses both high- and low-confidence data. First, a cluster of auxiliary expert models (textual reward models) leverages image captions as weak supervision signals to filter high-confidence data. The high-confidence data are then used to fine-tune the LVLM. Second, the low-confidence data are used to generate diverse preference samples with the fine-tuned LVLM. These samples are then scored and selected to construct reliable chosen-rejected pairs for further training. CAREVL achieves performance improvements over traditional distillation-based methods on the VL-RewardBench and MLLM-as-a-Judge benchmarks, demonstrating its effectiveness. The code will be released soon.
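A loose Python sketch of the two-stage data flow described above; the callables, threshold, and pair-mining rule are our own assumptions, not CAREVL's actual interfaces.

```python
from typing import Callable, List, Tuple

def split_and_mine_pairs(
    samples: List[str],
    expert_score: Callable[[str], float],   # caption-based textual reward model
    generate: Callable[[str], List[str]],   # fine-tuned LVLM sampling responses
    score: Callable[[str], float],          # scorer for generated responses
    threshold: float = 0.8,
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Filter high-confidence data and mine chosen-rejected pairs from the rest."""
    high = [s for s in samples if expert_score(s) >= threshold]
    low = [s for s in samples if expert_score(s) < threshold]

    # Low-confidence data seed diverse generations; the best- and
    # worst-scored responses form a chosen-rejected preference pair.
    pairs = []
    for s in low:
        ranked = sorted(generate(s), key=score, reverse=True)
        if len(ranked) >= 2:
            pairs.append((ranked[0], ranked[-1]))
    return high, pairs  # high: fine-tuning data; pairs: reward-training data
```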
Conjugated Semantic Pool Improves OOD Detection with Pre-trained Vision-Language Models
Chen, Mengyuan, Gao, Junyu, Xu, Changsheng
A straightforward pipeline for zero-shot out-of-distribution (OOD) detection involves selecting potential OOD labels from an extensive semantic pool and then leveraging a pre-trained vision-language model to perform classification on both in-distribution (ID) and OOD labels. In this paper, we theorize that enhancing performance requires expanding the semantic pool, while increasing the expected probability of selected OOD labels being activated by OOD samples, and ensuring low mutual dependence among the activations of these OOD labels. A natural expansion manner is to adopt a larger lexicon; however, the inevitable introduction of numerous synonyms and uncommon words fails to meet the above requirements, indicating that viable expansion manners move beyond merely selecting words from a lexicon. Since OOD detection aims to correctly classify input images into ID/OOD class groups, we can "make up" OOD label candidates which are not standard class names but beneficial for the process. Observing that the original semantic pool comprises unmodified specific class names, we correspondingly construct a conjugated semantic pool (CSP) consisting of modified superclass names, each serving as a cluster center for samples sharing similar properties across different categories. Consistent with our established theory, expanding OOD label candidates with the CSP satisfies the requirements and outperforms existing works by 7.89% in FPR95.
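For readers unfamiliar with the underlying pipeline, here is a minimal sketch of zero-shot OOD scoring with OpenAI's CLIP over a combined ID/OOD label set. The OOD labels, the image path, and the decision threshold are illustrative placeholders, not entries from the paper's conjugated semantic pool.

```python
import clip  # pip install git+https://github.com/openai/CLIP.git
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

id_labels = ["cat", "dog"]                        # in-distribution class names
ood_labels = ["textured item", "blurry thing"]    # stand-ins for CSP entries

prompts = [f"a photo of a {c}" for c in id_labels + ood_labels]
text = clip.tokenize(prompts).to(device)
image = preprocess(Image.open("query.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)      # image-text similarity
    probs = logits_per_image.softmax(dim=-1)[0]

# The image is treated as OOD when probability mass shifts to OOD labels.
id_score = probs[: len(id_labels)].sum().item()
is_ood = id_score < 0.5                           # threshold is illustrative
```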
Revisiting Essential and Nonessential Settings of Evidential Deep Learning
Chen, Mengyuan, Gao, Junyu, Xu, Changsheng
Evidential Deep Learning (EDL) is an emerging method for uncertainty estimation that provides reliable predictive uncertainty in a single forward pass, attracting significant attention. Grounded in subjective logic, EDL derives Dirichlet concentration parameters from neural networks to construct a Dirichlet probability density function (PDF), modeling the distribution of class probabilities. Despite its success, EDL incorporates several nonessential settings: In model construction, (1) a commonly ignored prior weight parameter is fixed to the number of classes, while its value actually impacts the balance between the proportion of evidence and its magnitude in deriving predictive scores. In model optimization, (2) the empirical risk features a variance-minimizing optimization term that biases the PDF towards a Dirac delta function, potentially exacerbating overconfidence. (3) Additionally, the structural risk typically includes a KL-divergence-minimizing regularization, whose optimization direction extends beyond the intended purpose and contradicts common sense, diminishing the information carried by the evidence magnitude. Therefore, we propose Re-EDL, a simplified yet more effective variant of EDL, by relaxing the nonessential settings and retaining the essential one, namely, the adoption of projected probability from subjective logic. Specifically, Re-EDL treats the prior weight as an adjustable hyperparameter rather than a fixed scalar, and directly optimizes the expectation of the Dirichlet PDF, deprecating both the variance-minimizing optimization term and the divergence regularization term. Extensive experiments and state-of-the-art performance validate the effectiveness of our method. The source code is available at https://github.com/MengyuanChen21/Re-EDL.
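A minimal PyTorch sketch of our reading of the retained essential setting: the loss optimizes the Dirichlet expectation (projected probability) directly, with a tunable prior weight and no variance or KL terms. The evidence activation and the default prior weight value are assumptions, not taken from the released code.

```python
import torch
import torch.nn.functional as F

def re_edl_loss(logits, targets, prior_weight=10.0):
    """Sketch of a Re-EDL-style objective; activation and W value are assumed.

    logits: (batch, num_classes) raw outputs; targets: (batch,) class labels.
    """
    num_classes = logits.shape[-1]
    evidence = F.softplus(logits)                  # non-negative evidence
    # Dirichlet concentration with the prior weight W as a free
    # hyperparameter instead of being fixed to the number of classes.
    alpha = evidence + prior_weight / num_classes
    # Projected probability: the expectation of the Dirichlet PDF.
    prob = alpha / alpha.sum(dim=-1, keepdim=True)
    # Optimize the expectation directly: no variance-minimizing term,
    # no KL-divergence regularizer.
    return F.nll_loss(prob.clamp_min(1e-8).log(), targets)

def vacuity(logits, prior_weight=10.0):
    """Subjective-logic vacuity: uncertainty is high when total evidence is low."""
    evidence = F.softplus(logits)
    return prior_weight / (evidence.sum(dim=-1) + prior_weight)
```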
Text-only Synthesis for Image Captioning
Zhou, Qing, Huang, Junlin, Li, Qiang, Gao, Junyu, Wang, Qi
From paired image-text training to text-only training for image captioning, the pursuit of relaxing the requirements for high-cost, large-scale annotation of good-quality data remains consistent. In this paper, we propose Text-only Synthesis for Image Captioning (ToCa), which further advances this relaxation with less human labor and computing time. Specifically, we deconstruct caption text into structures and lexical words, which serve as the fundamental components of the caption. By combining different structures and lexical words as inputs to the large language model, massive captions that contain various patterns of lexical words are generated. This method not only approaches the target domain but also surpasses it by generating new captions, thereby enhancing the zero-shot generalization ability of the model. Considering the different levels of data access in the real world, we define three synthesis scenarios: cross-domain synthesis, in-domain synthesis, and data-efficient synthesis. Experiments in these scenarios demonstrate the generalizability, transferability, and practicability of ToCa, with a nearly 5 CIDEr improvement for zero-shot cross-domain captioning and a maximum increase of over 20 CIDEr for data-efficient captioning.
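As a rough illustration of the synthesis step, the snippet below pairs caption structures with lexical words and builds LLM prompts from the combinations. The templates, word lists, and prompt wording are invented for illustration; they are not ToCa's actual prompts.

```python
import itertools

structures = [
    "a {obj} {verb} on a {place}",
    "the {obj} is {verb} near the {place}",
]
lexical_words = {
    "obj": ["dog", "surfer"],
    "verb": ["running", "balancing"],
    "place": ["beach", "wave"],
}

def build_prompts(structures, words):
    """Combine caption structures and lexical words into synthesis prompts."""
    for tmpl, obj, verb, place in itertools.product(
        structures, words["obj"], words["verb"], words["place"]
    ):
        skeleton = tmpl.format(obj=obj, verb=verb, place=place)
        yield (
            "Rewrite the following caption skeleton as a fluent, natural "
            f"image caption, keeping its content words: '{skeleton}'"
        )

for prompt in build_prompts(structures, lexical_words):
    print(prompt)  # each prompt would be sent to the LLM of choice
```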
Like Humans to Few-Shot Learning through Knowledge Permeation of Vision and Text
Jia, Yuyu, Zhou, Qing, Huang, Wei, Gao, Junyu, Wang, Qi
Few-shot learning aims to generalize the recognizer from seen categories to an entirely novel scenario. With only a few support samples, several advanced methods initially introduce class names as prior knowledge for identifying novel classes. However, obstacles still impede a comprehensive understanding of how to harness the mutual advantages of visual and textual knowledge. In this paper, we propose a coherent Bidirectional Knowledge Permeation strategy called BiKop, grounded in a human intuition: a class name description offers a general representation, whereas an image captures the specificity of individuals. BiKop primarily establishes a hierarchical joint general-specific representation through bidirectional knowledge permeation. In addition, considering the bias of the joint representation towards the base set, we disentangle base-class-relevant semantics during training, thereby alleviating the suppression of potential novel-class-relevant information. Experiments on four challenging benchmarks demonstrate the remarkable superiority of BiKop. Our code will be publicly available.
Learning Transferable Conceptual Prototypes for Interpretable Unsupervised Domain Adaptation
Gao, Junyu, Ma, Xinhong, Xu, Changsheng
Despite the great progress of unsupervised domain adaptation (UDA) with deep neural networks, current UDA models are opaque and cannot provide promising explanations, limiting their application in scenarios that require safe and controllable model decisions. At present, a surge of work focuses on designing deep interpretable methods with adequate data annotations, and only a few methods consider the distributional shift problem. Most existing interpretable UDA methods are post-hoc ones, which cannot facilitate the model learning process for performance enhancement. In this paper, we propose an inherently interpretable method, named Transferable Conceptual Prototype Learning (TCPL), which can simultaneously interpret and improve the processes of knowledge transfer and decision-making in UDA. To achieve this goal, we design a hierarchically prototypical module that transfers categorical basic concepts from the source domain to the target domain and learns domain-shared prototypes for explaining the underlying reasoning process. With the learned transferable prototypes, a self-predictive consistent pseudo-label strategy that fuses confidence, predictions, and prototype information is designed for selecting suitable target samples for pseudo annotation and gradually narrowing the domain gap. Comprehensive experiments show that the proposed method can not only provide effective and intuitive explanations but also outperform previous state-of-the-art methods.
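A small sketch of how a pseudo-label filter fusing confidence, predictions, and prototype information might look, assuming one learned prototype per class; the threshold and the agreement rule are our assumptions, not TCPL's exact strategy.

```python
import torch
import torch.nn.functional as F

def select_pseudo_labels(logits, features, prototypes, tau=0.9):
    """Consistency-based pseudo-label filter (a sketch under our assumptions).

    logits: (n, K) classifier outputs on target samples.
    features: (n, d) target features; prototypes: (K, d), one per class.
    Keeps samples whose classifier prediction agrees with the nearest
    prototype and whose confidence exceeds tau.
    """
    probs = F.softmax(logits, dim=-1)
    conf, cls_pred = probs.max(dim=-1)
    # Prototype-based prediction via cosine similarity.
    sims = F.normalize(features, dim=-1) @ F.normalize(prototypes, dim=-1).T
    proto_pred = sims.argmax(dim=-1)
    keep = (cls_pred == proto_pred) & (conf > tau)
    return keep, cls_pred  # mask of selected samples and their pseudo-labels
```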
Combating Data Imbalances in Federated Semi-supervised Learning with Dual Regulators
Bai, Sikai, Li, Shuaicheng, Zhuang, Weiming, Zhang, Jie, Guo, Song, Yang, Kunlin, Hou, Jun, Zhang, Shuai, Gao, Junyu, Yi, Shuai
Federated learning has become a popular method to learn from decentralized heterogeneous data. Federated semi-supervised learning (FSSL) emerges to train models from a small fraction of labeled data due to label scarcity on decentralized clients. Existing FSSL methods assume independent and identically distributed (IID) labeled data across clients and consistent class distribution between labeled and unlabeled data within a client. This work studies a more practical and challenging scenario of FSSL, where data distribution differs not only across clients but also within a client, between labeled and unlabeled data. To address this challenge, we propose a novel FSSL framework with dual regulators, FedDure. FedDure lifts the previous assumption with a coarse-grained regulator (C-reg) and a fine-grained regulator (F-reg): C-reg regularizes the updating of the local model by tracking the learning effect on labeled data distribution; F-reg learns an adaptive weighting scheme tailored for unlabeled instances in each client. We further formulate the client model training as bi-level optimization that adaptively optimizes the model in the client with the two regulators. Theoretically, we show the convergence guarantee of the dual regulators. Empirically, we demonstrate that FedDure is superior to existing methods across a wide range of settings, notably by more than 11% on the CIFAR-10 and CINIC-10 datasets.
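To sketch the fine-grained regulator's role, the snippet below weights per-instance pseudo-label losses with a learned module. This is a loose simplification of our own: it omits the coarse-grained regulator and the bi-level optimization, and the module shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def dual_regulated_client_loss(model, f_reg, labeled, unlabeled):
    """Sketch of an F-reg-weighted client objective (our simplification).

    model: classifier nn.Module; f_reg: nn.Module mapping an unlabeled
    input batch to per-sample weight logits of shape (n, 1).
    labeled: (x_l, y_l) tensors; unlabeled: x_u tensor.
    """
    x_l, y_l = labeled
    sup = F.cross_entropy(model(x_l), y_l)          # supervised term

    x_u = unlabeled
    with torch.no_grad():
        pseudo = model(x_u).argmax(dim=-1)          # pseudo-labels
    w = torch.sigmoid(f_reg(x_u)).squeeze(-1)       # adaptive instance weights
    unsup = (w * F.cross_entropy(model(x_u), pseudo, reduction="none")).mean()
    return sup + unsup
```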
Unsupervised Domain Adaptive Learning via Synthetic Data for Person Re-identification
Wang, Qi, Bai, Sikai, Gao, Junyu, Yuan, Yuan, Li, Xuelong
Person re-identification (re-ID) has gained more and more attention due to its widespread applications in intelligent video surveillance. Unfortunately, the mainstream deep learning methods still need a large quantity of labeled data to train models, and annotating data is expensive work in real-world scenarios. In addition, due to domain gaps between different datasets, the performance is dramatically decreased when re-ID models pre-trained on label-rich datasets (source domain) are directly applied to other unlabeled datasets (target domain). In this paper, we attempt to remedy these problems from two aspects, namely data and methodology. Firstly, we develop a data collector to automatically generate synthetic re-ID samples in a computer game, and construct a data labeler to simultaneously annotate them, which frees humans from heavy data collection and annotation. Based on them, we build two synthetic person re-ID datasets of different scales, the "GSPR" and "mini-GSPR" datasets. Secondly, we propose a synthesis-based ...
PCC Net: Perspective Crowd Counting via Spatial Convolutional Network
Gao, Junyu, Wang, Qi, Li, Xuelong
Crowd counting from a single image is a challenging task due to high appearance similarity, perspective changes, and severe congestion. Many methods focus only on local appearance features and cannot handle the aforementioned challenges. To tackle them, we propose a Perspective Crowd Counting Network (PCC Net), which consists of three parts: 1) Density Map Estimation (DME) focuses on learning very local features for density map estimation; 2) Random High-level Density Classification (R-HDC) extracts global features to predict the coarse density labels of random patches in images; 3) Fore-/Background Segmentation (FBS) encodes mid-level features to segment the foreground and background. Besides, the DULR module is embedded in PCC Net to encode the perspective changes in four directions (Down, Up, Left, and Right). The proposed PCC Net is verified on five mainstream datasets, achieving state-of-the-art performance on one and competitive results on the other four. The source code is available at https://github.com/gjy3035/PCC-Net.
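An illustrative PyTorch layout of the three branches named above (DME, R-HDC, FBS) over a shared feature extractor; the channel sizes and layer choices are placeholders rather than the released architecture, and the DULR module is only indicated by a comment.

```python
import torch
import torch.nn as nn

class PCCNetSketch(nn.Module):
    """Three-branch layout in the spirit of PCC Net (sizes are assumptions)."""

    def __init__(self, num_density_classes=10):
        super().__init__()
        self.backbone = nn.Sequential(               # shared local features
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        )
        self.dme = nn.Conv2d(64, 1, 1)               # density map estimation
        self.fbs = nn.Conv2d(64, 2, 1)               # fore-/background mask
        self.rhdc = nn.Sequential(                   # coarse density class
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, num_density_classes),
        )

    def forward(self, x):
        feat = self.backbone(x)
        # A DULR-style module would sweep the features Down/Up/Left/Right
        # here to encode perspective changes before the three heads.
        return self.dme(feat), self.fbs(feat), self.rhdc(feat)

# Usage: density, mask, density_cls = PCCNetSketch()(torch.randn(1, 3, 224, 224))
```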