A Implementation Details

Neural Information Processing Systems

Dataset Sampling: For the semi-supervised learning experiments with 1% (10%) labels, we sample 1% (10%) of the images per class from each dataset. The final Semi-ViT-Small uses DINO self-pretraining.
Supervised Fine-tuning Settings: The settings for the supervised fine-tuning stage, with and without self-pretraining, are shown in Table 12.
Computing Resources: We run all experiments on V100 GPUs with 32 GB of memory.
Random Seeds and Error Bars: Some of the experiments, e.g., Semi-ViT-Huge, are expensive to run.
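The per-class sampling described above can be sketched as follows. This is a minimal toy illustration under assumed conventions (a dataset as a list of `(path, label)` pairs, a `sample_per_class` helper); it is not the authors' actual data pipeline.

```python
# Hypothetical sketch of per-class subset sampling for 1% / 10% label splits.
# The dataset layout and helper name are illustrative assumptions.
import random
from collections import defaultdict

random.seed(0)

def sample_per_class(samples, fraction):
    """Keep `fraction` of the samples of each class as the labeled split."""
    by_class = defaultdict(list)
    for path, label in samples:
        by_class[label].append(path)
    labeled = []
    for label, paths in by_class.items():
        # Sample at least one image per class, even for tiny fractions.
        k = max(1, round(fraction * len(paths)))
        labeled += [(p, label) for p in random.sample(paths, k)]
    return labeled

# Toy dataset: 100 "images" per class for two classes.
dataset = [(f"img_{c}_{i}.jpg", c) for c in (0, 1) for i in range(100)]
print(len(sample_per_class(dataset, 0.01)))  # → 2 (1 image per class)
```

The remaining images would serve as the unlabeled pool for semi-supervised fine-tuning.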


Semi-supervised Vision Transformers at Scale

Neural Information Processing Systems

We study semi-supervised learning (SSL) for vision transformers (ViT), an under-explored topic despite the wide adoption of ViT architectures across different tasks.


Diffusion Models and Semi-Supervised Learners Benefit Mutually with Few Labels

You, Zebin, Zhong, Yong, Bao, Fan, Sun, Jiacheng, Li, Chongxuan, Zhu, Jun

arXiv.org Artificial Intelligence

In an effort to further advance semi-supervised generative and classification tasks, we propose a simple yet effective training strategy called dual pseudo training (DPT), built upon strong semi-supervised learners and diffusion models. DPT operates in three stages: training a classifier on partially labeled data to predict pseudo-labels; training a conditional generative model using these pseudo-labels to generate pseudo images; and retraining the classifier with a mix of real and pseudo images. Empirically, DPT consistently achieves state-of-the-art (SOTA) performance on semi-supervised generation and classification across various settings. In particular, with one or two labels per class, DPT achieves a Fréchet Inception Distance (FID) score of 3.08 or 2.52 on ImageNet 256×256. In addition, DPT substantially outperforms competitive semi-supervised baselines on ImageNet classification, achieving top-1 accuracies of 59.0 (+2.8), 69.5 (+3.0), and 74.4 (+2.0) with one, two, or five labels per class, respectively. Notably, our results demonstrate that diffusion can generate realistic images with only a few labels (e.g., <0.1%) and that generative augmentation remains viable for semi-supervised classification. Our code is available at https://github.com/ML-GSAI/DPT.
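The three DPT stages can be illustrated on toy 1-D data. This is a hedged sketch only: the nearest-class-mean "classifier" and per-class Gaussian "generator" stand in for the strong semi-supervised classifier and conditional diffusion model used in the paper, and all function names are illustrative assumptions.

```python
# Toy sketch of the three DPT stages; the stand-in models are assumptions,
# not the paper's actual classifier or diffusion model.
import random

random.seed(0)

# Toy 1-D data: class 0 clusters near 0.0, class 1 near 1.0.
labeled = [(0.1, 0), (0.9, 1)]                      # one label per class
unlabeled = [random.gauss(c, 0.1) for c in (0, 1) for _ in range(50)]

def train_classifier(pairs):
    # Nearest-class-mean rule stands in for a semi-supervised classifier.
    by_class = {}
    for x, y in pairs:
        by_class.setdefault(y, []).append(x)
    centers = {y: sum(xs) / len(xs) for y, xs in by_class.items()}
    return lambda x: min(centers, key=lambda y: abs(x - centers[y]))

# Stage 1: train on partially labeled data, predict pseudo-labels.
clf = train_classifier(labeled)
pseudo_labeled = [(x, clf(x)) for x in unlabeled]

def train_generator(pairs):
    # Per-class Gaussian sampler stands in for a conditional diffusion model.
    by_class = {}
    for x, y in pairs:
        by_class.setdefault(y, []).append(x)
    stats = {y: (sum(xs) / len(xs), 0.1) for y, xs in by_class.items()}
    return lambda y: random.gauss(*stats[y])

# Stage 2: train a conditional generator on pseudo-labels, sample pseudo images.
gen = train_generator(pseudo_labeled)
pseudo_images = [(gen(y), y) for y in (0, 1) for _ in range(25)]

# Stage 3: retrain the classifier on real plus generated data.
clf2 = train_classifier(labeled + pseudo_images)
print(clf2(0.05), clf2(0.95))  # → 0 1
```

The point of the sketch is the data flow: pseudo-labels feed the generator, and the generator's samples augment the classifier's training set.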


Transfer Learning for Fine-grained Classification Using Semi-supervised Learning and Visual Transformers

Lagunas, Manuel, Impata, Brayan, Martinez, Victor, Fernandez, Virginia, Georgakis, Christos, Braun, Sofia, Bertrand, Felipe

arXiv.org Artificial Intelligence

Fine-grained classification is a challenging task that involves identifying subtle differences between objects within the same category. This task is particularly difficult in scenarios where data is scarce. Visual transformers (ViT) have recently emerged as a powerful tool for image classification, due to their ability to learn highly expressive representations of visual data using self-attention mechanisms. In this work, we explore Semi-ViT, a ViT model fine-tuned using semi-supervised learning techniques, suited to situations where annotated data is scarce. This is particularly common in e-commerce, where images are readily available but labels are noisy, nonexistent, or expensive to obtain. Our results demonstrate that Semi-ViT outperforms traditional convolutional neural networks (CNN) and ViTs, even when fine-tuned with limited annotated data. These findings indicate that Semi-ViTs hold significant promise for applications that require precise and fine-grained classification of visual data.


Semi-supervised Vision Transformers at Scale

Cai, Zhaowei, Ravichandran, Avinash, Favaro, Paolo, Wang, Manchen, Modolo, Davide, Bhotika, Rahul, Tu, Zhuowen, Soatto, Stefano

arXiv.org Artificial Intelligence

We study semi-supervised learning (SSL) for vision transformers (ViT), an under-explored topic despite the wide adoption of ViT architectures across different tasks. To tackle this problem, we propose a new SSL pipeline, consisting of first un/self-supervised pre-training, followed by supervised fine-tuning, and finally semi-supervised fine-tuning. At the semi-supervised fine-tuning stage, we adopt an exponential moving average (EMA)-Teacher framework instead of the popular FixMatch, since the former is more stable and delivers higher accuracy for semi-supervised vision transformers. In addition, we propose a probabilistic pseudo mixup mechanism to interpolate unlabeled samples and their pseudo labels for improved regularization, which is important for training ViTs with weak inductive bias. Our proposed method, dubbed Semi-ViT, achieves comparable or better performance than its CNN counterparts in the semi-supervised classification setting. Semi-ViT also enjoys the scalability benefits of ViTs and can be readily scaled up to large models with increasing accuracy. For example, Semi-ViT-Huge achieves an impressive 80% top-1 accuracy on ImageNet using only 1% of the labels, comparable with Inception-v4 trained on 100% of the ImageNet labels.
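The two ingredients named in the abstract, the EMA-Teacher update and mixup of unlabeled samples with their pseudo-labels, can be sketched in a few lines. This is a minimal illustration under assumed simplifications (parameters as flat lists of floats, labels as soft one-hot lists); the function names are hypothetical, and the real method applies these updates to full network weights and images.

```python
# Hedged sketch of an EMA-Teacher update and pseudo-label mixup.
# Flat float lists stand in for network parameters and images.
import random

random.seed(0)

def ema_update(teacher, student, momentum=0.999):
    # Teacher parameters track an exponential moving average of the student's.
    return [momentum * t + (1 - momentum) * s for t, s in zip(teacher, student)]

def pseudo_mixup(x1, y1, x2, y2, alpha=0.8):
    # Interpolate two unlabeled samples and their soft pseudo-labels
    # with a Beta-distributed mixing coefficient, as in standard mixup.
    lam = random.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y

teacher = [0.0, 0.0]
student = [1.0, -1.0]
teacher = ema_update(teacher, student)   # teacher moves 0.1% toward student

mixed_x, mixed_y = pseudo_mixup([0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
```

In the actual pipeline the teacher produces the pseudo-labels for weakly augmented unlabeled images, while the student trains on the mixed samples; only the student receives gradient updates.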