Hard View Selection for Self-Supervised Learning

Ferreira, Fabio, Rapant, Ivo, Hutter, Frank

arXiv.org Artificial Intelligence 

Many Self-Supervised Learning (SSL) methods train their models to be invariant to different "views" of an image, and considerable effort has been directed towards improving pretext tasks, architectures, or robustness. However, most SSL methods still rely on randomly sampling operations within the image augmentation pipeline, such as the random resized crop operation. We argue that the role of view generation and its effect on performance has so far received insufficient attention. To address this, we propose a simple, learning-free, yet powerful Hard View Selection (HVS) strategy that extends random view generation to expose the model to harder samples during SSL training. It comprises the following iterative steps: 1) randomly sample multiple views and form pairs of two views, 2) run forward passes for each view pair on the currently trained model, 3) adversarially select the pair yielding the worst loss given the current model state, and 4) run the backward pass with the selected pair. As a result, HVS consistently achieves accuracy improvements between 0.91% and 1.93% on ImageNet linear evaluation and similar improvements on transfer tasks across DINO, SimSiam, iBOT, and SimCLR. We provide studies that shed light on its inner workings and show that, by simply using smaller-resolution images for the selection step, we can significantly reduce the computational overhead while retaining performance. Surprisingly, even when accounting for the computational overhead incurred by HVS, we achieve performance gains between 0.52% and 1.02% and closely rival the 800-epoch DINO pretraining with only 300 epochs.

Various approaches exist to learn effective and generalizable visual representations in Self-Supervised Learning (SSL). Many of them train models on multiple "views" of an image. Such views are generated by applying a sequence of (randomly sampled) image transformations, usually composed of geometric (cropping, rotation, etc.) and appearance (color distortion, blurring, etc.)
transformations. A body of literature (Chen et al., 2020a; Wu et al., 2020; Purushwalkam & Gupta, 2020; Wagner et al., 2022; Tian et al., 2020b) has illuminated the effects of image views on representation learning and identified the random resized crop (RRC) transformation as particularly influential. However, despite this finding and to the best of our knowledge, little research has gone into identifying more effective ways of selecting or generating views to improve performance.
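The four iterative HVS steps from the abstract can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the augmentation, linear "model", and squared-distance loss are placeholder stand-ins chosen only to make the selection logic concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image, rng):
    # Placeholder for the random view generation pipeline
    # (RRC + appearance transformations in the actual paper).
    return image * rng.uniform(0.5, 1.5) + rng.normal(0.0, 0.1, size=image.shape)

def pair_loss(model_params, view_a, view_b):
    # Toy SSL objective: squared distance between linearly embedded views.
    za = view_a @ model_params
    zb = view_b @ model_params
    return float(np.mean((za - zb) ** 2))

def hard_view_selection(image, model_params, n_pairs=4, rng=rng):
    """Steps 1-3 of HVS: sample several view pairs, score each with the
    current model, and return the pair with the worst (highest) loss.
    Step 4 (the backward pass) would then use only this pair."""
    pairs = [(augment(image, rng), augment(image, rng)) for _ in range(n_pairs)]
    losses = [pair_loss(model_params, a, b) for a, b in pairs]
    hardest = int(np.argmax(losses))  # adversarial selection
    return pairs[hardest], losses[hardest]

image = rng.normal(size=(8,))
params = rng.normal(size=(8, 4))
(view_a, view_b), worst_loss = hard_view_selection(image, params)
```

The paper's resolution trick fits naturally here: the forward passes used only for selection can be run on downscaled views, with the full-resolution pair used for the backward pass.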