Dual Perspectives on Non-Contrastive Self-Supervised Learning
Jean Ponce, Basile Terver, Martial Hebert, Michael Arbel
–arXiv.org Artificial Intelligence
The stop gradient and exponential moving average iterative procedures are commonly used in non-contrastive approaches to self-supervised learning to avoid representation collapse, with excellent performance in downstream applications in practice. This presentation investigates these procedures from the dual viewpoints of optimization and dynamical systems. We show that, in general, although they do not optimize the original objective, or any other smooth function, they do avoid collapse. Following Tian et al. (2021), but without any of the extra assumptions used in their proofs, we then show, using a dynamical-systems perspective, that in the linear case minimizing the original objective function without the use of a stop gradient or exponential moving average always leads to collapse. Conversely, we explicitly characterize the equilibria of the dynamical systems associated with these two procedures in this linear setting as algebraic varieties in their parameter space, and show that they are, in general, asymptotically stable. Our theoretical findings are illustrated by empirical experiments with real and synthetic data.

Self-supervised learning (SSL) is an approach to representation learning that exploits the internal consistency of training data without requiring expensive annotations. However, non-contrastive approaches to SSL (Assran et al., 2023; Bardes et al., 2022), which take as input different views of the same data samples and learn to predict one view from the other, are susceptible to representational collapse, where a constant embedding is learned for all data points (LeCun, 2022). In this presentation we use the dual viewpoints of optimization and dynamical systems to study, theoretically and empirically, the well-known stop gradient (Chen and He, 2021) and exponential moving average (Grill et al., 2020) training procedures that are specifically designed to avoid this problem.
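The collapse behavior described above can be illustrated numerically. The sketch below (a hedged toy example; the encoder/predictor names, dimensions, and hyperparameters are all illustrative assumptions, not the paper's setup) follows the linear case: an encoder `We` (the parameters θ) and a predictor `Wp` (the parameters ψ), with loss E(θ, ψ) = ‖Wp We x₁ − We x₂‖² over two noisy views of each sample. Gradients flow through *both* branches, i.e., no stop gradient and no exponential moving average, and the embedding variance shrinks toward collapse:

```python
import numpy as np

# Toy linear setting (illustrative assumptions throughout):
# encoder We (theta), predictor Wp (psi),
# E(theta, psi) = ||Wp We X1 - We X2||_F^2 over noisy views X1, X2.
rng = np.random.default_rng(0)
d, n, sigma, lr, steps = 4, 256, 0.3, 0.05, 2000

X = rng.standard_normal((d, n))
We = 0.1 * rng.standard_normal((d, d))  # encoder weights
Wp = 0.1 * rng.standard_normal((d, d))  # predictor weights

emb_var_init = (We @ X).var()

for _ in range(steps):
    X1 = X + sigma * rng.standard_normal((d, n))  # view 1
    X2 = X + sigma * rng.standard_normal((d, n))  # view 2
    R = Wp @ We @ X1 - We @ X2                    # prediction residual
    # Full gradients of E: the target branch We @ X2 is NOT detached,
    # so the noise terms uniformly shrink We toward zero (collapse).
    gWp = 2 * R @ (We @ X1).T / n
    gWe = 2 * (Wp.T @ R @ X1.T - R @ X2.T) / n
    Wp -= lr * gWp
    We -= lr * gWe

emb_var_final = (We @ X).var()
print(emb_var_init, emb_var_final)  # the final embedding variance is far smaller
```

A stop-gradient variant would simply drop the `- R @ X2.T` term in `gWe` (treating the target branch as a constant), which removes the component of the gradient that flows through the second view.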
Figure: Here C is the global minimum of E(θ, ψ) (shown as negative instead of zero for readability), associated with a collapse of the training process; B is a nontrivial local minimum one may reach using an appropriate regularization to avoid collapse; and A is a limit point of the stop gradient (SG) training procedure associated with the parameters θ and ψ at convergence. In general, A is not a minimum of E, and thus does not correspond to a collapse of the training process, but it is a minimum of E(θ, ψ) with respect to ψ.
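The property of point A, a minimum in ψ that is not a stationary point of E itself, can be checked directly in the linear case. In the hedged sketch below (names, dimensions, and the closed-form least-squares predictor are illustrative assumptions), the predictor `Wp` (ψ) is chosen optimally for a fixed encoder `We` (θ) via the normal equations; the gradient of E in ψ then vanishes while the full gradient in θ does not:

```python
import numpy as np

# Illustrative check: at a minimum of E(theta, .) in psi, grad_psi E = 0
# but the full grad_theta E is generally nonzero, so the point is not a
# minimum (or even a stationary point) of E itself.
rng = np.random.default_rng(1)
d, n = 4, 256
X1 = rng.standard_normal((d, n))              # view 1
X2 = X1 + 0.1 * rng.standard_normal((d, n))   # view 2 (perturbed copy)
We = rng.standard_normal((d, d))              # fixed encoder (theta)

Z1, Z2 = We @ X1, We @ X2
# Optimal predictor in psi: least-squares solution of Wp Z1 ~ Z2.
Wp = Z2 @ Z1.T @ np.linalg.inv(Z1 @ Z1.T)

R = Wp @ Z1 - Z2                              # residual at (theta, psi*)
g_psi = 2 * R @ Z1.T                          # grad of E w.r.t. Wp (vanishes)
g_theta = 2 * (Wp.T @ R @ X1.T - R @ X2.T)    # full grad of E w.r.t. We

print(np.linalg.norm(g_psi), np.linalg.norm(g_theta))
```

This mirrors the caption's description of A: a minimum of E(θ, ·) with respect to ψ that is not a minimum of E over (θ, ψ) jointly.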
Oct-15-2025