The Mechanism of Prediction Head in Non-contrastive Self-supervised Learning

Neural Information Processing Systems 

The surprising discovery of the BYOL method shows that negative samples can be replaced by adding a prediction head to the network. It remains mysterious why, even when trivial collapsed globally optimal solutions exist, neural networks trained by (stochastic) gradient descent can still learn competitive representations.
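
To make the notion of a collapsed global optimum concrete, the following is a minimal NumPy sketch, assuming a BYOL-style objective: the squared distance between the L2-normalized online prediction and the L2-normalized target projection. The linear encoder, the linear prediction head, the batch and feature sizes, and all names (`byol_style_loss`, `W_enc`, `W_pred`) are hypothetical choices made for illustration and are not taken from the paper. The sketch only evaluates the loss; it does not train anything or model the stop-gradient and target-network updates used in practice.

```python
import numpy as np


def l2_normalize(v, eps=1e-12):
    """Scale each row of v to (approximately) unit Euclidean norm."""
    return v / (np.linalg.norm(v, axis=1, keepdims=True) + eps)


def byol_style_loss(pred_online, proj_target):
    """Mean of 2 - 2 * cosine similarity between matching rows.

    This equals the mean squared distance between the normalized rows,
    so it is bounded below by 0 (up to the eps used in normalization).
    In training, proj_target would be treated as a constant (stop-gradient);
    here only the loss value is computed.
    """
    p = l2_normalize(pred_online)
    z = l2_normalize(proj_target)
    return float(np.mean(2.0 - 2.0 * np.sum(p * z, axis=1)))


rng = np.random.default_rng(0)
n, d_in, d_out = 8, 16, 4  # hypothetical batch size and dimensions

# Two "augmented views" of the same batch; additive noise stands in
# for a real data augmentation.
x = rng.normal(size=(n, d_in))
view1 = x + 0.1 * rng.normal(size=x.shape)
view2 = x + 0.1 * rng.normal(size=x.shape)

# Randomly initialized linear encoder and linear prediction head.
W_enc = rng.normal(size=(d_in, d_out))
W_pred = rng.normal(size=(d_out, d_out))
loss_random = byol_style_loss(view1 @ W_enc @ W_pred, view2 @ W_enc)

# Collapsed solution: the encoder ignores its input and emits the same
# vector for every sample; the prediction head is the identity.
constant_features = np.ones((n, d_out))
loss_collapsed = byol_style_loss(constant_features @ np.eye(d_out),
                                 constant_features)

print("loss at random initialization:", loss_random)
print("loss at collapsed solution:   ", loss_collapsed)
```

Under these assumptions the collapsed representation reaches the lower bound of the loss (approximately 0) while carrying no information about the input, which is why it is surprising that gradient-based training does not simply converge to it.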