Improving Pre-Trained Self-Supervised Embeddings Through Effective Entropy Maximization
Deep Chakraborty, Yann LeCun, Tim G. J. Rudner, Erik Learned-Miller
Self-supervised learning (SSL) methods are widely employed for pre-training features on unlabeled data and are highly effective for subsequent fine-tuning on a wide variety of downstream tasks [Che+20; Gri+20; Car+20; BPL21]. In this paper, we ask whether it is possible to formulate a well-motivated, general-purpose criterion that allows us to further improve already-trained, highly optimized SSL embeddings with only a handful of epochs of continued pre-training. Like several previous works [BJ17; WI20; Liu+22; Ozs+22], we start with the principle of maximizing the entropy of the embeddings. One well-known motivation is that, for a discrete embedding space, maximizing the entropy of a deterministic mapping preserves as much information as possible about the inputs. That is, such a maximum-entropy embedding maximizes the mutual information between the embedding and the input distribution [see, for example, Hje+18]. Similar results hold for continuous embeddings under appropriate noise models [see, for example, the discussion of the Gaussian channel in CT91]. By maximizing the amount of information retained, one hopes to prepare as well as possible for future, as-yet-unknown discrimination tasks. Our contribution is thus not the maximization of embedding entropy itself, but rather how we go about it.
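As a minimal sketch of this information-theoretic argument (the notation is ours, not the paper's: $X$ denotes the input and $Z = f(X)$ a deterministic embedding over a discrete embedding space):

\begin{align*}
  I(X; Z) &= H(Z) - H(Z \mid X) \\
          &= H(Z), \qquad \text{since } H(Z \mid X) = 0 \text{ when } Z = f(X) \text{ is deterministic,}
\end{align*}

so, in the discrete case, maximizing the embedding entropy $H(Z)$ is exactly maximizing the mutual information $I(X; Z)$ between the embedding and the input.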
Nov-24-2024