Collins, Edo, Bigdeli, Siavash Arjomand, Süsstrunk, Sabine

We propose a new notion of'non-linearity' of a network layer with respect to an input batch that is based on its proximity to a linear system, which is reflected in the nonnegative rank of the activation matrix. Considering batches of similar samples, we find that high non-linearity in deep layers is indicative of memorization. Furthermore, by applying our approach layer-by-layer, we find that the mechanism for memorization consists of distinct phases. We perform experiments on fully-connected and convolutional neural networks trained on several image and audio datasets. Our results demonstrate that as an indicator for memorization, our technique can be used to perform early stopping. A fundamental challenge in machine learning is balancing the bias-variance tradeoff, where overly simple learning models underfit the data (suboptimal performance on the training data) and overly complex models are expected to overfit or memorize the data (perfect training set performance, but suboptimal test set performance). The latter direction of this tradeoff has come into question with the observation that deep neural networks do not memorize their training data despite having sufficient capacity to do so (Zhang et al., 2016), the explanation of which is a matter of much interest.

Deep neural networks (DNNs) can easily fit a random labeling of the training data with zero training error. What is the difference between DNNs trained with random labels and the ones trained with true labels? Our paper answers this question with two contributions. First, we study the memorization properties of DNNs. Our empirical experiments shed light on how DNNs prioritize the learning of simple input patterns. In the second part, we propose to measure the similarity between what different DNNs have learned and memorized. With the proposed approach, we analyze and compare DNNs trained on data with true labels and random labels. The analysis shows that DNNs have \textit{One way to Learn} and \textit{N ways to Memorize}. We also use gradient information to gain an understanding of the analysis results.

Cohen, Gilad, Sapiro, Guillermo, Giryes, Raja

This paper studies the relationship between the classification performed by deep neural networks and the $k$-NN decision at the embedding space of these networks. This simple important connection shown here provides a better understanding of the relationship between the ability of neural networks to generalize and their tendency to memorize the training data, which are traditionally considered to be contradicting to each other and here shown to be compatible and complementary. Our results support the conjecture that deep neural networks approach Bayes optimal error rates.

Zhang, Hongyi, Cisse, Moustapha, Dauphin, Yann N., Lopez-Paz, David

Large deep neural networks are powerful, but exhibit undesirable behaviors such as memorization and sensitivity to adversarial examples. In this work, we propose mixup, a simple learning principle to alleviate these issues. In essence, mixup trains a neural network on convex combinations of pairs of examples and their labels. By doing so, mixup regularizes the neural network to favor simple linear behavior in-between training examples. Our experiments on the ImageNet-2012, CIFAR-10, CIFAR-100, Google commands and UCI datasets show that mixup improves the generalization of state-of-the-art neural network architectures. We also find that mixup reduces the memorization of corrupt labels, increases the robustness to adversarial examples, and stabilizes the training of generative adversarial networks.

The roles played by learning and memorization represent an important topic in deep learning research. Recent work on this subject has shown that the optimization behavior of DNNs trained on shuffled labels is qualitatively different from DNNs trained with real labels. Here, we propose a novel permutation approach that can differentiate memorization from learning in deep neural networks (DNNs) trained as usual (i.e., using the real labels to guide the learning, rather than shuffled labels). The evaluation of weather the DNN has learned and/or memorized, happens in a separate step where we compare the predictive performance of a shallow classifier trained with the features learned by the DNN, against multiple instances of the same classifier, trained on the same input, but using shuffled labels as outputs. By evaluating these shallow classifiers in validation sets that share structure with the training set, we are able to tell apart learning from memorization. Application of our permutation approach to multi-layer perceptrons and convolutional neural networks trained on image data corroborated many findings from other groups. Most importantly, our illustrations also uncovered interesting dynamic patterns about how DNNs memorize over increasing numbers of training epochs, and support the surprising result that DNNs are still able to learn, rather than only memorize, when trained with pure Gaussian noise as input.