effective size
The Heuristic Core: Understanding Subnetwork Generalization in Pretrained Language Models
Bhaskar, Adithya, Friedman, Dan, Chen, Danqi
Prior work has found that pretrained language models (LMs) fine-tuned with different random seeds can achieve similar in-domain performance but generalize differently on tests of syntactic generalization. In this work, we show that, even within a single model, we can find multiple subnetworks that perform similarly in-domain, but generalize vastly differently. To better understand these phenomena, we investigate if they can be understood in terms of "competing subnetworks": the model initially represents a variety of distinct algorithms, corresponding to different subnetworks, and generalization occurs when it ultimately converges to one. This explanation has been used to account for generalization in simple algorithmic tasks ("grokking"). Instead of finding competing subnetworks, we find that all subnetworks -- whether they generalize or not -- share a set of attention heads, which we refer to as the heuristic core. Further analysis suggests that these attention heads emerge early in training and compute shallow, non-generalizing features. The model learns to generalize by incorporating additional attention heads, which depend on the outputs of the "heuristic" heads to compute higher-level features. Overall, our results offer a more detailed picture of the mechanisms for syntactic generalization in pretrained LMs.
Effective Size of Receptive Fields of Inferior Temporal Visual Cortex Neurons in Natural Scenes
Inferior temporal cortex (IT) neurons have large receptive fields when a single effective object stimulus is shown against a blank background, but have much smaller receptive fields when the object is placed in a natural scene. Thus, translation invariant object recognition is reduced in natural scenes, and this may help object selection. We describe a model which accounts for this by competition within an attractor in which the neurons are tuned to different objects in the scene, and the fovea has a higher cortical magnification factor than the peripheral visual field. Further- more, we show that top-down object bias can increase the receptive field size, facilitating object search in complex visual scenes, and providing a model of object-based attention. The model leads to the prediction that introduction of a second object into a scene with blank background will reduce the receptive field size to values that depend on the closeness of the second object to the target stimulus. We suggest that mechanisms of this type enable the output of IT to be primarily about one object, so that the areas that receive from IT can select the object as a potential target for action.
Consistent Batch Normalization for Weighted Loss in Imbalanced-Data Environment
Yasuda, Muneki, En, Yeo Xian, Ueno, Seishirou
In this study, we consider classification problems based on neural networks in a data-imbalanced environment. Learning from an imbalanced dataset is one of the most important and practical problems in the field of machine learning. A weighted loss function (WLF) based on a cost-sensitive approach is a well-known and effective method for imbalanced datasets. We consider a combination of WLF and batch normalization (BN) in this study. BN is considered as a powerful standard technique in the recent developments in deep learning. A simple combination of both methods leads to a size-inconsistency problem due to a mismatch between the interpretations of the effective size of the dataset in both methods. We propose a simple modification to BN, called weighted batch normalization (WBN), to correct the size-mismatch. The idea of WBN is simple and natural. Using numerical experiments, we demonstrate that our method is effective in a data-imbalanced environment.
Optimal Stopping and Effective Machine Complexity in Learning
Wang, Changfeng, Venkatesh, Santosh S., Judd, J. Stephen
We study tltt' problem of when to stop If'arning a class of feedforward networks - networks with linear outputs I1PUrOIl and fixed input weights - when they are trained with a gradient descent algorithm on a finite number of examples. Under general regularity conditions, it is shown that there a.re in general three distinct phases in the generalization performance in the learning process, and in particular, the network has hetter gt'neralization pPTformance when learning is stopped at a certain time before til(' global miniIl111lu of the empirical error is reachert. A notion of effective size of a machine is rtefil1e i and used to explain the tradeoff betwf'en the complexity of the marhine and the training error ill the learning process. The study leads nat.urally to a network size selection critt'rion, which turns Ol1t to be a generalization of Akaike's Information Criterioll for the It'arning process. It if; shown that stopping Iparning before tiJt' global minimum of the empirical error has the effect of network size splectioll. 1 INTRODUCTION The primary goal of learning in neural nets is to find a network that gives valid generalization. In achieving this goal, a central issue is the tradeoff between the training error and network complexity. This usually reduces to a problem of network size selection, which has drawn much research effort in recent years. Various principles, theories, and intuitions, including Occam's razor, statistical model selection criteria such as Akaike's Information Criterion (AIC) [11 and many others [5, 1, 10,3,111 all quantitatively support the following PAC prescription: between two machines which have the same empirical error, the machine with smaller VC-dimf'nsion generalizes better. However, it is noted that these methods or criteria do not npcpssarily If'ad to optimal (or llearly optimal) generalization performance.
Optimal Stopping and Effective Machine Complexity in Learning
Wang, Changfeng, Venkatesh, Santosh S., Judd, J. Stephen
We study tltt' problem of when to stop If'arning a class of feedforward networks - networks with linear outputs I1PUrOIl and fixed input weights - when they are trained with a gradient descent algorithm on a finite number of examples. Under general regularity conditions, it is shown that there a.re in general three distinct phases in the generalization performance in the learning process, and in particular, the network has hetter gt'neralization pPTformance when learning is stopped at a certain time before til(' global miniIl111lu of the empirical error is reachert. A notion of effective size of a machine is rtefil1e i and used to explain the tradeoff betwf'en the complexity of the marhine and the training error ill the learning process. The study leads nat.urally to a network size selection critt'rion, which turns Ol1t to be a generalization of Akaike's Information Criterioll for the It'arning process. It if; shown that stopping Iparning before tiJt' global minimum of the empirical error has the effect of network size splectioll. 1 INTRODUCTION The primary goal of learning in neural nets is to find a network that gives valid generalization. In achieving this goal, a central issue is the tradeoff between the training error and network complexity. This usually reduces to a problem of network size selection, which has drawn much research effort in recent years. Various principles, theories, and intuitions, including Occam's razor, statistical model selection criteria such as Akaike's Information Criterion (AIC) [11 and many others [5, 1, 10,3,111 all quantitatively support the following PAC prescription: between two machines which have the same empirical error, the machine with smaller VC-dimf'nsion generalizes better. However, it is noted that these methods or criteria do not npcpssarily If'ad to optimal (or llearly optimal) generalization performance.
Optimal Stopping and Effective Machine Complexity in Learning
Wang, Changfeng, Venkatesh, Santosh S., Judd, J. Stephen
We study tltt' problem of when to stop If'arning a class of feedforward networks - networks with linear outputs I1PUrOIl and fixed input weights - when they are trained with a gradient descent algorithm on a finite number of examples. Under general regularity conditions, it is shown that there a.re in general three distinct phases in the generalization performance in the learning process, and in particular, the network has hetter gt'neralization pPTformance when learning is stopped at a certain time before til(' global miniIl111lu of the empirical error is reachert. A notion of effective size of a machine is rtefil1e i and used to explain the tradeoff betwf'en the complexity of the marhine and the training error ill the learning process. The study leads nat.urally to a network size selection critt'rion, which turns Ol1t to be a generalization of Akaike's Information Criterioll for the It'arning process. It if; shown that stopping Iparning before tiJt' global minimum of the empirical error has the effect of network size splectioll. 1 INTRODUCTION The primary goal of learning in neural nets is to find a network that gives valid generalization. In achieving this goal, a central issue is the tradeoff between the training error and network complexity. This usually reduces to a problem of network size selection, which has drawn much research effort in recent years. Various principles, theories, and intuitions, including Occam's razor, statistical model selection criteria such as Akaike's Information Criterion (AIC) [11 and many others [5, 1, 10,3,111 all quantitatively support the following PAC prescription: between two machines which have the same empirical error, the machine with smaller VC-dimf'nsion generalizes better. However, it is noted that these methods or criteria do not npcpssarily If'ad to optimal (or llearly optimal) generalization performance.