
Collaborating Authors

 Wang, Changfeng


Temporal Dynamics of Generalization in Neural Networks

Neural Information Processing Systems

This paper presents a rigorous characterization of how a general nonlinear learning machine generalizes during the training process when it is trained on a random sample using a gradient descent algorithm based on reduction of training error. It is shown, in particular, that the best generalization performance occurs, in general, before the global minimum of the training error is achieved. The different roles played by the complexity of the machine class and the complexity of the specific machine in the class during learning are also precisely demarcated.

1 INTRODUCTION

In learning machines such as neural networks, two major factors that affect the 'goodness of fit' of the examples are network size (complexity) and training time. These are also the major factors that affect the generalization performance of the network. Many theoretical studies exploring the relation between generalization performance and machine complexity support the parsimony heuristics suggested by Occam's razor, to wit, that amongst machines with similar training performance one should opt for the machine of least complexity.
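
The central claim above, that generalization is typically best some time before the global minimum of the training error is reached, can be observed numerically. The sketch below is not the paper's construction: the data-generating function, network width, step size, and iteration count are arbitrary illustrative choices. It trains an over-parameterized two-layer tanh network by full-batch gradient descent on a small noisy sample, records the training error and the error on an independent test set at every step, and reports the step at which the test error is smallest.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy 1-D regression task: a smooth target plus observation noise.
# (Illustrative setup only; not taken from the paper.)
def target(x):
    return np.sin(2 * np.pi * x)

n_train, n_test = 30, 500
x_train = rng.uniform(-1, 1, size=(n_train, 1))
y_train = target(x_train) + 0.3 * rng.standard_normal((n_train, 1))
x_test = np.linspace(-1, 1, n_test).reshape(-1, 1)
y_test = target(x_test)                       # noise-free reference for generalization error

# A deliberately flexible two-layer tanh network ("general nonlinear machine").
n_hidden = 50
W1 = 0.5 * rng.standard_normal((1, n_hidden)); b1 = np.zeros(n_hidden)
W2 = 0.5 * rng.standard_normal((n_hidden, 1)); b2 = np.zeros(1)

def forward(x):
    h = np.tanh(x @ W1 + b1)
    return h, h @ W2 + b2

lr, n_steps = 0.05, 20000
train_curve, test_curve = [], []

for step in range(n_steps):
    h, pred = forward(x_train)
    err = pred - y_train
    train_curve.append(float(np.mean(err ** 2)))

    _, pred_test = forward(x_test)
    test_curve.append(float(np.mean((pred_test - y_test) ** 2)))

    # Full-batch gradient descent on the mean-squared training error.
    grad_out = 2 * err / n_train               # dL/dpred
    gW2 = h.T @ grad_out
    gb2 = grad_out.sum(axis=0)
    grad_h = (grad_out @ W2.T) * (1 - h ** 2)  # backprop through tanh
    gW1 = x_train.T @ grad_h
    gb1 = grad_h.sum(axis=0)

    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

best_step = int(np.argmin(test_curve))
print(f"training MSE: {train_curve[0]:.3f} at step 0 -> {train_curve[-1]:.3f} at step {n_steps - 1}")
print(f"held-out MSE is smallest at step {best_step} ({test_curve[best_step]:.3f}), "
      f"vs. {test_curve[-1]:.3f} at the last step")
```

On a typical run the training error decreases throughout, while the held-out error reaches its minimum well before the final step; this is the temporal pattern the paper characterizes analytically.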


Optimal Stopping and Effective Machine Complexity in Learning

Neural Information Processing Systems

We study the problem of when to stop learning a class of feedforward networks - networks with linear output neurons and fixed input weights - when they are trained with a gradient descent algorithm on a finite number of examples. Under general regularity conditions, it is shown that there are in general three distinct phases in the generalization performance in the learning process, and in particular, the network has better generalization performance when learning is stopped at a certain time before the global minimum of the empirical error is reached. A notion of effective size of a machine is defined and used to explain the tradeoff between the complexity of the machine and the training error in the learning process. The study leads naturally to a network size selection criterion, which turns out to be a generalization of Akaike's Information Criterion for the learning process. It is shown that stopping learning before the global minimum of the empirical error has the effect of network size selection.

1 INTRODUCTION

The primary goal of learning in neural nets is to find a network that gives valid generalization. In achieving this goal, a central issue is the tradeoff between the training error and network complexity. This usually reduces to a problem of network size selection, which has drawn much research effort in recent years. Various principles, theories, and intuitions, including Occam's razor and statistical model selection criteria such as Akaike's Information Criterion (AIC) [1] and many others [5, 1, 10, 3, 11], all quantitatively support the following PAC prescription: between two machines which have the same empirical error, the machine with smaller VC-dimension generalizes better. However, it is noted that these methods or criteria do not necessarily lead to optimal (or nearly optimal) generalization performance.
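
The abstract refers to Akaike's Information Criterion without stating it; for reference, the classical criterion that the paper's effective-size criterion is said to generalize is

$$\mathrm{AIC} \;=\; -2\ln\hat{L} \;+\; 2k,$$

where $\hat{L}$ is the maximized likelihood of the fitted model and $k$ is its number of free parameters. For least-squares fitting of $n$ examples under Gaussian noise this is, up to an additive constant,

$$\mathrm{AIC} \;=\; n\ln\!\left(\frac{\mathrm{SSE}}{n}\right) \;+\; 2k.$$

A plausible reading of the abstract, not a statement of the paper's formula, is that the derived criterion replaces the raw parameter count $k$ with the effective size of the machine at the stopping time, so that stopping gradient descent early acts as an implicit choice of a smaller network; the precise definition of effective size is given in the paper itself.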

