Tight Bounds for the VC-Dimension of Piecewise Polynomial Networks
ASakurai@jaist.ac.jp Abstract: O(ws(s log d + log(dqh/s))) and O(ws((h/s) log q) + log(dqh/s)) are upper bounds for the VC-dimension of a set of neural networks of units with piecewise polynomial activation functions, where s is the depth of the network, h is the number of hidden units, w is the number of adjustable parameters, q is the maximum number of polynomial segments of the activation function, and d is the maximum degree of the polynomials; also, Ω(ws log(dqh/s)) is a lower bound for the VC-dimension of such a network set. These bounds are tight for the cases s = Θ(h) and constant s. For the special case q = 1, the VC-dimension is Θ(ws log d). 1 Introduction: In spite of its importance, we had been unable to obtain VC-dimension values for practical types of networks, until fairly tight upper and lower bounds were obtained ([6], [8], [9], and [10]) for linear threshold element networks in which all elements perform a threshold function on a weighted sum of inputs. This is mainly because the differentiability of the functions is needed to perform backpropagation or other learning algorithms. Unfortunately, explicit bounds obtained so far for the VC-dimension of sigmoidal networks exhibit large gaps (O(w^2 h^2) ([3]), Ω(w log h) for bounded depth and Ω(wh) for unbounded depth) and are hard to improve. For the piecewise linear case, Maass obtained a result that the VC-dimension is O(w^2 log q), where q is the number of linear pieces of the function ([5]).
Mean Field Methods for Classification with Gaussian Processes
We discuss the application of TAP mean field methods known from the Statistical Mechanics of disordered systems to Bayesian classification models with Gaussian processes. In contrast to previous approaches, no knowledge about the distribution of inputs is needed. Simulation results for the Sonar data set are given. Gaussian process models have recently been introduced into the Neural Computation community (Neal 1996, Williams & Rasmussen 1996, Mackay 1997). If we assume fields with zero prior mean, the statistics of h are entirely defined by the second order correlations C(s, s') = E[h(s)h(s')], where E denotes expectations with respect to the prior. Interesting examples are the covariance functions C(s, s') given in equations (1) and (2). The choice (1) can be motivated as a limit of a two-layered neural network with infinitely many hidden units with factorizable input-hidden weight priors (Williams 1997).
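To make the definition C(s, s') = E[h(s)h(s')] concrete, here is a minimal sketch that estimates the prior covariance by sampling. It assumes a hypothetical linear field h(s) = w·s with w drawn from a standard normal prior, for which the exact covariance is the dot product s·s'. This field is an illustrative assumption and is not one of the covariance choices (1) or (2) from the abstract.

```python
import random

def empirical_covariance(s, s_prime, n_samples=20000, seed=0):
    """Monte Carlo estimate of E[h(s) h(s')] for the assumed field h(s) = w . s."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        # Draw one function from the zero-mean prior: w ~ N(0, I).
        w = [rng.gauss(0.0, 1.0) for _ in s]
        h_s = sum(wi * si for wi, si in zip(w, s))
        h_sp = sum(wi * si for wi, si in zip(w, s_prime))
        total += h_s * h_sp
    return total / n_samples

s, s_prime = (1.0, 2.0), (0.5, -1.0)
estimate = empirical_covariance(s, s_prime)
# For this assumed linear field the exact covariance is the dot product s . s'.
exact = sum(a * b for a, b in zip(s, s_prime))
```

The estimate should approach the exact value as n_samples grows, which illustrates that a zero-mean Gaussian field is characterized by its second order correlations.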
General Bounds on Bayes Errors for Regression with Gaussian Processes
Opper, Manfred, Vivarelli, Francesco
Based on a simple convexity lemma, we develop bounds for different types of Bayesian prediction errors for regression with Gaussian processes. The basic bounds are formulated for a fixed training set. Simpler expressions are obtained for sampling from an input distribution which equals the weight function of the covariance kernel, yielding asymptotically tight results. The results are compared with numerical experiments.
On the Optimality of Incremental Neural Network Algorithms
We study the approximation of functions by two-layer feedforward neural networks, focusing on incremental algorithms which greedily add units, estimating single unit parameters at each stage. As opposed to standard algorithms for fixed architectures, the optimization at each stage is performed over a small number of parameters, mitigating many of the difficult numerical problems inherent in high-dimensional nonlinear optimization. We establish upper bounds on the error incurred by the algorithm, when approximating functions from the Sobolev class, thereby extending previous results which only provided rates of convergence for functions in certain convex hulls of functional spaces. By comparing our results to recently derived lower bounds, we show that the greedy algorithms are nearly optimal. Combined with estimation error results for greedy algorithms, a strong case can be made for this type of approach.
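The idea of adding one unit at a time while optimizing only that unit's parameters can be sketched as follows. This is a simplified residual-fitting variant written for illustration, not the algorithm analysed in the paper. The tanh activation, the candidate grid of (a, b) values, and the name greedy_fit are all assumptions.

```python
import math

def greedy_fit(xs, ys, n_units, candidates):
    """Greedily add tanh units, fitting a single unit's parameters per stage."""
    residual = list(ys)
    units = []   # each entry is (c, a, b) for the unit c * tanh(a * x + b)
    errors = []  # mean squared error after each stage
    for _ in range(n_units):
        best = None
        for a, b in candidates:
            g = [math.tanh(a * x + b) for x in xs]
            gg = sum(v * v for v in g)
            if gg == 0.0:
                continue
            # Least-squares output weight for this candidate on the residual.
            c = sum(r * v for r, v in zip(residual, g)) / gg
            err = sum((r - c * v) ** 2 for r, v in zip(residual, g))
            if best is None or err < best[0]:
                best = (err, c, a, b, g)
        err, c, a, b, g = best
        units.append((c, a, b))
        # Earlier units stay fixed; only the residual is updated.
        residual = [r - c * v for r, v in zip(residual, g)]
        errors.append(err / len(xs))
    return units, errors

xs = [i / 20.0 for i in range(-40, 41)]
ys = [math.sin(2.0 * x) for x in xs]
candidates = [(a, b) for a in (0.5, 1.0, 2.0, 4.0)
              for b in (-2.0, -1.0, 0.0, 1.0, 2.0)]
units, errors = greedy_fit(xs, ys, n_units=6, candidates=candidates)
```

Each stage searches over only two unit parameters plus one output weight, so the training error cannot increase from one stage to the next.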
Direct Optimization of Margins Improves Generalization in Combined Classifiers
Mason, Llew, Bartlett, Peter L., Baxter, Jonathan
The dark curve is AdaBoost, the light curve is DOOM. DOOM sacrifices significant training error for improved test error (horizontal marks on margin 0 line). 1 Introduction: Many learning algorithms for pattern classification minimize some cost function of the training data, with the aim of minimizing error (the probability of misclassifying an example). One example of such a cost function is simply the classifier's error on the training data.
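As a small illustration of training error viewed as a cost function of margins, the sketch below computes the margins of a weighted vote of base classifiers and counts the examples at or below margin 0. It assumes the usual normalized voting margin y * sum(alpha_i * h_i(x)) / sum(|alpha_i|) with labels in {-1, +1}. The weights and predictions are made-up toy values, and this is not the DOOM algorithm itself.

```python
def margins(alphas, base_predictions, labels):
    """Normalized voting margin of each example under a weighted vote."""
    total = sum(abs(a) for a in alphas)
    result = []
    for j, y in enumerate(labels):
        vote = sum(a * preds[j] for a, preds in zip(alphas, base_predictions))
        result.append(y * vote / total)
    return result

def training_error(margin_values):
    """Fraction of examples with margin <= 0, i.e. misclassified or tied."""
    return sum(1 for m in margin_values if m <= 0) / len(margin_values)

# Toy data (hypothetical): three base classifiers, four examples.
alphas = [0.5, 0.3, 0.2]
base_predictions = [
    [1, 1, -1, -1],
    [1, -1, -1, 1],
    [-1, 1, 1, -1],
]
labels = [1, 1, -1, 1]

m = margins(alphas, base_predictions, labels)  # approx. [0.6, 0.4, 0.6, -0.4]
err = training_error(m)                        # 0.25
```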
Optimizing Classifiers for Imbalanced Training Sets
Karakoulas, Grigoris I., Shawe-Taylor, John
Following recent results [9, 8] showing the importance of the fat-shattering dimension in explaining the beneficial effect of a large margin on generalization performance, the current paper investigates the implications of these results for the case of imbalanced datasets and develops two approaches to setting the threshold. The approaches are incorporated into ThetaBoost, a boosting algorithm for dealing with unequal loss functions. The performance of ThetaBoost and of the two approaches is tested experimentally.
Unsupervised and Supervised Clustering: The Mutual Information between Parameters and Observations
Herschkowitz, Didier, Nadal, Jean-Pierre
Recent works in parameter estimation and neural coding have demonstrated that optimal performance is related to the mutual information between parameters and data. We consider the mutual information in the case where the dependency in the parameter (a vector θ) of the conditional p.d.f. of each observation (a vector
Linear Hinge Loss and Average Margin
Gentile, Claudio, Warmuth, Manfred K.
We describe a unifying method for proving relative loss bounds for online linear threshold classification algorithms, such as the Perceptron and the Winnow algorithms. For classification problems the discrete loss is used, i.e., the total number of prediction mistakes. We introduce a continuous loss function, called the "linear hinge loss", that can be employed to derive the updates of the algorithms. We first prove bounds w.r.t. the linear hinge loss and then convert them to the discrete loss. We introduce a notion of "average margin" of a set of examples. We show how relative loss bounds based on the linear hinge loss can be converted to relative loss bounds in terms of the discrete loss using the average margin.
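The discrete loss mentioned above can be made concrete with a minimal online Perceptron that counts its prediction mistakes. This is the textbook Perceptron update with a fixed learning rate and no threshold term, assumed here for illustration. It does not implement the linear hinge loss or the bounds from the paper, and the example data are invented.

```python
def perceptron_mistakes(examples, n_features, lr=1.0):
    """Run the Perceptron online; return final weights and the discrete loss."""
    w = [0.0] * n_features
    mistakes = 0
    for x, y in examples:
        score = sum(wi * xi for wi, xi in zip(w, x))
        y_hat = 1 if score >= 0 else -1
        if y_hat != y:
            # Discrete loss: one unit per prediction mistake.
            mistakes += 1
            # Additive update, applied only on mistakes.
            w = [wi + lr * y * xi for wi, xi in zip(w, x)]
    return w, mistakes

# Toy linearly separable stream (hypothetical), presented twice.
stream = [((1.0, 1.0), 1), ((-1.0, -1.0), -1),
          ((1.0, 0.5), 1), ((-0.5, -1.0), -1)] * 2
weights, discrete_loss = perceptron_mistakes(stream, n_features=2)
```

On this stream the algorithm makes a single mistake (on the second example) and then classifies every later example correctly.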