Country
PAC-Bayes & Margins
Langford, John, Shawe-Taylor, John
There are two mathematical flavors of margin bound dependent upon the weights Wi of the vote and the features Xi that the vote is taken over. The results here are of the "bll2" form. We improve on Shawe-Taylor et al. [12] and Bartlett [1] by a log(m)2 sample complexity factor and much tighter constants (1000 or unstated versus 9 or 18 as suggested by Section 2.2). In addition, the bound here covers margin errors without weakening the error-free case. Herbrich and Graepel [3] moved significantly towards the approach adopted in our paper, but the methodology adopted meant that their result does not scale well to high dimensional feature spaces as the bound here (and earlier results) do.
Conditional Models on the Ranking Poset
Lebanon, Guy, Lafferty, John D.
A distance-based conditional model on the ranking poset is presented for use in classification and ranking. The model is an extension of the Mallows model, and generalizes the classifier combination methods used by several ensemble learning algorithms, including error correcting output codes, discrete AdaBoost, logistic regression and cranking. The algebraic structure of the ranking poset leads to a simple Bayesian interpretation of the conditional model and its special cases. In addition to a unifying view, the framework suggests a probabilistic interpretation for error correcting output codes and an extension beyond the binary coding scheme.
An Impossibility Theorem for Clustering
Although the study of clustering is centered around an intuitively compelling goal, it has been very difficult to develop a unified framework for reasoning about it at a technical level, and profoundly diverse approaches to clustering abound in the research community. Here we suggest a formal perspective on the difficulty in finding such a unification, in the form of an impossibility theorem: for a set of three simple properties, we show that there is no clustering function satisfying all three. Relaxations of these properties expose some of the interesting (and unavoidable) tradeoffs at work in well-studied clustering techniques such as single-linkage, sum-of-pairs, k-means, and k-median.
A Note on the Representational Incompatibility of Function Approximation and Factored Dynamics
Allender, Eric, Arora, Sanjeev, Kearns, Michael, Moore, Cristopher, Russell, Alexander
We establish a new hardness result that shows that the difficulty of planning in factored Markov decision processes is representational rather than just computational. More precisely, we give a fixed family of factored MDPs with linear rewards whose optimal policies and value functions simply cannot be represented succinctly in any standard parametric form. Previous hardness results indicated that computing good policies from the MDP parameters was difficult, but left open the possibility of succinct function approximation for any fixed factored MDP. Our result applies even to policies which yield a polynomially poor approximation to the optimal value, and highlights interesting connections with the complexity class of Arthur-Merlin games.
On the Complexity of Learning the Kernel Matrix
Bousquet, Olivier, Herrmann, Daniel
We investigate data based procedures for selecting the kernel when learning with Support Vector Machines. We provide generalization error bounds by estimating the Rademacher complexities of the corresponding function classes. In particular we obtain a complexity bound for function classes induced by kernels with given eigenvectors, i.e., we allow to vary the spectrum and keep the eigenvectors fix. This bound is only a logarithmic factor bigger than the complexity of the function class induced by a single kernel. However, optimizing the margin over such classes leads to overfitting. We thus propose a suitable way of constraining the class. We use an efficient algorithm to solve the resulting optimization problem, present preliminary experimental results, and compare them to an alignment-based approach.
The Effect of Singularities in a Learning Machine when the True Parameters Do Not Lie on such Singularities
Watanabe, Sumio, Amari, Shun-ichi
A lot of learning machines with hidden variables used in information science have singularities in their parameter spaces. At singularities, the Fisher information matrix becomes degenerate, resulting that the learning theory of regular statistical models does not hold. Recently, it was proven that, if the true parameter is contained in singularities, then the coefficient of the Bayes generalization error is equal to the pole of the zeta function of the Kullback information.
Scaling of Probability-Based Optimization Algorithms
Population-based Incremental Learning is shown require very sensitive scaling of its learning rate. The learning rate must scale with the system size in a problem-dependent way. This is shown in two problems: the needle-in-a haystack, in which the learning rate must vanish exponentially in the system size, and in a smooth function in which the learning rate must vanish like the square root of the system size. Two methods are proposed for removing this sensitivity. A learning dynamics which obeys detailed balance is shown to give consistent performance over the entire range of learning rates. An analog of mutation is shown to require a learning rate which scales as the inverse system size, but is problem independent.
Information Diffusion Kernels
Lebanon, Guy, Lafferty, John D.
A new family of kernels for statistical learning is introduced that exploits the geometric structure of statistical models. Based on the heat equation on the Riemannian manifold defined by the Fisher information metric, information diffusion kernels generalize the Gaussian kernel of Euclidean space, and provide a natural way of combining generative statistical modeling with nonparametric discriminative learning. As a special case, the kernels give a new approach to applying kernel-based learning algorithms to discrete data. Bounds on covering numbers for the new kernels are proved using spectral theory in differential geometry, and experimental results are presented for text classification.
Dyadic Classification Trees via Structural Risk Minimization
Classification trees are one of the most popular types of classifiers, with ease of implementation and interpretation being among their attractive features. Despite the widespread use of classification trees, theoretical analysis of their performance is scarce. In this paper, we show that a new family of classification trees, called dyadic classification trees (DCTs), are near optimal (in a minimax sense) for a very broad range of classification problems. This demonstrates that other schemes (e.g., neural networks, support vector machines) cannot perform significantly better than DCTs in many cases. We also show that this near optimal performance is attained with linear (in the number of training data) complexity growing and pruning algorithms. Moreover, the performance of DCTs on benchmark datasets compares favorably to that of standard CART, which is generally more computationally intensive and which does not possess similar near optimality properties. Our analysis stems from theoretical results on structural risk minimization, on which the pruning rule for DCTs is based.