Statistical Learning
Proper Complex Gaussian Processes for Regression
Boloix-Tortosa, Rafael, Payán-Somet, F. Javier, Arias-de-Reyna, Eva, Murillo-Fuentes, Juan José
Complex-valued signals are used in the modeling of many systems in engineering and science, hence being of fundamental interest. Often, random complex-valued signals are considered to be proper. A proper complex random variable or process is uncorrelated with its complex conjugate. This assumption is a good model of the underlying physics in many problems, and simplifies the computations. While linear processing and neural networks have been widely studied for these signals, the development of complex-valued nonlinear kernel approaches remains an open problem. In this paper we propose Gaussian processes for regression as a framework to develop 1) a solution for proper complex-valued kernel regression and 2) the design of the reproducing kernel for complex-valued inputs, using the convolutional approach for cross-covariances. In this design we pay attention to preserve, in the complex domain, the measure of similarity between near inputs. The hyperparameters of the kernel are learned maximizing the marginal likelihood using Wirtinger derivatives. Besides, the approach is connected to the multiple output learning scenario. In the experiments included, we first solve a proper complex Gaussian process where the cross-covariance does not cancel, a challenging scenario when dealing with proper complex signals. Then we successfully use these novel results to solve some problems previously proposed in the literature as benchmarks, reporting a remarkable improvement in the estimation error.
Avoiding Confusion between Predictors and Inhibitors in Value Function Approximation
Connor, Patrick C., Trappenberg, Thomas P.
In reinforcement learning, the goal is to seek rewards and avoid punishments. A simple scalar captures the value of a state or of taking an action, where expected future rewards increase and punishments decrease this quantity. Naturally an agent should learn to predict this quantity to take beneficial actions, and many value function approximators exist for this purpose. In the present work, however, we show how value function approximators can cause confusion between predictors of an outcome of one valence (e.g., a signal of reward) and the inhibitor of the opposite valence (e.g., a signal canceling expectation of punishment). We show this to be a problem for both linear and non-linear value function approximators, especially when the amount of data (or experience) is limited. We propose and evaluate a simple resolution: to instead predict reward and punishment values separately, and rectify and add them to get the value needed for decision making. We evaluate several function approximators in this slightly different value function approximation architecture and show that this approach is able to circumvent the confusion and thereby achieve lower value-prediction errors.
On the Inductive Bias of Dropout
Helmbold, David P., Long, Philip M.
Dropout is a simple but effective technique for learning in neural networks and other settings. A sound theoretical understanding of dropout is needed to determine when dropout should be applied and how to use it most effectively. In this paper we continue the exploration of dropout as a regularizer pioneered by Wager, et.al. We focus on linear classification where a convex proxy to the misclassification loss (i.e. the logistic loss used in logistic regression) is minimized. We show: (a) when the dropout-regularized criterion has a unique minimizer, (b) when the dropout-regularization penalty goes to infinity with the weights, and when it remains bounded, (c) that the dropout regularization can be non-monotonic as individual weights increase from 0, and (d) that the dropout regularization penalty may not be convex. This last point is particularly surprising because the combination of dropout regularization with any convex loss proxy is always a convex function. In order to contrast dropout regularization with $L_2$ regularization, we formalize the notion of when different sources are more compatible with different regularizers. We then exhibit distributions that are provably more compatible with dropout regularization than $L_2$ regularization, and vice versa. These sources provide additional insight into how the inductive biases of dropout and $L_2$ regularization differ. We provide some similar results for $L_1$ regularization.
Particle Gibbs for Bayesian Additive Regression Trees
Lakshminarayanan, Balaji, Roy, Daniel M., Teh, Yee Whye
Additive regression trees are flexible non-parametric models and popular off-the-shelf tools for real-world non-linear regression. In application domains, such as bioinformatics, where there is also demand for probabilistic predictions with measures of uncertainty, the Bayesian additive regression trees (BART) model, introduced by Chipman et al. (2010), is increasingly popular. As data sets have grown in size, however, the standard Metropolis-Hastings algorithms used to perform inference in BART are proving inadequate. In particular, these Markov chains make local changes to the trees and suffer from slow mixing when the data are high-dimensional or the best fitting trees are more than a few layers deep. We present a novel sampler for BART based on the Particle Gibbs (PG) algorithm (Andrieu et al., 2010) and a top-down particle filtering algorithm for Bayesian decision trees (Lakshminarayanan et al., 2013). Rather than making local changes to individual trees, the PG sampler proposes a complete tree to fit the residual. Experiments show that the PG sampler outperforms existing samplers in many settings.
On the Predictive Properties of Binary Link Functions
This paper provides a theoretical and computational justification of the long held claim that of the similarity of the probit and logit link functions often used in binary classification. Despite this widespread recognition of the strong similarities between these two link functions, very few (if any) researchers have dedicated time to carry out a formal study aimed at establishing and characterizing firmly all the aspects of the similarities and differences. This paper proposes a definition of both structural and predictive equivalence of link functions-based binary regression models, and explores the various ways in which they are either similar or dissimilar. From a predictive analytics perspective, it turns out that not only are probit and logit perfectly predictively concordant, but the other link functions like cauchit and complementary log log enjoy very high percentage of predictive equivalence. Throughout this paper, simulated and real life examples demonstrate all the equivalence results that we prove theoretically.
The Responsibility Weighted Mahalanobis Kernel for Semi-Supervised Training of Support Vector Machines for Classification
Reitmaier, Tobias, Sick, Bernhard
Kernel functions in support vector machines (SVM) are needed to assess the similarity of input samples in order to classify these samples, for instance. Besides standard kernels such as Gaussian (i.e., radial basis function, RBF) or polynomial kernels, there are also specific kernels tailored to consider structure in the data for similarity assessment. In this article, we will capture structure in data by means of probabilistic mixture density models, for example Gaussian mixtures in the case of real-valued input spaces. From the distance measures that are inherently contained in these models, e.g., Mahalanobis distances in the case of Gaussian mixtures, we derive a new kernel, the responsibility weighted Mahalanobis (RWM) kernel. Basically, this kernel emphasizes the influence of model components from which any two samples that are compared are assumed to originate (that is, the "responsible" model components). We will see that this kernel outperforms the RBF kernel and other kernels capturing structure in data (such as the LAP kernel in Laplacian SVM) in many applications where partially labeled data are available, i.e., for semi-supervised training of SVM. Other key advantages are that the RWM kernel can easily be used with standard SVM implementations and training algorithms such as sequential minimal optimization, and heuristics known for the parametrization of RBF kernels in a C-SVM can easily be transferred to this new kernel. Properties of the RWM kernel are demonstrated with 20 benchmark data sets and an increasing percentage of labeled samples in the training data.
Clustering by Descending to the Nearest Neighbor in the Delaunay Graph Space
In our previous works, we proposed a physically-inspired rule to organize the data points into an in-tree (IT) structure, in which some undesired edges are allowed to occur. By removing those undesired or redundant edges, this IT structure is divided into several separate parts, each representing one cluster. In this work, we seek to prevent the undesired edges from arising at the source. Before using the physically-inspired rule, data points are at first organized into a proximity graph which restricts each point to select the optimal directed neighbor just among its neighbors. Consequently, separated in-trees or clusters automatically arise, without redundant edges requiring to be removed.
Model Selection for Topic Models via Spectral Decomposition
Cheng, Dehua, He, Xinran, Liu, Yan
Topic models have achieved significant successes in analyzing large-scale text corpus. In practical applications, we are always confronted with the challenge of model selection, i.e., how to appropriately set the number of topics. Following recent advances in topic model inference via tensor decomposition, we make a first attempt to provide theoretical analysis on model selection in latent Dirichlet allocation. Under mild conditions, we derive the upper bound and lower bound on the number of topics given a text collection of finite size. Experimental results demonstrate that our bounds are accurate and tight. Furthermore, using Gaussian mixture model as an example, we show that our methodology can be easily generalized to model selection analysis for other latent models.
Cost-Sensitive Support Vector Machines
Masnadi-Shirazi, Hamed, Vasconcelos, Nuno, Iranmehr, Arya
A new procedure for learning cost-sensitive SVM(CS-SVM) classifiers is proposed. The SVM hinge loss is extended to the cost sensitive setting, and the CS-SVM is derived as the minimizer of the associated risk. The extension of the hinge loss draws on recent connections between risk minimization and probability elicitation. These connections are generalized to cost-sensitive classification, in a manner that guarantees consistency with the cost-sensitive Bayes risk, and associated Bayes decision rule. This ensures that optimal decision rules, under the new hinge loss, implement the Bayes-optimal cost-sensitive classification boundary. Minimization of the new hinge loss is shown to be a generalization of the classic SVM optimization problem, and can be solved by identical procedures. The dual problem of CS-SVM is carefully scrutinized by means of regularization theory and sensitivity analysis and the CS-SVM algorithm is substantiated. The proposed algorithm is also extended to cost-sensitive learning with example dependent costs. The minimum cost sensitive risk is proposed as the performance measure and is connected to ROC analysis through vector optimization. The resulting algorithm avoids the shortcomings of previous approaches to cost-sensitive SVM design, and is shown to have superior experimental performance on a large number of cost sensitive and imbalanced datasets.
Polynomial-Chaos-based Kriging
Schoebi, R., Sudret, B., Wiart, J.
Computer simulation has become the standard tool in many engineering fields for designing and optimizing systems, as well as for assessing their reliability. To cope with demanding analysis such as optimization and reliability, surrogate models (a.k.a meta-models) have been increasingly investigated in the last decade. Polynomial Chaos Expansions (PCE) and Kriging are two popular non-intrusive meta-modelling techniques. PCE surrogates the computational model with a series of orthonormal polynomials in the input variables where polynomials are chosen in coherency with the probability distributions of those input variables. On the other hand, Kriging assumes that the computer model behaves as a realization of a Gaussian random process whose parameters are estimated from the available computer runs, i.e. input vectors and response values. These two techniques have been developed more or less in parallel so far with little interaction between the researchers in the two fields. In this paper, PC-Kriging is derived as a new non-intrusive meta-modeling approach combining PCE and Kriging. A sparse set of orthonormal polynomials (PCE) approximates the global behavior of the computational model whereas Kriging manages the local variability of the model output. An adaptive algorithm similar to the least angle regression algorithm determines the optimal sparse set of polynomials. PC-Kriging is validated on various benchmark analytical functions which are easy to sample for reference results. From the numerical investigations it is concluded that PC-Kriging performs better than or at least as good as the two distinct meta-modeling techniques. A larger gain in accuracy is obtained when the experimental design has a limited size, which is an asset when dealing with demanding computational models.