On the Generalization of the C-Bound to Structured Output Ensemble Methods
Laviolette, François, Morvant, Emilie, Ralaivola, Liva, Roy, Jean-Francis
It is well known that learning predictive models capable of dealing with outputs that are richer than binary outputs (e.g., multiclass or multilabel) and for which theoretical guarantees exist is still an area of intensive investigation. From a practical standpoint, many relaxations for learning with complex outputs have been devised. A common approach is to decompose the output space into "simpler" spaces so that the learning problem at hand can be reduced to a few easier (i.e., binary) learning tasks. For instance, this is the idea behind Error-Correcting Output Codes (Dietterich & Bakiri, 1995), which make it possible to reduce multiclass or multilabel problems to binary classification tasks, e.g., (Allwein et al., 2001; Mroueh et al., 2012; Read et al., 2011; Tsoumakas & Vlahavas, 2007; Zhang & Schneider, 2012). In our work, we study the problem of complex output prediction by focusing on prediction functions that take the form of a weighted majority vote over a set of complex output classifiers (or voters). Recall that ensemble methods can all be seen as majority vote learning procedures (Dietterich, 2000; Re & Valentini, 2012). Methods such as Bagging (Breiman, 1996), Boosting (Schapire & Singer, 1999) and Random Forests (Breiman, 2001) are representative voting methods. Cortes et al. (2014) have proposed various ensemble methods for the structured output prediction framework. Note that majority votes are also central to the Bayesian approach (Gelman et al., 2004) with the notion of Bayesian model averaging (Domingos, 2000; Haussler et al., 1994). Most kernel-based predictors, such as Support Vector Machines (Boser et al., 1992; Cortes & Vapnik, 1995), may be viewed as weighted majority votes as well: for binary classification, where the predicted class for some input x is computed as the sign of
Probabilistic Curve Learning: Coulomb Repulsion and the Electrostatic Gaussian Process
Learning of low dimensional structure in multidimensional data is a canonical problem in machine learning. One common approach is to suppose that the observed data are close to a lower-dimensional smooth manifold. There is a rich variety of manifold learning methods available, which allow mapping of data points to the manifold. However, there is a clear lack of probabilistic methods that allow learning of the manifold along with the generative distribution of the observed data. The best attempt is the Gaussian process latent variable model (GP-LVM), but identifiability issues lead to poor performance. We solve these issues by proposing a novel Coulomb repulsive process (Corp) for locations of points on the manifold, inspired by physical models of electrostatic interactions among particles. Combining this process with a GP prior for the mapping function yields a novel electrostatic GP (electroGP) process. Focusing on the simple case of a one-dimensional manifold, we develop efficient inference algorithms, and illustrate substantially improved performance in a variety of experiments including filling in missing frames in video.
First-order regret bounds for combinatorial semi-bandits
We consider the problem of online combinatorial optimization under semi-bandit feedback, where a learner has to repeatedly pick actions from a combinatorial decision set in order to minimize the total losses associated with its decisions. After making each decision, the learner observes the losses associated with its action, but not other losses. For this problem, there are several learning algorithms that guarantee that the learner's expected regret grows as $\widetilde{O}(\sqrt{T})$ with the number of rounds $T$. In this paper, we propose an algorithm that improves this scaling to $\widetilde{O}(\sqrt{{L_T^*}})$, where $L_T^*$ is the total loss of the best action. Our algorithm is among the first to achieve such guarantees in a partial-feedback scheme, and the first one to do so in a combinatorial setting.
Kernel-Based Just-In-Time Learning for Passing Expectation Propagation Messages
Jitkrittum, Wittawat, Gretton, Arthur, Heess, Nicolas, Eslami, S. M. Ali, Lakshminarayanan, Balaji, Sejdinovic, Dino, Szabó, Zoltán
We propose an efficient nonparametric strategy for learning a message operator in expectation propagation (EP), which takes as input the set of incoming messages to a factor node, and produces an outgoing message as output. This learned operator replaces the multivariate integral required in classical EP, which may not have an analytic expression. We use kernel-based regression, which is trained on a set of probability distributions representing the incoming messages, and the associated outgoing messages. The kernel approach has two main advantages: first, it is fast, as it is implemented using a novel two-layer random feature representation of the input message distributions; second, it has principled uncertainty estimates, and can be cheaply updated online, meaning it can request and incorporate new training data when it encounters inputs on which it is uncertain. In experiments, our approach is able to solve learning problems where a single message operator is required for multiple, substantially different data sets (logistic regression for a variety of classification problems), where it is essential to accurately assess uncertainty and to efficiently and robustly update the message operator.
Fine-Grained Visual Categorization via Multi-stage Metric Learning
Qian, Qi, Jin, Rong, Zhu, Shenghuo, Lin, Yuanqing
Fine-grained visual categorization (FGVC) aims to categorize objects into subordinate classes instead of basic classes. One major challenge in FGVC is the co-occurrence of two issues: 1) many subordinate classes are highly correlated and are difficult to distinguish, and 2) there is large intra-class variation (e.g., due to object pose). This paper proposes to explicitly address these two issues via distance metric learning (DML). DML addresses the first issue by learning an embedding so that data points from the same class are pulled together while those from different classes are pushed apart; it addresses the second issue by allowing the flexibility that only a portion of the neighbors (not all data points) from the same class need to be pulled together. However, the feature representation of an image is often high dimensional, and DML is known to have difficulty with high dimensional feature vectors since it requires $\mathcal{O}(d^2)$ for storage and $\mathcal{O}(d^3)$ for optimization. To this end, we propose a multi-stage metric learning framework that divides the large-scale high dimensional learning problem into a series of simple subproblems, achieving $\mathcal{O}(d)$ computational complexity. The empirical study with FGVC benchmark datasets verifies that our method is both effective and efficient compared to the state-of-the-art FGVC approaches.
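To make the $\mathcal{O}(d^2)$ storage cost in the abstract above concrete, here is a minimal sketch. It assumes the common Mahalanobis form of a learned metric, $(x-y)^\top M (x-y)$ with a $d \times d$ matrix $M$; the abstract does not spell out this form, and the function names are hypothetical. It illustrates a full-matrix metric only, not the multi-stage method.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def mahalanobis_sq(x: Sequence[float], y: Sequence[float], M: Matrix) -> float:
    """Squared distance (x - y)^T M (x - y) under a full d x d metric M.

    Assumption for illustration: the learned metric is a dense matrix M.
    """
    diff = [a - b for a, b in zip(x, y)]
    d = len(diff)
    # Matrix-vector product M @ diff, written out explicitly.
    m_diff = [sum(M[i][j] * diff[j] for j in range(d)) for i in range(d)]
    return sum(di * mi for di, mi in zip(diff, m_diff))


def num_metric_parameters(d: int) -> int:
    """Number of entries in a dense d x d metric: this is the O(d^2) storage."""
    return d * d
```

For example, with `d = 1000` features a dense metric already has one million entries, which is why a full-matrix approach becomes costly for high dimensional image representations.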
Understanding Random Forests: From Theory to Practice
Data analysis and machine learning have become an integral part of the modern scientific methodology, offering automated procedures for the prediction of a phenomenon based on past observations, unraveling underlying patterns in data and providing insights about the problem. Yet, one should be careful not to use machine learning as a black-box tool, but rather to consider it as a methodology, with a rational thought process that is entirely dependent on the problem under study. In particular, the use of algorithms should ideally require a reasonable understanding of their mechanisms, properties and limitations, in order to better apprehend and interpret their results. Accordingly, the goal of this thesis is to provide an in-depth analysis of random forests, consistently calling into question each and every part of the algorithm, in order to shed new light on its learning capabilities, inner workings and interpretability. The first part of this work studies the induction of decision trees and the construction of ensembles of randomized trees, motivating their design and purpose whenever possible. Our contributions follow with an original complexity analysis of random forests, showing their good computational performance and scalability, along with an in-depth discussion of their implementation details, as contributed within Scikit-Learn. In the second part of this work, we analyse and discuss the interpretability of random forests through variable importance measures. The core of our contributions lies in the theoretical characterization of the Mean Decrease of Impurity variable importance measure, from which we prove and derive some of its properties in the case of multiway totally randomized trees and in asymptotic conditions. As a consequence of this work, our analysis demonstrates that variable importances [...].
Bayesian Network Constraint-Based Structure Learning Algorithms: Parallel and Optimised Implementations in the bnlearn R Package
It is well known in the literature that the problem of learning the structure of Bayesian networks is very hard to tackle: its computational complexity is super-exponential in the number of nodes in the worst case and polynomial in most real-world scenarios. Efficient implementations of score-based structure learning benefit from past and current research in optimisation theory, which can be adapted to the task by using the network score as the objective function to maximise. This is not true for approaches based on conditional independence tests, called constraint-based learning algorithms. The only optimisation in widespread use, backtracking, leverages the symmetries implied by the definitions of neighbourhood and Markov blanket. In this paper we illustrate how backtracking is implemented in recent versions of the bnlearn R package, and how it degrades the stability of Bayesian network structure learning for little gain in terms of speed. As an alternative, we describe a software architecture and framework that can be used to parallelise constraint-based structure learning algorithms (also implemented in bnlearn) and we demonstrate its performance using four reference networks and two real-world data sets from genetics and systems biology. We show that on modern multi-core or multiprocessor hardware parallel implementations are preferable over backtracking, which was developed when single-processor machines were the norm.
Signal Recovery on Graphs: Variation Minimization
Chen, Siheng, Sandryhaila, Aliaksei, Moura, José M. F., Kovačević, Jelena
We consider the problem of signal recovery on graphs, where graphs model data with complex structure as signals on a graph. Graph signal recovery implies recovery of one or multiple smooth graph signals from noisy, corrupted, or incomplete measurements. We propose a graph signal model and formulate signal recovery as a corresponding optimization problem. We provide a general solution by using the alternating direction method of multipliers. We next show how signal inpainting, matrix completion, robust principal component analysis, and anomaly detection all relate to graph signal recovery, and provide corresponding specific solutions and theoretical analysis. Finally, we validate the proposed methods on real-world recovery problems, including online blog classification, bridge condition identification, temperature estimation, recommender systems, and expert opinion combination of online blog classification.
Coactive Learning
Shivaswamy, Pannaga, Joachims, Thorsten
We propose Coactive Learning as a model of interaction between a learning system and a human user, where both have the common goal of providing results of maximum utility to the user. Interactions in the Coactive Learning model take the following form: at each step, the system (e.g. search engine) receives a context (e.g. query) and predicts an object (e.g. ranking); the user responds by correcting the system if necessary, providing a slightly improved but not necessarily optimal object as feedback. We argue that such preference feedback can be inferred in large quantity from observable user behavior (e.g., clicks in web search), unlike the optimal feedback required in the expert model or the cardinal valuations required for bandit learning. Despite the relaxed requirements for the feedback, we show that it is possible to adapt many existing online learning algorithms to the coactive framework. In particular, we provide algorithms that achieve square root regret in terms of cardinal utility, even though the learning algorithm never observes cardinal utility values directly. We also provide an algorithm with logarithmic regret in the case of strongly convex loss functions. An extensive empirical study demonstrates the applicability of our model and algorithms on a movie recommendation task, as well as ranking for web search.
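The interaction protocol in the abstract above can be sketched as a simple loop. This is only an illustration under stated assumptions: utility is taken to be linear in a feature vector, the learner uses a perceptron-style update (adding the feedback features and subtracting the predicted ones), and the user is simulated by returning the least-improved candidate that is still strictly better. None of these specifics appear in the abstract, and all names are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def predict(w: Vector, candidates: Sequence[Vector]) -> int:
    """Index of the candidate object with the highest estimated utility."""
    return max(range(len(candidates)), key=lambda i: dot(w, candidates[i]))


def perceptron_update(w: Vector, predicted: Vector, feedback: Vector) -> Vector:
    """Assumed update: move w toward the feedback and away from the prediction."""
    return [wi + fb - pr for wi, fb, pr in zip(w, feedback, predicted)]


def simulate(true_w: Vector, rounds: Sequence[Sequence[Vector]]) -> Vector:
    """Run the coactive loop; each round supplies feature vectors of candidates.

    The learner never sees utility values, only which object the user prefers.
    """
    w = [0.0] * len(true_w)
    for candidates in rounds:
        i = predict(w, candidates)
        shown_utility = dot(true_w, candidates[i])
        # Simulated user: any strictly better object counts as feedback;
        # pick the one that improves the least ("slightly improved").
        better = [k for k in range(len(candidates))
                  if dot(true_w, candidates[k]) > shown_utility]
        if better:
            j = min(better, key=lambda k: dot(true_w, candidates[k]))
            w = perceptron_update(w, candidates[i], candidates[j])
    return w
```

For instance, `simulate([1.0, 0.0], [[[0.0, 1.0], [1.0, 0.0]]] * 3)` corrects the learner after the first round, after which it keeps predicting the object the simulated user prefers.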
Belief Flows for Robust Online Learning
Ortega, Pedro A., Crammer, Koby, Lee, Daniel D.
This paper introduces a new probabilistic model for online learning which dynamically incorporates information from stochastic gradients of an arbitrary loss function. Similar to probabilistic filtering, the model maintains a Gaussian belief over the optimal weight parameters. Unlike traditional Bayesian updates, the model incorporates a small number of gradient evaluations at locations chosen using Thompson sampling, making it computationally tractable. The belief is then transformed via a linear flow field which optimally updates the belief distribution using rules derived from information theoretic principles. Several versions of the algorithm are shown using different constraints on the flow field and compared with conventional online learning algorithms. Results are given for several classification tasks including logistic regression and multilayer neural networks.