Statistical Learning
Nonparametric sparsity and regularization
Rosasco, Lorenzo, Villa, Silvia, Mosci, Sofia, Santoro, Matteo, verri, Alessandro
It is now common to see practical applications, for example in bioinformatics and computer vision, where the dimensionality of the data is in the order of hundreds, thousands and even tens of thousands. It is known that learning in such a high dimensional regime is feasible only if the quantity to be estimated satisfies some regularity assumptions [24]. In particular, the idea behind, so called, sparsity is that the quantity of interest depends only on a few relevant variables (dimensions). In turn, this latter assumption is often at the basis of the construction of interpretable data models, since the relevant dimensions allow for a compact, hence interpretable, representation. An instance of the above situation is the problem of learning from samples a multivariate function which depends only on a (possibly small) subset of relevant variables. Detecting such variables is the problem of variable selection. Largely motivated by recent advances in compressed sensing [15, 25], the above problem has been extensively studied under the assumption that the function of interest (target function) depends linearly to the relevant variables.
Detecting Events and Patterns in Large-Scale User Generated Textual Streams with Statistical Learning Methods
A vast amount of textual web streams is influenced by events or phenomena emerging in the real world. The social web forms an excellent modern paradigm, where unstructured user generated content is published on a regular basis and in most occasions is freely distributed. The present Ph.D. Thesis deals with the problem of inferring information - or patterns in general - about events emerging in real life based on the contents of this textual stream. We show that it is possible to extract valuable information about social phenomena, such as an epidemic or even rainfall rates, by automatic analysis of the content published in Social Media, and in particular Twitter, using Statistical Machine Learning methods. An important intermediate task regards the formation and identification of features which characterise a target event; we select and use those textual features in several linear, non-linear and hybrid inference approaches achieving a significantly good performance in terms of the applied loss function. By examining further this rich data set, we also propose methods for extracting various types of mood signals revealing how affective norms - at least within the social web's population - evolve during the day and how significant events emerging in the real world are influencing them. Lastly, we present some preliminary findings showing several spatiotemporal characteristics of this textual information as well as the potential of using it to tackle tasks such as the prediction of voting intentions.
The Graphical Lasso: New Insights and Alternatives
Mazumder, Rahul, Hastie, Trevor
The graphical lasso \citep{FHT2007a} is an algorithm for learning the structure in an undirected Gaussian graphical model, using $\ell_1$ regularization to control the number of zeros in the precision matrix ${\B\Theta}={\B\Sigma}^{-1}$ \citep{BGA2008,yuan_lin_07}. The {\texttt R} package \GL\ \citep{FHT2007a} is popular, fast, and allows one to efficiently build a path of models for different values of the tuning parameter. Convergence of \GL\ can be tricky; the converged precision matrix might not be the inverse of the estimated covariance, and occasionally it fails to converge with warm starts. In this paper we explain this behavior, and propose new algorithms that appear to outperform \GL. By studying the "normal equations" we see that, \GL\ is solving the {\em dual} of the graphical lasso penalized likelihood, by block coordinate ascent; a result which can also be found in \cite{BGA2008}. In this dual, the target of estimation is $\B\Sigma$, the covariance matrix, rather than the precision matrix $\B\Theta$. We propose similar primal algorithms \PGL\ and \DPGL, that also operate by block-coordinate descent, where $\B\Theta$ is the optimization target. We study all of these algorithms, and in particular different approaches to solving their coordinate sub-problems. We conclude that \DPGL\ is superior from several points of view.
Structured Prediction Cascades
Weiss, David, Sapp, Benjamin, Taskar, Ben
Structured prediction tasks pose a fundamental trade-off between the need for model complexity to increase predictive power and the limited computational resources for inference in the exponentially-sized output spaces such models require. We formulate and develop the Structured Prediction Cascade architecture: a sequence of increasingly complex models that progressively filter the space of possible outputs. The key principle of our approach is that each model in the cascade is optimized to accurately filter and refine the structured output state space of the next model, speeding up both learning and inference in the next layer of the cascade. We learn cascades by optimizing a novel convex loss function that controls the trade-off between the filtering efficiency and the accuracy of the cascade, and provide generalization bounds for both accuracy and efficiency. We also extend our approach to intractable models using tree-decomposition ensembles, and provide algorithms and theory for this setting. We evaluate our approach on several large-scale problems, achieving state-of-the-art performance in handwriting recognition and human pose recognition. We find that structured prediction cascades allow tremendous speedups and the use of previously intractable features and models in both settings.
Payment Rules through Discriminant-Based Classifiers
Duetting, Paul, Fischer, Felix, Jirapinyo, Pitchayut, Lai, John K., Lubin, Benjamin, Parkes, David C.
In mechanism design it is typical to impose incentive compatibility and then derive an optimal mechanism subject to this constraint. By replacing the incentive compatibility requirement with the goal of minimizing expected ex post regret, we are able to adapt statistical machine learning techniques to the design of payment rules. This computational approach to mechanism design is applicable to domains with multi-dimensional types and situations where computational efficiency is a concern. Specifically, given an outcome rule and access to a type distribution, we train a support vector machine with a special discriminant function structure such that it implicitly establishes a payment rule with desirable incentive properties. We discuss applications to a multi-minded combinatorial auction with a greedy winner-determination algorithm and to an assignment problem with egalitarian outcome rule. Experimental results demonstrate both that the construction produces payment rules with low ex post regret, and that penalizing classification errors is effective in preventing failures of ex post individual rationality.
Toward an Integrated Framework for Automated Development and Optimization of Online Advertising Campaigns
Thomaidou, Stamatina, Vazirgiannis, Michalis, Liakopoulos, Kyriakos
Creating and monitoring competitive and cost-effective pay-per-click advertisement campaigns through the web-search channel is a resource demanding task in terms of expertise and effort. Assisting or even automating the work of an advertising specialist will have an unrivaled commercial value. In this paper we propose a methodology, an architecture, and a fully functional framework for semi- and fully- automated creation, monitoring, and optimization of cost-efficient pay-per-click campaigns with budget constraints. The campaign creation module generates automatically keywords based on the content of the web page to be advertised extended with corresponding ad-texts. These keywords are used to create automatically the campaigns fully equipped with the appropriate values set. The campaigns are uploaded to the auctioneer platform and start running. The optimization module focuses on the learning process from existing campaign statistics and also from applied strategies of previous periods in order to invest optimally in the next period. The objective is to maximize the performance (i.e. clicks, actions) under the current budget constraint. The fully functional prototype is experimentally evaluated on real world Google AdWords campaigns and presents a promising behavior with regards to campaign performance statistics as it outperforms systematically the competing manually maintained campaigns.
Cross-conformal predictors
The method of conformal prediction produces set predictions that are automatically valid in the sense that their unconditional coverage probability is equal to or exceeds a preset confidence level ([14], Chapter 2). A more computationally efficient method of this kind is that of inductive conformal prediction ([12], [14], Section 4.1, [1]). However, inductive conformal predictors are typically less predictively efficient, in the sense of producing larger prediction sets as compared with conformal predictors. Motivated by the method of cross-validation [11, 13], this note explores a hybrid method, which we call cross-conformal prediction. We are mainly interested in the problems of classification and regression, in which we are given a training set consisting of examples, each example consisting of an object and a label, and asked to predict the label of a new test object; in the problem of classification labels are elements of a given finite set, and in the problem of regression labels are real numbers. If we are asked to predict labels for more than one test objects, the same prediction procedure can be applied to each test object separately. In this introductory section and in our empirical studies we consider the problem of binary classification, in which labels can take only two values, which we will encode as 0 and 1. We always assume that the examples (both the training examples and the test examples, consisting of given objects and unknown labels) are generated independently from the same probability measure; this assumption will be called the assumption of randomness.
Ancestral Inference from Functional Data: Statistical Methods and Numerical Examples
Hadjipantelis, Pantelis Z., Jones, Nick S., Moriarty, John, Springate, David, Knight, Christopher G.
Many biological characteristics of evolutionary interest are not scalar variables but continuous functions. Here we use phylogenetic Gaussian process regression to model the evolution of simulated function-valued traits. Given function-valued data only from the tips of an evolutionary tree and utilising independent principal component analysis (IPCA) as a method for dimension reduction, we construct distributional estimates of ancestral function-valued traits, and estimate parameters describing their evolutionary dynamics.
Oracle inequalities for computationally adaptive model selection
Agarwal, Alekh, Bartlett, Peter L., Duchi, John C.
We analyze general model selection procedures using penalized empirical loss minimization under computational constraints. While classical model selection approaches do not consider computational aspects of performing model selection, we argue that any practical model selection procedure must not only trade off estimation and approximation error, but also the computational effort required to compute empirical minimizers for different function classes. We provide a framework for analyzing such problems, and we give algorithms for model selection under a computational budget. These algorithms satisfy oracle inequalities that show that the risk of the selected model is not much worse than if we had devoted all of our omputational budget to the optimal function class.
Ergodic Mirror Descent
Duchi, John C., Agarwal, Alekh, Johansson, Mikael, Jordan, Michael I.
We generalize stochastic subgradient descent methods to situations in which we do not receive independent samples from the distribution over which we optimize, but instead receive samples that are coupled over time. We show that as long as the source of randomness is suitably ergodic---it converges quickly enough to a stationary distribution---the method enjoys strong convergence guarantees, both in expectation and with high probability. This result has implications for stochastic optimization in high-dimensional spaces, peer-to-peer distributed optimization schemes, decision problems with dependent data, and stochastic optimization problems over combinatorial spaces.