Statistical Learning
Regularized Laplacian Estimation and Fast Eigenvector Approximation
Perry, Patrick O., Mahoney, Michael W.
Recently, Mahoney and Orecchia demonstrated that popular diffusion-based procedures to compute a quick \emph{approximation} to the first nontrivial eigenvector of a data graph Laplacian \emph{exactly} solve certain regularized Semi-Definite Programs (SDPs). In this paper, we extend that result by providing a statistical interpretation of their approximation procedure. Our interpretation will be analogous to the manner in which $\ell_2$-regularized or $\ell_1$-regularized $\ell_2$-regression (often called Ridge regression and Lasso regression, respectively) can be interpreted in terms of a Gaussian prior or a Laplace prior, respectively, on the coefficient vector of the regression problem. Our framework will imply that the solutions to the Mahoney-Orecchia regularized SDP can be interpreted as regularized estimates of the pseudoinverse of the graph Laplacian. Conversely, it will imply that the solution to this regularized estimation problem can be computed very quickly by running, e.g., the fast diffusion-based PageRank procedure for computing an approximation to the first nontrivial eigenvector of the graph Laplacian. Empirical results are also provided to illustrate the manner in which approximate eigenvector computation \emph{implicitly} performs statistical regularization, relative to running the corresponding exact algorithm.
Ground Metric Learning
We consider in this paper the problem of supervised metric learning on normalized histograms. Normalized histograms arise frequently in natural language processing, computer vision, bioinformatics and more generally areas involving complex datatypes. Objects of interest in such areas are usually simplified and each represented as a bag of smaller features. The occurrence frequencies of each of these features in the considered object can then be represented as a histogram. For instance, the representation of images as histograms of pixel colors, SIFT or GIST features (Lowe 1 1999, Oliva and Torralba 2001, Douze et al. 2009); texts as bags-of-words or topic allocations (Joachims 2002, Blei et al. 2003, Blei and Lafferty 2009); sequences as n-grams counts (Leslie et al. 2002) and graphs as histograms of subgraphs (Kashima et al. 2003) all follow this principle. Various distances have been proposed in the statistics and machine learning literatures to compare two histograms(Deza and Deza 2009,§14), (Rachev 1991). Our focus is in this paper is on the family of transportation distances, which is both well motivated theoretically (Villani 2003, §7), (Rachev 1991, §5) and works well empirically (Rubner et al. 1997; 2000, Pele and Werman 2009). Transportation distances are particularly popular in computer vision, where, after the influential work of Rubner et al. (1997), they were called Earth Mover's Distances (EMD). Transportation distances in machine learning can be thought of as metadistances that build upon a metric on the features to form a distance on histograms of features.
An MDL framework for sparse coding and dictionary learning
Ramírez, Ignacio, Sapiro, Guillermo
The power of sparse signal modeling with learned over-complete dictionaries has been demonstrated in a variety of applications and fields, from signal processing to statistical inference and machine learning. However, the statistical properties of these models, such as under-fitting or over-fitting given sets of data, are still not well characterized in the literature. As a result, the success of sparse modeling depends on hand-tuning critical parameters for each data and application. This work aims at addressing this by providing a practical and objective characterization of sparse models by means of the Minimum Description Length (MDL) principle -- a well established information-theoretic approach to model selection in statistical inference. The resulting framework derives a family of efficient sparse coding and dictionary learning algorithms which, by virtue of the MDL principle, are completely parameter free. Furthermore, such framework allows to incorporate additional prior information to existing models, such as Markovian dependencies, or to define completely new problem formulations, including in the matrix analysis area, in a natural way. These virtues will be demonstrated with parameter-free algorithms for the classic image denoising and classification problems, and for low-rank matrix recovery in video applications.
Instant Replay: Investigating statistical Analysis in Sports
Technology has had an unquestionable impact on the way people watch sports. Along with this technological evolution has come a higher standard to ensure a good viewing experience for the casual sports fan. It can be argued that the pervasion of statistical analysis in sports serves to satiate the fan's desire for detailed sports statistics. The goal of statistical analysis in sports is a simple one: to eliminate subjective analysis. In this paper, we review previous work that attempts to analyze various aspects in sports by using ideas from Markov Chains, Bayesian Inference and Markov Chain Monte Carlo (MCMC) methods. The unifying goal of these works is to achieve an accurate representation of the player's ability, the sport, or the environmental effects on the player's performance. With the prevalence of cheap computation, it is possible that using techniques in Artificial Intelligence could improve the result of statistical analysis in sport. This is best illustrated when evaluating football using Neuro Dynamic Programming, a Control Theory paradigm heavily based on theory in Stochastic processes. The results from this method suggest that statistical analysis in sports may benefit from using ideas from the area of Control Theory or Machine Learning
Model Selection Consistency for Cointegrating Regressions
We study the asymptotic properties of the adaptive Lasso in cointegration regressions in the case where all covariates are weakly exogenous. We assume the number of candidate I(1) variables is sub-linear with respect to the sample size (but possibly larger) and the number of candidate I(0) variables is polynomial with respect to the sample size. We show that, under classical conditions used in cointegration analysis, this estimator asymptotically chooses the correct subset of variables in the model and its asymptotic distribution is the same as the distribution of the OLS estimate given the variables in the model were known in beforehand (oracle property). We also derive an algorithm based on the local quadratic approximation and present a numerical study to show the adequacy of the method in finite samples.
Personalized Procedural Content Generation to Minimize Frustration and Boredom Based on Ranking Algorithm
Yu, Hong (Georgia Institute of Technology) | Trawick, Tyler (Georgia Institute of Technology)
A growing research community is working towards procedurally generating content for computer games and simulation applications with various player modeling techniques. In this paper, we present a two-step procedural content generation framework to minimize players' frustration and/or boredom according to player feedback and gameplay features. In the first step, we dynamically categorize the player styles based on a simple questionnaire beforehand and the gameplay features. In the second step, two player models (frustration and boredom) are built for each player style category. A ranking algorithm is utilized for player modeling to address two problems inherent in player feedback: inconsistency and inaccuracy. Experiment results on a testbed game show that our framework can generate less boring/frustrating levels with very high probabilities.
Detecting Real Money Traders in MMORPG by Using Trading Network
Fujita, Atsushi (Future University Hakodate) | Itsuki, Hiroshi (Future University Hakodate) | Matsubara, Hitoshi (Future University Hakodate)
We have developed a method for detecting real money traders (RMTers) to support the operators of massively multiplayer online role-playing games (MMORPGs). RMTers, who earn currency in the real world by selling properties in the virtual world, tend to form alliances and frequently exchange a huge volume of virtual currency within such a community. The proposed method exploits (1) the trading network, to identify the communities of characters, and (2) the volume of trades, to estimate the likelihood of communities and characters becoming engaged in real money trading. The results of an experiment using actual log data from a commercial MMORPG showed that using the trading network is more effective in detecting RMTers than conventional machine learning methods that assess individual character without referring to the trading network.
Asymptotically Independent Markov Sampling: a new MCMC scheme for Bayesian Inference
Beck, James L., Zuev, Konstantin M.
In Bayesian statistics, many problems can be expressed as the evaluation of the expectation of a quantity of interest with respect to the posterior distribution. Standard Monte Carlo method is often not applicable because the encountered posterior distributions cannot be sampled directly. In this case, the most popular strategies are the importance sampling method, Markov chain Monte Carlo, and annealing. In this paper, we introduce a new scheme for Bayesian inference, called Asymptotically Independent Markov Sampling (AIMS), which is based on the above methods. We derive important ergodic properties of AIMS. In particular, it is shown that, under certain conditions, the AIMS algorithm produces a uniformly ergodic Markov chain. The choice of the free parameters of the algorithm is discussed and recommendations are provided for this choice, both theoretically and heuristically based. The efficiency of AIMS is demonstrated with three numerical examples, which include both multi-modal and higher-dimensional target posterior distributions.
On the trade-off between complexity and correlation decay in structural learning algorithms
Bento, José, Montanari, Andrea
We consider the problem of learning the structure of Ising models (pairwise binary Markov random fields) from i.i.d. samples. While several methods have been proposed to accomplish this task, their relative merits and limitations remain somewhat obscure. By analyzing a number of concrete examples, we show that low-complexity algorithms often fail when the Markov random field develops long-range correlations. More precisely, this phenomenon appears to be related to the Ising model phase transition (although it does not coincide with it).
Discovering patterns of correlation and similarities in software project data with the Circos visualization tool
Kosti, Makrina Viola, Lazaridou, Sofia, Bourazani, Nikoleta, Angelis, Lefteris
Software cost estimation based on multivariate data from completed projects requires the building of efficient models. These models essentially describe relations in the data, either on the basis of correlations between variables or of similarities between the projects. The continuous growth of the amount of data gathered and the need to perform preliminary analysis in order to discover patterns able to drive the building of reasonable models, leads the researchers towards intelligent and time-saving tools which can effectively describe data and their relationships. The goal of this paper is to suggest an innovative visualization tool, widely used in bioinformatics, which represents relations in data in an aesthetic and intelligent way. In order to illustrate the capabilities of the tool, we use a well known dataset from software engineering projects.