Mathematical & Statistical Methods
Chi-Squared For Feature Selection using SelectKBest
In this video, I'll show you how SelectKBest uses Chi-squared test for feature selection for categorical features & target columns. We calculate Chi-square between each feature & the target & select the desired number of features with best Chi-square scores or the lowest p-values. The Chi-squared (χ2) test is used in statistics to test the independence of two events. More specifically in feature selection we use it to test whether the occurrence of a specific feature & the target are independent or not. For each feature & target combination, a corresponding high χ2 chi-square score or a low p-value indicates that the target column is dependent on the feature column.
Machine learning for causal inference in Biostatistics
General inference problems and quantifying uncertainty have long been the cornerstone of statistical science. While machine learning advances have permeated many disciplines, inference for these procedures, and in particular, causal inference, has not been widespread. However, this is rapidly changing. As different scientific fields begin to converge on machine learning for causal inference, we thought now would be an excellent time to have a public discussion. In our roles as editors of Biostatistics, we decided to organize a series of commentaries on the topic from scholars with expertise in statistics, computer science, epidemiology, health economics, policy, and law.
Consistent recovery threshold of hidden nearest neighbor graphs
Ding, Jian, Wu, Yihong, Xu, Jiaming, Yang, Dana
Jian Ding, Yihong Wu, Jiaming Xu, and Dana Yang November 20, 2019 Abstract Motivated by applications such as discovering strong ties in social networks and assembling genome subsequences in biology, we study the problem of recovering a hidden 2 k -nearest neighbor (NN) graph in an n -vertex complete graph, whose edge weights are independent and distributed according to P n for edges in the hidden 2 k -NN graph and Q n otherwise. We focus on two types of asymptotic recovery guarantees as n: (1) exact recovery: all edges are classified correctly with probability tending to one; (2) almost exact recovery: the expected number of misclassified edges is o (nk). We show that the maximum likelihood estimator achieves (1) exact recovery for 2 k n o(1) if lim inf 2α n log n 1; (2) almost exact recovery for 1 k o null log n log log nnull if lim inf kD ( P n Q n) log n 1, where α n null 2 log null dP ndQ n is the R enyi divergence of order 1 2 and D (P n Q n) is the Kullback-Leibler divergence.
Neutrinos Lead to Unexpected Discovery in Basic Math Quanta Magazine
After breakfast one morning in August, the mathematician Terence Tao opened an email from three physicists he didn't know. The trio explained that they'd stumbled across a simple formula that, if true, established an unexpected relationship between some of the most basic and important objects in linear algebra. The formula "looked too good to be true," said Tao, who is a professor at the University of California, Los Angeles, a Fields medalist, and one of the world's leading mathematicians. "Something this short and simple -- it should have been in textbooks already," he said. "So my first thought was, no, this can't be true."
Understanding the Role of Momentum in Stochastic Gradient Methods
Gitman, Igor, Lang, Hunter, Zhang, Pengchuan, Xiao, Lin
The use of momentum in stochastic gradient methods has become a widespread practice in machine learning. Different variants of momentum, including heavy-ball momentum, Nesterov's accelerated gradient (NAG), and quasi-hyperbolic momentum (QHM), have demonstrated success on various tasks. Despite these empirical successes, there is a lack of clear understanding of how the momentum parameters affect convergence and various performance measures of different algorithms. In this paper, we use the general formulation of QHM to give a unified analysis of several popular algorithms, covering their asymptotic convergence conditions, stability regions, and properties of their stationary distributions. In addition, by combining the results on convergence rates and stationary distributions, we obtain sometimes counter-intuitive practical guidelines for setting the learning rate and momentum parameters.
Adaptive Sampling Quasi-Newton Methods for Derivative-Free Stochastic Optimization
Bollapragada, Raghu, Wild, Stefan M.
We consider stochastic zero-order optimization problems, which arise in settings from simulation optimization to reinforcement learning. We propose an adaptive sampling quasi-Newton method where we estimate the gradients of a stochastic function using finite differences within a common random number framework. We employ modified versions of a norm test and an inner product quasi-Newton test to control the sample sizes used in the stochastic approximations. We provide preliminary numerical experiments to illustrate potential performance benefits of the proposed method.
Non-Gaussianity of Stochastic Gradient Noise
Panigrahi, Abhishek, Somani, Raghav, Goyal, Navin, Netrapalli, Praneeth
What enables Stochastic Gradient Descent (SGD) to achieve better generalization than Gradient Descent (GD) in Neural Network training? This question has attracted much attention. In this paper, we study the distribution of the Stochastic Gradient Noise (SGN) vectors during the training. We observe that for batch sizes 256 and above, the distribution is best described as Gaussian at-least in the early phases of training. This holds across data-sets, architectures, and other choices.
INFORMS Annual Meeting 2019 - IBM Decision Optimization: on Cloud, for Bluemix...
These are the presentations that were made by the IBM Decision Optimization team at the INFORMS Annual Meeting, Seattle, October 2019. Andrea Tramontani, Recent Progress In CPLEX Benders Decomposition In this talk we present the Benders decomposition branch-and-cut that is implemented in CPLEX for Mixed Integer Linear Programming (MILP). We illustrate the main algorithmic components behind our implementation and discuss the latest improvements that are currently work in progress. Finally, we present an extensive computational analysis on some classes of decomposable MILP problems, to assess the performance of Benders branch-and-cut in comparison with the default branch-and-cut of CPLEX. The results show that some models that are out of reach for a "standard" branch-and-cut can instead be solved by Benders decomposition.
Faster Stochastic Algorithms via History-Gradient Aided Batch Size Adaptation
Ji, Kaiyi, Wang, Zhe, Zhou, Yi, Liang, Yingbin
Various schemes for adapting batch size have been recently proposed to accelerate stochastic algorithms. However, existing schemes either apply prescribed batch size adaption or require additional backtracking and condition verification steps to exploit the information along optimization path. In this paper, we propose an easy-to-implement scheme for adapting batch size by exploiting history stochastic gradients, based on which we propose the Adaptive batch size SGD (AbaSGD), AbaSVRG, and AbaSPIDER algorithms. To handle the dependence of the batch size on history stochastic gradients, we develop a new convergence analysis technique, and show that these algorithms achieve improved overall complexity over their vanilla counterparts. Moreover, their convergence rates are adaptive to the optimization landscape that the iterate experiences. Extensive experiments demonstrate that our algorithms substantially outperform existing competitive algorithms.
Sampling random graph homomorphisms and applications to network data analysis
Lyu, Hanbaek, Memoli, Facundo, Sivakoff, David
A graph homomorphism is a map between two graphs that preserves adjacency relations. We consider the problem of sampling a random graph homomorphism from a graph $F$ into a large network $\mathcal{G}$. When $\mathcal{G}$ is the complete graph with $q$ nodes, this becomes the well-known problem of sampling uniform $q$-colorings of $F$. We propose two complementary MCMC algorithms for sampling a random graph homomorphisms and establish bounds on their mixing times and concentration of their time averages. Based on our sampling algorithms, we propose a novel framework for network data analysis that circumvents some of the drawbacks in methods based on independent and neigborhood sampling. Various time averages of the MCMC trajectory give us real-, function-, and network-valued computable observables, including well-known ones such as homomorphism density and average clustering coefficient. One of the main observable we propose is called the conditional homomorphism density profile, which reveals hierarchical structure of the network. Furthermore, we show that these network observables are stable with respect to a suitably renormalized cut distance between networks. We also provide various examples and simulations demonstrating our framework through synthetic and real-world networks. For instance, we apply our framework to analyze Word Adjacency Networks of a 45 novels data set and propose an authorship attribution scheme using motif sampling and conditional homomorphism density profiles.