Mathematical & Statistical Methods
Community Structure in Industrial SAT Instances
Ansótegui, Carlos, Bonet, Maria Luisa, Giráldez-Cru, Jesús, Levy, Jordi, Simon, Laurent
Modern SAT solvers have experienced a remarkable progress on solving industrial instances. Most of the techniques have been developed after an intensive experimental process. It is believed that these techniques exploit the underlying structure of industrial instances. However, there are few works trying to exactly characterize the main features of this structure. The research community on complex networks has developed techniques of analysis and algorithms to study real-world graphs that can be used by the SAT community. Recently, there have been some attempts to analyze the structure of industrial SAT instances in terms of complex networks, with the aim of explaining the success of SAT solving techniques, and possibly improving them. In this paper, inspired by the results on complex networks, we study the community structure, or modularity, of industrial SAT instances. In a graph with clear community structure, or high modularity, we can find a partition of its nodes into communities such that most edges connect variables of the same community. In our analysis, we represent SAT instances as graphs, and we show that most application benchmarks are characterized by a high modularity. On the contrary, random SAT instances are closer to the classical Erd\"os-R\'enyi random graph model, where no structure can be observed. We also analyze how this structure evolves by the effects of the execution of a CDCL SAT solver. In particular, we use the community structure to detect that new clauses learned by the solver during the search contribute to destroy the original structure of the formula. This is, learned clauses tend to contain variables of distinct communities.
Quantitative $W_1$ Convergence of Langevin-Like Stochastic Processes with Non-Convex Potential State-Dependent Noise
Cheng, Xiang, Yin, Dong, Bartlett, Peter L., Jordan, Michael I.
Stochastic Gradient Descent (SGD) is one of the workhorses of modern day machine learning. In many nonconvex optimization problems, such as training deep neural networks, SGD is able to produce solutions with good generalization error. Further, there is evidence that the generalization error of an SGD solution can be significantly better than Gradient Descent (GD) [12]. This suggests that, to understand the behavior of SGD, it is not enough to consider the limiting cases (such as small step-size or large batch-size), when it degenerates to GD. We take an alternate view of SGD as a sampling algorithm, and aim to understand its convergence to an appropriate stationary distribution.
Unified Optimal Analysis of the (Stochastic) Gradient Method
In this note we give a simple proof for the convergence of stochastic gradient (SGD) methods on $\mu$-strongly convex functions under a (milder than standard) $L$-smoothness assumption. We show that SGD converges after $T$ iterations as $O\left( L \|x_0-x^\star\|^2 \exp \bigl[-\frac{\mu}{4L}T \bigr] + \frac{\sigma^2}{\mu T} \right)$ where $\sigma^2$ measures the variance. For deterministic gradient descent (GD) and SGD in the interpolation setting we have $\sigma^2 =0$ and we recover the exponential convergence rate. The bound matches with the best known iteration complexity of GD and SGD, up to constants.
General non-linear Bellman equations
van Hasselt, Hado, Quan, John, Hessel, Matteo, Xu, Zhongwen, Borsa, Diana, Barreto, Andre
We consider a general class of non-linear Bellman equations. These open up a design space of algorithms that have interesting properties, which has two potential advantages. First, we can perhaps better model natural phenomena. For instance, hyperbolic discounting has been proposed as a mathematical model that matches human and animal data well, and can therefore be used to explain preference orderings. We present a different mathematical model that matches the same data, but that makes very different predictions under other circumstances. Second, the larger design space can perhaps lead to algorithms that perform better, similar to how discount factors are often used in practice even when the true objective is undiscounted. We show that many of the resulting Bellman operators still converge to a fixed point, and therefore that the resulting algorithms are reasonable and inherit many beneficial properties of their linear counterparts.
Data Science and Linear Algebra Fundamentals with Python, SciPy, & NumPy - Twilio
It should have opened in your default browser. Once it does, we're ready to go. For those of you who are unfamiliar with Jupyter Notebooks, I've provided a brief review of the functions which will be particularly useful for this tutorial. In the image below, you'll see three buttons labeled 1-3 that will be important for you to get a grasp of the save button (1), the add cell button (2), and the run cell button (3). The first button is the button you'll use to save your work as you go along (1).
Comparing Classifiers: Decision Trees, K-NN & Naive Bayes
A myriad of options exist for classification. That said, three popular classification methods-- Decision Trees, k-NN & Naive Bayes--can be tweaked for practically every situation. Naive Bayes and K-NN, are both examples of supervised learning (where the data comes already labeled). Decision trees are easy to use for small amounts of classes. If you're trying to decide between the three, your best option is to take all three for a test drive on your data, and see which produces the best results.
ADASS: Adaptive Sample Selection for Training Acceleration
Zhao, Shen-Yi, Gao, Hao, Li, Wu-Jun
Stochastic gradient decent~(SGD) and its variants, including some accelerated variants, have become popular for training in machine learning. However, in all existing SGD and its variants, the sample size in each iteration~(epoch) of training is the same as the size of the full training set. In this paper, we propose a new method, called \underline{ada}ptive \underline{s}ample \underline{s}election~(ADASS), for training acceleration. During different epoches of training, ADASS only need to visit different training subsets which are adaptively selected from the full training set according to the Lipschitz constants of the loss functions on samples. It means that in ADASS the sample size in each epoch of training can be smaller than the size of the full training set, by discarding some samples. ADASS can be seamlessly integrated with existing optimization methods, such as SGD and momentum SGD, for training acceleration. Theoretical results show that the learning accuracy of ADASS is comparable to that of counterparts with full training set. Furthermore, empirical results on both shallow models and deep models also show that ADASS can accelerate the training process of existing methods without sacrificing accuracy.
Communication-Efficient Accurate Statistical Estimation
Fan, Jianqing, Guo, Yongyi, Wang, Kaizheng
When the data are stored in a distributed manner, direct application of traditional statistical inference procedures is often prohibitive due to communication cost and privacy concerns. This paper develops and investigates two Communication-Efficient Accurate Statistical Estimators (CEASE), implemented through iterative algorithms for distributed optimization. In each iteration, node machines carry out computation in parallel and communicate with the central processor, which then broadcasts aggregated information to node machines for new updates. The algorithms adapt to the similarity among loss functions on node machines, and converge rapidly when each node machine has large enough sample size. Moreover, they do not require good initialization and enjoy linear converge guarantees under general conditions. The contraction rate of optimization errors is presented explicitly, with dependence on the local sample size unveiled. In addition, the improved statistical accuracy per iteration is derived. By regarding the proposed method as a multi-step statistical estimator, we show that statistical efficiency can be achieved in finite steps in typical statistical applications. In addition, we give the conditions under which the one-step CEASE estimator is statistically efficient. Extensive numerical experiments on both synthetic and real data validate the theoretical results and demonstrate the superior performance of our algorithms.
r/MachineLearning - [R]: Painless Stochastic Gradient: Interpolation, Line-Search, and Convergence Rates
The authors use a classic Armijo line-search approach in the context of SGD to automatically tune the line search parameter in training the neural networks. They're also able to prove convergence results on minimizing convex and non-convex objective functions satisfying certain growth conditions. An aside, but as an optimization-head myself, it's nice to see some of the traditional optimization ideas make their way into an ML context.
GOT: An Optimal Transport framework for Graph comparison
Maretic, Hermina Petric, Gheche, Mireille EL, Chierchia, Giovanni, Frossard, Pascal
We present a novel framework based on optimal transport for the challenging problem of comparing graphs. Specifically, we exploit the probabilistic distribution of smooth graph signals defined with respect to the graph topology. This allows us to derive an explicit expression of the Wasserstein distance between graph signal distributions in terms of the graph Laplacian matrices. This leads to a structurally meaningful measure for comparing graphs, which is able to take into account the global structure of graphs, while most other measures merely observe local changes independently. Our measure is then used for formulating a new graph alignment problem, whose objective is to estimate the permutation that minimizes the distance between two graphs. We further propose an efficient stochastic algorithm based on Bayesian exploration to accommodate for the non-convexity of the graph alignment problem. We finally demonstrate the performance of our novel framework on different tasks like graph alignment, graph classification and graph signal prediction, and we show that our method leads to significant improvement with respect to the-state-of-art algorithms.