 Mathematical & Statistical Methods


Randomized Primal-Dual Proximal Block Coordinate Updates

arXiv.org Machine Learning

In this paper we propose a randomized primal-dual proximal block coordinate updating framework for a general multi-block convex optimization model with coupled objective function and linear constraints. Assuming mere convexity, we establish its $O(1/t)$ convergence rate in terms of the objective value and feasibility measure. The framework includes several existing algorithms as special cases, such as a primal-dual method for bilinear saddle-point problems (PD-S), the proximal Jacobian ADMM (Prox-JADMM) and a randomized variant of the ADMM method for multi-block convex optimization. Our analysis recovers and/or strengthens the convergence properties of several existing algorithms. For example, for PD-S our result leads to the same order of convergence rate without the previously assumed boundedness condition on the constraint sets, and for Prox-JADMM the new result provides a convergence rate in terms of the objective value and the feasibility violation. It is well known that the original ADMM may fail to converge when the number of blocks exceeds two. Our result shows that if an appropriate randomization procedure is invoked to select the updating blocks, then a sublinear rate of convergence in expectation can be guaranteed for multi-block ADMM, without assuming any strong convexity. The new approach is also extended to solve problems where only a stochastic approximation of the (sub-)gradient of the objective is available, and we establish an $O(1/\sqrt{t})$ convergence rate of the extended approach for solving stochastic programming.


MIT Computational Cognitive Science Group

AITopics Original Links

We study the computational basis of human learning and inference, using empirical methods and formal tools to uncover its mechanisms. Through a combination of mathematical modeling, computer simulation, and behavioral experiments, we try to uncover the logic behind our everyday inductive leaps: constructing perceptual representations, separating "style" and "content" in perception, learning concepts and words, judging similarity or representativeness, inferring causal connections, noticing coincidences, predicting the future. We approach these topics with a range of empirical methods (primarily behavioral testing of adults, children, and machines) and formal tools (drawn chiefly from Bayesian statistics and probability theory, but also from geometry, graph theory, and linear algebra). Our work is driven by the complementary goals of trying to achieve a better understanding of human learning in computational terms and trying to build computational systems that come closer to the capacities of human learners.


Getting started with Python and the IPython notebook -- Computational Statistics in Python 0.1 documentation

#artificialintelligence

The IPython notebook is an interactive, web-based environment that allows one to combine code, text and graphics into one unified document. All of the lectures in this course have been developed using this tool. In this lecture, we will introduce the notebook interface and demonstrate some of its features. New: A new version of the IPython notebook known as Jupyter supports multiple kernels (different languages) and other enhancements. For a tour of its features, see this notebook.


Facebook's advice to students interested in artificial intelligence

#artificialintelligence

That's the gist of the advice to students interested in AI from Facebook's Yann LeCun and Joaquin Quiñonero Candela, who run the company's Artificial Intelligence Lab and Applied Machine Learning group respectively. Tech companies often advocate STEM (science, technology, engineering and math), but today's tips are particularly pointed. The pair specifically note that students should eat their vegetables: take Calc I, Calc II, Calc III, Linear Algebra, Probability and Statistics as early as possible. From this list, probability and statistics are perhaps the most interesting. From what I remember about high school, those two subjects are regularly dismissed as too-obvious strategies for skirting the informal AP Calculus preference of top colleges and universities (AP Statistics is often thought of as a cop-out by students).


A Primer on Coordinate Descent Algorithms

arXiv.org Machine Learning

This particular class of algorithms has recently gained popularity due to its effectiveness in solving large-scale optimization problems in machine learning, compressed sensing, image processing, and computational statistics. Coordinate descent algorithms solve optimization problems by successively minimizing along each coordinate or coordinate hyperplane, which is ideal for parallelized and distributed computing. Avoiding detailed technicalities and proofs, this monograph gives relevant theory and examples for practitioners to effectively apply coordinate descent to modern problems in data science and engineering. To keep the primer up-to-date, we intend to publish this monograph only after no additional topics need to be added and we foresee no further major advances in the area.
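
As a rough illustration of "successively minimizing along each coordinate", the sketch below runs cyclic coordinate descent on a small least-squares problem, min_x 0.5 * ||A x - b||^2. It is not taken from the monograph: the function name `coordinate_descent`, the fixed number of sweeps, and the toy data `A` and `b` are all assumptions chosen for the example.

```python
def coordinate_descent(A, b, n_sweeps=100):
    """Cyclic coordinate descent for min_x 0.5 * ||A x - b||^2 (illustrative sketch)."""
    m, n = len(A), len(A[0])
    x = [0.0] * n
    # Residual r = A x - b; with x = 0 this is simply -b.
    r = [-bi for bi in b]
    # Squared norm of each column of A, used as the 1-D curvature.
    col_sq = [sum(A[i][j] ** 2 for i in range(m)) for j in range(n)]
    for _ in range(n_sweeps):
        for j in range(n):
            if col_sq[j] == 0.0:
                continue  # objective does not depend on this coordinate
            # Partial derivative along coordinate j: A_j^T r.
            g = sum(A[i][j] * r[i] for i in range(m))
            # Exact minimization along coordinate j.
            step = -g / col_sq[j]
            x[j] += step
            # Keep the residual consistent with the updated x.
            for i in range(m):
                r[i] += step * A[i][j]
    return x


# Hypothetical toy data, constructed so that A x = b holds exactly at x = (1, 1).
A = [[2.0, 1.0], [1.0, 3.0], [0.0, 1.0]]
b = [3.0, 4.0, 1.0]
x = coordinate_descent(A, b)
```

Each inner update touches a single column of `A`, which is what makes the per-step cost low on large problems; the residual is updated incrementally rather than recomputed from scratch.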


How to Build Beautiful 3-D Fractals Out of the Simplest Equations

WIRED

If you came across an animal in the wild and wanted to learn more about it, there are a few things you might do: You might watch what it eats, poke it to see how it reacts, and even dissect it if you got the chance. Mathematicians are not so different from naturalists. Rather than studying organisms, they study equations and shapes using their own techniques. They twist and stretch mathematical objects, translate them into new mathematical languages, and apply them to new problems. As they find new ways to look at familiar things, the possibilities for insight multiply.


Learning from Conditional Distributions via Dual Embeddings

arXiv.org Machine Learning

Many machine learning tasks, such as learning with invariance and policy evaluation in reinforcement learning, can be characterized as problems of learning from conditional distributions. In such problems, each sample $x$ itself is associated with a conditional distribution $p(z|x)$ represented by samples $\{z_i\}_{i=1}^M$, and the goal is to learn a function $f$ that links these conditional distributions to target values $y$. These learning problems become very challenging when we only have limited samples or, in the extreme case, only one sample from each conditional distribution. Commonly used approaches either assume that $z$ is independent of $x$, or require an overwhelmingly large number of samples from each conditional distribution. To address these challenges, we propose a novel approach which employs a new min-max reformulation of the learning from conditional distribution problem. With this new reformulation, we only need to deal with the joint distribution $p(z,x)$. We also design an efficient learning algorithm, Embedding-SGD, and establish theoretical sample complexity for such problems. Finally, our numerical experiments on both synthetic and real-world datasets show that the proposed approach can significantly improve over the existing algorithms.


Incremental Variational Sparse Gaussian Process Regression

Neural Information Processing Systems

Recent work on scaling up Gaussian process regression (GPR) to large datasets has primarily focused on sparse GPR, which leverages a small set of basis functions to approximate the full Gaussian process during inference. However, the majority of these approaches are batch methods that operate on the entire training dataset at once, precluding the use of datasets that are streaming or too large to fit into memory. Although previous work has considered incrementally solving variational sparse GPR, most algorithms fail to update the basis functions and therefore perform suboptimally. We propose a novel incremental learning algorithm for variational sparse GPR based on stochastic mirror ascent of probability densities in reproducing kernel Hilbert space. This new formulation allows our algorithm to update basis functions online in accordance with the manifold structure of probability densities for fast convergence. We conduct several experiments and show that our proposed approach achieves better empirical performance in terms of prediction error than the recent state-of-the-art incremental solutions to variational sparse GPR.


Adaptive Newton Method for Empirical Risk Minimization to Statistical Accuracy

Neural Information Processing Systems

We consider empirical risk minimization for large-scale datasets. We introduce Ada Newton as an adaptive algorithm that uses Newton's method with adaptive sample sizes. The main idea of Ada Newton is to increase the size of the training set by a factor larger than one in a way that the minimization variable for the current training set is in the local neighborhood of the optimal argument of the next training set. This allows us to exploit the quadratic convergence property of Newton's method and reach the statistical accuracy of each training set with only one iteration of Newton's method. We show theoretically that we can iteratively increase the sample size while applying single Newton iterations without line search and staying within the statistical accuracy of the regularized empirical risk. In particular, we can double the size of the training set in each iteration when the number of samples is sufficiently large. Numerical experiments on various datasets confirm the possibility of increasing the sample size by a factor of 2 at each iteration, which implies that Ada Newton achieves the statistical accuracy of the full training set with about two passes over the dataset.


Estimating the Size of a Large Network and its Communities from a Random Sample

Neural Information Processing Systems

Most real-world networks are too large to be measured or studied directly and there is substantial interest in estimating global network properties from smaller sub-samples. One of the most important global properties is the number of vertices/nodes in the network. Estimating the number of vertices in a large network is a major challenge in computer science, epidemiology, demography, and intelligence analysis. In this paper we consider a population random graph G = (V, E) from the stochastic block model (SBM) with K communities/blocks. A sample is obtained by randomly choosing a subset W and letting G(W) be the induced subgraph in G of the vertices in W. In addition to G(W), we observe the total degree of each sampled vertex and its block membership. Given this partial information, we propose an efficient PopULation Size Estimation algorithm, called PULSE, that accurately estimates the size of the whole population as well as the size of each community. To support our theoretical analysis, we perform an exhaustive set of experiments to study the effects of sample size, K, and SBM model parameters on the accuracy of the estimates. The experimental results also demonstrate that PULSE significantly outperforms a widely-used method called the network scale-up estimator in a wide variety of scenarios.
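
The sampling setup described above can be sketched in a few lines. The example below is not PULSE; it is a naive ratio estimator, shown only to make the observed quantities concrete (the induced subgraph G(W) and the total degree of each sampled vertex). It assumes a single-block SBM (K = 1, every pair connected independently with probability `p_edge`) and uniform sampling without replacement, under which each neighbor of a sampled vertex lands in W with probability (|W| - 1) / (N - 1). The function name and all parameter values are hypothetical.

```python
import random


def sample_and_estimate(n_total=600, p_edge=0.1, n_sample=150, seed=0):
    """Naive population-size estimate from an induced subgraph (illustrative, not PULSE)."""
    rng = random.Random(seed)
    # Population graph: single-block SBM, stored as neighbor sets.
    neighbors = [set() for _ in range(n_total)]
    for u in range(n_total):
        for v in range(u + 1, n_total):
            if rng.random() < p_edge:
                neighbors[u].add(v)
                neighbors[v].add(u)
    # Uniform random sample W of vertices.
    W = set(rng.sample(range(n_total), n_sample))
    # Observed: total degree in G and degree inside G(W) for each sampled vertex.
    total_deg = sum(len(neighbors[v]) for v in W)
    induced_deg = sum(len(neighbors[v] & W) for v in W)
    if induced_deg == 0:
        return None  # no edges inside the sample, ratio undefined
    # E[induced_deg] = total_deg * (|W| - 1) / (N - 1), solved for N.
    return 1 + (n_sample - 1) * total_deg / induced_deg


estimate = sample_and_estimate()
```

The ratio uses only what the abstract says is observed for sampled vertices; it ignores block membership entirely, which is one reason a method that models the K communities could be expected to do better.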