Bresler, Guy
Efficient reductions between some statistical models
Lou, Mengqi, Bresler, Guy, Pananjady, Ashwin
We study the problem of approximately transforming a sample from a source statistical model to a sample from a target statistical model without knowing the parameters of the source model, and construct several such computationally efficient reductions between statistical experiments. In particular, we provide computationally efficient procedures that approximately reduce uniform, Erlang, and Laplace location models to general target families. We illustrate our methodology by establishing nonasymptotic reductions between some canonical high-dimensional problems, spanning mixtures of experts, phase retrieval, and signal denoising. Notably, the reductions are structure-preserving and can accommodate missing data. We also point to a possible application in transforming one differentially private mechanism to another.
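To make the notion of reduction concrete, the description above can be summarized by the following display (an illustrative formalization based only on the abstract; the kernel $T$, the models $\{P_\theta\}$ and $\{Q_\theta\}$, and the tolerance $\epsilon$ are notation introduced here, not taken from the paper):
\[
\sup_{\theta \in \Theta} \, d_{\mathrm{TV}}\bigl( T \circ P_\theta, \; Q_\theta \bigr) \le \epsilon,
\]
where $T$ is a randomized map (a Markov kernel) applied to a sample from the source model $\{P_\theta\}$ without knowledge of $\theta$, and $\{Q_\theta\}$ is the target model sharing the same parameter.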
The staircase property: How hierarchical structure can guide deep learning
Abbe, Emmanuel, Boix-Adsera, Enric, Brennan, Matthew, Bresler, Guy, Nagaraj, Dheeraj
This paper identifies a structural property of data distributions that enables deep neural networks to learn hierarchically. We define the "staircase" property for functions over the Boolean hypercube, which posits that high-order Fourier coefficients are reachable from lower-order Fourier coefficients along increasing chains. We prove that functions satisfying this property can be learned in polynomial time using layerwise stochastic coordinate descent on regular neural networks -- a class of network architectures and initializations that have homogeneity properties. Our analysis shows that for such staircase functions and neural networks, the gradient-based algorithm learns high-level features by greedily combining lower-level features along the depth of the network. We further back our theoretical results with experiments showing that staircase functions are also learnable by more standard ResNet architectures with stochastic gradient descent. Both the theoretical and experimental results support the fact that staircase properties have a role to play in understanding the capabilities of gradient-based learning on regular networks, in contrast to general polynomial-size networks that can emulate any SQ or PAC algorithm, as recently shown.
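As an illustration of the hierarchical structure described above, the function below is a commonly cited example of the staircase property, in which each monomial extends the previous one by a single coordinate (a minimal sketch introduced here for illustration; the exact function class and training procedure in the paper may differ):

```python
# A minimal sketch (not the paper's code) of a canonical "staircase" function
# on the Boolean hypercube {-1, +1}^n: S_k(x) = x_1 + x_1 x_2 + ... + x_1 ... x_k,
# so higher-order Fourier coefficients sit on an increasing chain above lower-order ones.
import numpy as np

def staircase(x: np.ndarray, k: int) -> np.ndarray:
    """Evaluate S_k on a batch of points x in {-1, +1}^n, shape (batch, n)."""
    prefix_products = np.cumprod(x[:, :k], axis=1)  # x_1, x_1 x_2, ..., x_1...x_k
    return prefix_products.sum(axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, k, batch = 20, 5, 8
    x = rng.choice([-1.0, 1.0], size=(batch, n))
    print(staircase(x, k))
```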
The Algorithmic Phase Transition of Random $k$-SAT for Low Degree Polynomials
Bresler, Guy, Huang, Brice
Let $\Phi$ be a uniformly random $k$-SAT formula with $n$ variables and $m$ clauses. We study the algorithmic task of finding a satisfying assignment of $\Phi$. It is known that a satisfying assignment exists with high probability at clause density $m/n < 2^k \log 2 - \frac{1}{2} (\log 2 + 1) + o_k(1)$, while the best polynomial-time algorithm known, the Fix algorithm of Coja-Oghlan, finds a satisfying assignment at the much lower clause density $(1 - o_k(1)) 2^k \log k / k$. This prompts the question: is it possible to efficiently find a satisfying assignment at higher clause densities? To understand the algorithmic threshold of random $k$-SAT, we study low degree polynomial algorithms, which are a powerful class of algorithms including Fix, Survey Propagation guided decimation (with bounded or mildly growing number of message passing rounds), and paradigms such as message passing and local graph algorithms. We show that low degree polynomial algorithms can find a satisfying assignment at clause density $(1 - o_k(1)) 2^k \log k / k$, matching Fix, and not at clause density $(1 + o_k(1)) \kappa^* 2^k \log k / k$, where $\kappa^* \approx 4.911$. This shows the first sharp (up to constant factor) computational phase transition of random $k$-SAT for a class of algorithms. Our proof establishes and leverages a new many-way overlap gap property tailored to random $k$-SAT.
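For concreteness, the following sketch samples a random $k$-SAT formula at a given clause density $m/n$ and checks an assignment against it (illustrative Python, assuming the standard convention that each clause consists of $k$ distinct variables with independent random signs; the paper's model may differ in minor conventions):

```python
# A minimal sketch (illustrative, not from the paper) of sampling a uniformly
# random k-SAT formula Phi with n variables and m = alpha * n clauses.
import random

def random_ksat(n: int, k: int, alpha: float, seed: int = 0):
    rng = random.Random(seed)
    m = int(alpha * n)
    formula = []
    for _ in range(m):
        variables = rng.sample(range(1, n + 1), k)   # k distinct variables
        clause = tuple(v if rng.random() < 0.5 else -v for v in variables)
        formula.append(clause)
    return formula

def satisfies(assignment: dict, formula) -> bool:
    """assignment maps variable index -> bool; a clause needs one true literal."""
    return all(any((lit > 0) == assignment[abs(lit)] for lit in clause)
               for clause in formula)

if __name__ == "__main__":
    phi = random_ksat(n=100, k=5, alpha=10.0)
    guess = {v: True for v in range(1, 101)}
    print(len(phi), satisfies(guess, phi))
```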
Statistical Query Algorithms and Low-Degree Tests Are Almost Equivalent
Brennan, Matthew, Bresler, Guy, Hopkins, Samuel B., Li, Jerry, Schramm, Tselil
Researchers currently use a number of approaches to predict and substantiate information-computation gaps in high-dimensional statistical estimation problems. A prominent approach is to characterize the limits of restricted models of computation, which on the one hand yields strong computational lower bounds for powerful classes of algorithms and on the other hand helps guide the development of efficient algorithms. In this paper, we study two of the most popular restricted computational models, the statistical query framework and low-degree polynomials, in the context of high-dimensional hypothesis testing. Our main result is that under mild conditions on the testing problem, the two classes of algorithms are essentially equivalent in power. As corollaries, we obtain new statistical query lower bounds for sparse PCA, tensor PCA and several variants of the planted clique problem.
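For reference, the display below records the standard low-degree advantage used in this literature (notation introduced here for illustration; it is not quoted from the paper): for testing a planted distribution $\mathbb{P}$ against a null $\mathbb{Q}$ with likelihood ratio $L = d\mathbb{P}/d\mathbb{Q}$,
\[
\mathrm{Adv}_{\le D}(\mathbb{P}, \mathbb{Q}) \;=\; \max_{\substack{f \text{ polynomial} \\ \deg f \le D}} \frac{\mathbb{E}_{\mathbb{P}}[f(X)]}{\sqrt{\mathbb{E}_{\mathbb{Q}}[f(X)^2]}} \;=\; \bigl\| L^{\le D} \bigr\|_{\mathbb{Q}},
\]
where $L^{\le D}$ is the projection of $L$ onto polynomials of degree at most $D$; a bounded advantage as the dimension grows is taken as evidence that degree-$D$ tests fail.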
Learning Restricted Boltzmann Machines with Sparse Latent Variables
Bresler, Guy, Buhai, Rares-Darius
Restricted Boltzmann Machines (RBMs) are a common family of undirected graphical models with latent variables. An RBM is described by a bipartite graph, with all observed variables in one layer and all latent variables in the other. We consider the task of learning an RBM given samples generated according to it. The best algorithms for this task currently have time complexity $\tilde{O}(n^2)$ for ferromagnetic RBMs (i.e., with attractive potentials) but $\tilde{O}(n^d)$ for general RBMs, where $n$ is the number of observed variables and $d$ is the maximum degree of a latent variable. Let the MRF neighborhood of an observed variable be its neighborhood in the Markov Random Field of the marginal distribution of the observed variables. In this paper, we give an algorithm for learning general RBMs with time complexity $\tilde{O}(n^{2^s+1})$, where $s$ is the maximum number of latent variables connected to the MRF neighborhood of an observed variable. This is an improvement when $s < \log_2 (d-1)$, which corresponds to RBMs with sparse latent variables. Furthermore, we give a version of this learning algorithm that recovers a model with small prediction error and whose sample complexity is independent of the minimum potential in the Markov Random Field of the observed variables. This is of interest because the sample complexity of current algorithms scales with the inverse of the minimum potential, which cannot be controlled in terms of natural properties of the RBM.
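As background for the model being learned, the sketch below sets up a binary RBM over $\pm 1$ variables and runs block Gibbs sampling on its bipartite graph (a minimal illustration introduced here; it is not the learning algorithm of the paper, and the weight matrix and biases are arbitrary placeholders):

```python
# A minimal sketch of a binary RBM over {-1, +1} variables: observed layer x,
# latent layer h, bipartite weight matrix W, and block Gibbs sampling that
# alternates between the two layers.
import numpy as np

rng = np.random.default_rng(0)

def sample_layer(logits: np.ndarray) -> np.ndarray:
    """Sample +/-1 spins with P(+1) = sigmoid(2 * logits)."""
    p_plus = 1.0 / (1.0 + np.exp(-2.0 * logits))
    return np.where(rng.random(logits.shape) < p_plus, 1.0, -1.0)

def gibbs_chain(W, b_obs, b_lat, steps=1000):
    n, d = W.shape
    x = rng.choice([-1.0, 1.0], size=n)
    for _ in range(steps):
        h = sample_layer(W.T @ x + b_lat)   # latent given observed
        x = sample_layer(W @ h + b_obs)     # observed given latent
    return x

if __name__ == "__main__":
    n, d = 10, 3
    W = 0.2 * rng.standard_normal((n, d))
    print(gibbs_chain(W, np.zeros(n), np.zeros(d)))
```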
Reducibility and Statistical-Computational Gaps from Secret Leakage
Brennan, Matthew, Bresler, Guy
Inference problems with conjectured statistical-computational gaps are ubiquitous throughout modern statistics, computer science and statistical physics. While there has been success evidencing these gaps from the failure of restricted classes of algorithms, progress towards a more traditional reduction-based approach to computational complexity in statistical inference has been limited. Existing reductions have largely been limited to inference problems with similar structure -- primarily mapping among problems representable as a sparse submatrix signal plus a noise matrix, which are similar to the common hardness assumption of planted clique. The insight in this work is that a slight generalization of the planted clique conjecture -- secret leakage planted clique -- gives rise to a variety of new average-case reduction techniques, yielding a web of reductions among problems with very different structure. Using variants of the planted clique conjecture for specific forms of secret leakage planted clique, we deduce tight statistical-computational tradeoffs for a diverse range of problems including robust sparse mean estimation, mixtures of sparse linear regressions, robust sparse linear regression, tensor PCA, variants of dense $k$-block stochastic block models, negatively correlated sparse PCA, semirandom planted dense subgraph, detection in hidden partition models and a universality principle for learning sparse mixtures. In particular, a $k$-partite hypergraph variant of the planted clique conjecture is sufficient to establish all of our computational lower bounds. Our techniques also reveal novel connections to combinatorial designs and to random matrix theory. This work gives the first evidence that an expanded set of hardness assumptions, such as for secret leakage planted clique, may be a key first step towards a more complete theory of reductions among statistical problems.
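For concreteness, the sketch below generates an instance of the standard planted clique model underlying the conjecture discussed above (illustrative Python; the secret-leakage variants modify how information about the planted set is revealed and are not implemented here):

```python
# A minimal sketch of the planted clique model: an Erdos-Renyi graph G(n, 1/2)
# with a clique planted on k vertices chosen uniformly at random.
import numpy as np

def planted_clique(n: int, k: int, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    A = rng.integers(0, 2, size=(n, n))
    A = np.triu(A, 1)
    A = A + A.T                      # symmetric 0/1 adjacency matrix, zero diagonal
    clique = rng.choice(n, size=k, replace=False)
    A[np.ix_(clique, clique)] = 1    # plant the clique
    np.fill_diagonal(A, 0)
    return A

if __name__ == "__main__":
    A = planted_clique(n=200, k=20)
    print(A.shape, int(A.sum()) // 2)  # number of vertices and edges
```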
Least Squares Regression with Markovian Data: Fundamental Limits and Algorithms
Bresler, Guy, Jain, Prateek, Nagaraj, Dheeraj, Netrapalli, Praneeth, Wu, Xian
We study the problem of least squares linear regression where the data-points are dependent and are sampled from a Markov chain. We establish sharp information theoretic minimax lower bounds for this problem in terms of $\tau_{\mathsf{mix}}$, the mixing time of the underlying Markov chain, under different noise settings. Our results establish that in general, optimization with Markovian data is strictly harder than optimization with independent data and a trivial algorithm (SGD-DD) that works with only one in every $\tilde{\Theta}(\tau_{\mathsf{mix}})$ samples, which are approximately independent, is minimax optimal. In fact, it is strictly better than the popular Stochastic Gradient Descent (SGD) method with constant step-size which is otherwise minimax optimal in the regression with independent data setting. Beyond a worst case analysis, we investigate whether structured datasets seen in practice such as Gaussian auto-regressive dynamics can admit more efficient optimization schemes. Surprisingly, even in this specific and natural setting, Stochastic Gradient Descent (SGD) with constant step-size is still no better than SGD-DD. Instead, we propose an algorithm based on experience replay--a popular reinforcement learning technique--that achieves a significantly better error rate. Our improved rate serves as one of the first results where an algorithm outperforms SGD-DD on an interesting Markov chain and also provides one of the first theoretical analyses to support the use of experience replay in practice.
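To illustrate the SGD-DD baseline described above, the sketch below runs SGD on a Gaussian AR(1) regression stream while keeping only one sample in every $\tau$ steps (a minimal illustration with placeholder parameter values; it is not the paper's experimental setup, and the drop interval stands in for $\tilde{\Theta}(\tau_{\mathsf{mix}})$):

```python
# A minimal sketch of SGD with data drop ("SGD-DD"): when covariates come from
# a Markov chain, use only one sample in every tau steps so that the retained
# samples are approximately independent.
import numpy as np

rng = np.random.default_rng(0)

def ar1_regression_stream(w_star, T, rho=0.9, noise=0.1):
    """Yield (x_t, y_t) where x_t follows Gaussian AR(1) dynamics."""
    d = w_star.shape[0]
    x = rng.standard_normal(d)
    for _ in range(T):
        x = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(d)
        y = x @ w_star + noise * rng.standard_normal()
        yield x, y

def sgd_dd(stream, d, tau, step=0.05):
    w = np.zeros(d)
    for t, (x, y) in enumerate(stream):
        if t % tau:                        # drop all but one in every tau samples
            continue
        w -= step * (x @ w - y) * x        # squared-loss gradient step
    return w

if __name__ == "__main__":
    d, T, tau = 5, 20000, 20
    w_star = rng.standard_normal(d)
    w_hat = sgd_dd(ar1_regression_stream(w_star, T), d, tau)
    print(np.linalg.norm(w_hat - w_star))
```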
Sharp Representation Theorems for ReLU Networks with Precise Dependence on Depth
Bresler, Guy, Nagaraj, Dheeraj
We prove sharp dimension-free representation results for neural networks with $D$ ReLU layers under square loss for a class of functions $\mathcal{G}_D$ defined in the paper. These results capture the precise benefits of depth in the following sense: 1. The rates for representing the class of functions $\mathcal{G}_D$ via $D$ ReLU layers are sharp up to constants, as shown by matching lower bounds. 2. For each $D$, $\mathcal{G}_{D} \subseteq \mathcal{G}_{D+1}$ and as $D$ grows the class of functions $\mathcal{G}_{D}$ contains progressively less smooth functions. 3. If $D^{\prime} < D$, then the approximation rate for the class $\mathcal{G}_D$ achieved by depth $D^{\prime}$ networks is strictly worse than that achieved by depth $D$ networks. This constitutes a fine-grained characterization of the representation power of feedforward networks of arbitrary depth $D$ and number of neurons $N$, in contrast to existing representation results which either require $D$ growing quickly with $N$ or assume that the function being represented is highly smooth. In the latter case similar rates can be obtained with a single nonlinear layer. Our results confirm the prevailing hypothesis that deeper networks are better at representing less smooth functions, and indeed, the main technical novelty is to fully exploit the fact that deep networks can produce highly oscillatory functions with few activation functions.
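To illustrate the oscillation phenomenon mentioned in the last sentence, the sketch below composes a two-ReLU tent map with itself $D$ times, producing on the order of $2^D$ linear pieces on $[0,1]$ (a standard construction used here for illustration; it is not claimed to be the paper's construction):

```python
# A minimal sketch of how depth yields oscillation with few ReLU units:
# composing the tent map t(x) = 2*relu(x) - 4*relu(x - 0.5) with itself D times
# gives roughly 2^D linear pieces while using only two ReLU activations per layer.
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def tent(x):
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def deep_tent(x, depth):
    for _ in range(depth):
        x = tent(x)
    return x

if __name__ == "__main__":
    xs = np.linspace(0.0, 1.0, 9)
    for D in (1, 3, 5):
        print(D, np.round(deep_tent(xs, D), 3))
```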
A Latent Source Model for Online Collaborative Filtering
Bresler, Guy, Chen, George H., Shah, Devavrat
Despite the prevalence of collaborative filtering in recommendation systems, there has been little theoretical development on why and how well it works, especially in the "online" setting, where items are recommended to users over time. We address this theoretical gap by introducing a model for online recommendation systems, casting item recommendation under the model as a learning problem, and analyzing the performance of a cosine-similarity collaborative filtering method. In our model, each of $n$ users either likes or dislikes each of $m$ items. We assume there to be $k$ types of users, and all the users of a given type share a common string of probabilities determining the chance of liking each item. At each time step, we recommend an item to each user, where a key distinction from related bandit literature is that once a user consumes an item (e.g., watches a movie), then that item cannot be recommended to the same user again.
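For concreteness, the sketch below implements a cosine-similarity recommender that respects the no-repeat constraint described above (an illustrative simplification with made-up data; the paper's algorithm and analysis may differ in details):

```python
# A minimal sketch of cosine-similarity collaborative filtering with the
# "no repeat" constraint: recommend to each user the unconsumed item most
# liked by similar users.
import numpy as np

def cosine_similarity(R):
    """R: n_users x m_items with entries in {+1 (like), -1 (dislike), 0 (unseen)}."""
    norms = np.linalg.norm(R, axis=1, keepdims=True) + 1e-12
    Rn = R / norms
    return Rn @ Rn.T

def recommend(R, consumed):
    """Return one item index per user, never repeating a consumed item."""
    S = cosine_similarity(R)
    scores = S @ R                     # weighted votes from similar users
    scores[consumed] = -np.inf         # already-consumed items are ineligible
    return np.argmax(scores, axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, m = 6, 10
    R = rng.choice([-1, 0, 1], size=(n, m))
    consumed = R != 0                  # treat every rated item as consumed
    print(recommend(R, consumed))
```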
Structure learning of antiferromagnetic Ising models
Bresler, Guy, Gamarnik, David, Shah, Devavrat
In this paper we investigate the computational complexity of learning the graph structure underlying a discrete undirected graphical model from i.i.d. samples. Our first result is an unconditional computational lower bound of $\Omega(p^{d/2})$ for learning general graphical models on $p$ nodes of maximum degree $d$, for the class of statistical algorithms recently introduced by Feldman et al. The construction is related to the notoriously difficult learning parities with noise problem in computational learning theory. Our lower bound shows that the $\widetilde{O}(p^{d+2})$ runtime required by Bresler, Mossel, and Sly's exhaustive-search algorithm cannot be significantly improved without restricting the class of models. Aside from structural assumptions on the graph such as it being a tree, hypertree, tree-like, etc., most recent papers on structure learning assume that the model has the correlation decay property.
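To convey where the $p^{d}$-type runtime of exhaustive search comes from, the sketch below enumerates all candidate neighborhoods of size $d$ for a single node and scores each by a crude least-squares fit (purely illustrative; it is not the Bresler-Mossel-Sly algorithm or the algorithm of this paper, and the scoring rule is a placeholder):

```python
# A minimal sketch of why exhaustive neighborhood search costs roughly p^d per
# node: enumerate all candidate neighborhoods of size d and keep the one that
# best explains the node.  Uses i.i.d. random +/-1 spins just to exercise the code.
import itertools
import numpy as np

def best_neighborhood(samples: np.ndarray, node: int, d: int):
    """samples: (num_samples, p) array of +/-1 spins."""
    p = samples.shape[1]
    y = samples[:, node]
    best_score, best_set = np.inf, ()
    candidates = (c for c in itertools.combinations(range(p), d) if node not in c)
    for cand in candidates:                          # about p^d candidate sets
        X = samples[:, cand]
        coef, residual, _, _ = np.linalg.lstsq(X, y, rcond=None)
        score = residual[0] if residual.size else np.sum((X @ coef - y) ** 2)
        if score < best_score:
            best_score, best_set = score, cand
    return best_set

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    samples = rng.choice([-1.0, 1.0], size=(500, 8))
    print(best_neighborhood(samples, node=0, d=2))
```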