Joshi, Nirmit
A Theory of Learning with Autoregressive Chain of Thought
Joshi, Nirmit, Vardi, Gal, Block, Adam, Goel, Surbhi, Li, Zhiyuan, Misiakiewicz, Theodor, Srebro, Nathan
For a given base class of sequence-to-next-token generators, we consider learning prompt-to-answer mappings obtained by iterating a fixed, time-invariant generator for multiple steps, thus generating a chain-of-thought, and then taking the final token as the answer. We formalize the learning problems both when the chain-of-thought is observed and when training only on prompt-answer pairs, with the chain-of-thought latent. We analyze the sample and computational complexity both in terms of general properties of the base class (e.g. its VC dimension) and for specific base classes such as linear thresholds. We present a simple base class that allows for universal representability and computationally tractable chain-of-thought learning. Central to our development is that time invariance allows for sample complexity that is independent of the length of the chain-of-thought. Attention arises naturally in our construction.
On the Complexity of Learning Sparse Functions with Statistical and Gradient Queries
Joshi, Nirmit, Misiakiewicz, Theodor, Srebro, Nathan
The goal of this paper is to investigate the complexity of gradient algorithms when learning sparse functions (juntas). We introduce a type of Statistical Queries ($\mathsf{SQ}$), which we call Differentiable Learning Queries ($\mathsf{DLQ}$), to model gradient queries on a specified loss with respect to an arbitrary model. We provide a tight characterization of the query complexity of $\mathsf{DLQ}$ for learning the support of a sparse function over generic product distributions. This complexity crucially depends on the loss function. For the squared loss, $\mathsf{DLQ}$ matches the complexity of Correlation Statistical Queries $(\mathsf{CSQ})$--potentially much worse than $\mathsf{SQ}$. But for other simple loss functions, including the $\ell_1$ loss, $\mathsf{DLQ}$ always achieves the same complexity as $\mathsf{SQ}$. We also provide evidence that $\mathsf{DLQ}$ can indeed capture learning with (stochastic) gradient descent by showing it correctly describes the complexity of learning with a two-layer neural network in the mean field regime and linear scaling.
Exact Community Recovery (under Side Information): Optimality of Spectral Algorithms
Gaudio, Julia, Joshi, Nirmit
In this paper, we study the problem of exact community recovery in general, two-community block models considering both Bernoulli and Gaussian matrix models, capturing the Stochastic Block Model, submatrix localization, and $\mathbb{Z}_2$-synchronization as special cases. We also study the settings where $side$ $information$ about community assignment labels is available, modeled as passing the true labels through a noisy channel: either the binary erasure channel (where some community labels are known while others are erased) or the binary symmetric channel (where some labels are flipped). We provide a unified analysis of the effect of side information on the information-theoretic limits of exact recovery, generalizing prior works and extending to new settings. Additionally, we design a simple but optimal spectral algorithm that incorporates side information (when present) along with the eigenvectors of the matrix observation. Using the powerful tool of entrywise eigenvector analysis [Abbe, Fan, Wang, Zhong 2020], we show that our spectral algorithm can mimic the so called $genie$-$aided$ $estimators$, where the $i^{\mathrm{th}}$ genie-aided estimator optimally computes the estimate of the $i^{\mathrm{th}}$ label, when all remaining labels are revealed by a genie. This perspective provides a unified understanding of the optimality of spectral algorithms for various exact recovery problems in a recent line of work.
The Limits and Potentials of Local SGD for Distributed Heterogeneous Learning with Intermittent Communication
Patel, Kumar Kshitij, Glasgow, Margalit, Zindari, Ali, Wang, Lingxiao, Stich, Sebastian U., Cheng, Ziheng, Joshi, Nirmit, Srebro, Nathan
Local SGD is a popular optimization method in distributed learning, often outperforming other algorithms in practice, including mini-batch SGD. Despite this success, theoretically proving the dominance of local SGD in settings with reasonable data heterogeneity has been difficult, creating a significant gap between theory and practice. In this paper, we provide new lower bounds for local SGD under existing first-order data heterogeneity assumptions, showing that these assumptions are insufficient to prove the effectiveness of local update steps. Furthermore, under these same assumptions, we demonstrate the min-max optimality of accelerated mini-batch SGD, which fully resolves our understanding of distributed optimization for several problem classes. Our results emphasize the need for better models of data heterogeneity to understand the effectiveness of local SGD in practice. Towards this end, we consider higher-order smoothness and heterogeneity assumptions, providing new upper bounds that imply the dominance of local SGD over mini-batch SGD when data heterogeneity is low.
Community Detection in the Hypergraph SBM: Exact Recovery Given the Similarity Matrix
Gaudio, Julia, Joshi, Nirmit
Community detection is a fundamental problem in network science. In this paper, we consider community detection in hypergraphs drawn from the $hypergraph$ $stochastic$ $block$ $model$ (HSBM), with a focus on exact community recovery. We study the performance of polynomial-time algorithms which operate on the $similarity$ $matrix$ $W$, where $W_{ij}$ reports the number of hyperedges containing both $i$ and $j$. Under this information model, while the precise information-theoretic limit is unknown, Kim, Bandeira, and Goemans derived a sharp threshold up to which the natural min-bisection estimator on $W$ succeeds. As min-bisection is NP-hard in the worst case, they additionally proposed a semidefinite programming (SDP) relaxation and conjectured that it achieves the same recovery threshold as the min-bisection estimator. In this paper, we confirm this conjecture. We also design a simple and highly efficient spectral algorithm with nearly linear runtime and show that it achieves the min-bisection threshold. Moreover, the spectral algorithm also succeeds in denser regimes and is considerably more efficient than previous approaches, establishing it as the method of choice. Our analysis of the spectral algorithm crucially relies on strong $entrywise$ bounds on the eigenvectors of $W$. Our bounds are inspired by the work of Abbe, Fan, Wang, and Zhong, who developed entrywise bounds for eigenvectors of symmetric matrices with independent entries. Despite the complex dependency structure in similarity matrices, we prove similar entrywise guarantees.
Noisy Interpolation Learning with Shallow Univariate ReLU Networks
Joshi, Nirmit, Vardi, Gal, Srebro, Nathan
A recent realization is that although sometimes overfitting can be catastrophic, as suggested by our classic learning theory understanding, in other models overfitting, and even interpolation learning, i.e. insisting on zero training error of noisy data, might not be so catastrophic, allowing for good generalization (low test error) and even consistency [Zhang et al., 2017, Belkin et al., 2018]. This has led to efforts towards understanding the nature of overfitting: how benign or catastrophic it is, and what determines this behavior, in different settings and using different models.