Goto

Collaborating Authors

 estimation error


Diffusion Models Are Statistically Optimal for Learning Low-Dimensional Multi-Modal Distributions

arXiv.org Machine Learning

Score-based diffusion models have demonstrated remarkable empirical success in learning high-dimensional distributions, particularly those exhibiting low-dimensional and multi-modal structures. However, theoretical understanding of their statistical efficiency remains limited. Existing theories typically rely on strong regularity assumptions, such as uniformly bounded densities or globally smooth score functions, which fail to capture such intrinsic structures. In this work, we study the sample complexity of diffusion models for learning distributions supported on a union of low-dimensional subspaces. Assuming that the data distribution within each subspace is subgaussian, we show that diffusion models require at most $\widetilde{O}(\varepsilon^{-k \vee 2})$ samples to achieve $\varepsilon$ error in 1-Wasserstein distance, where $k$ is the intrinsic dimension. This near-optimal convergence rate depends only on the intrinsic dimension and significantly improves upon prior theoretical guarantees that suffer from the curse of dimensionality. Notably, our analysis applies to a broad collection of distributions without imposing smoothness, bounded-density, or log-concavity assumptions. Overall, our results show that diffusion models can statistically adapt to intrinsic low-dimensional structure while naturally accommodating multi-modal data, offering a rigorous theoretical justification for their success in complex high-dimensional learning tasks.


Learning Nonlinear Factor Models with Unknown Monotone Links from Incomplete and Noisy Data

arXiv.org Machine Learning

We study a nonlinear factor model in which observed responses depend on low-rank latent factors through an unknown monotone link function. This setting is challenging and largely underexplored due to severe nonconvexity and identifiability issues. The link function is assumed to lie in a reproducing kernel Hilbert space (RKHS), enabling flexible nonparametric modeling while preserving identifiability. We formulate the problem as the joint recovery of the low-rank factors, loadings, and the nonlinear link function from possibly incomplete and noisy observations and propose a projected block coordinate descent (BCD) algorithm with explicit regularization to address scale and rotational ambiguities. Under mild incoherence of factors and standard sampling conditions, we establish convergence guarantees in both noiseless and noisy regimes, along with sublinear regret bounds for the link-function updates. Our results extend classical linear factor models to a broad nonlinear regime and provide a principled framework for learning nonlinear latent structures. We evaluate the proposed approach using controlled synthetic experiments, indicating promising performance.


Fast Convergence of Policy Regret in Learning Stochastic Optimal Control

arXiv.org Machine Learning

Policy learning in modern operations environments faces a fundamental tension between limited operational data and the large, often continuous, state and action spaces over which good decisions must be identified and deployed. We study value-based policy learning in stochastic optimal control: a greedy policy induced by an estimate of the optimal action-value function $Q^*$ is deployed, and its performance is measured by regret. The empirical success of this approach calls for statistical insight into the structures that enable fast regret convergence. We show that, in continuous action spaces, fast policy learning is induced by three geometric structures: a growth exponent $p$, which quantifies how quickly $Q^*$ separates suboptimal actions from its maximizers; a margin-mass exponent $m$, which controls how much deployment mass lies on states with weak growth; and an action-wise regularity exponent $q$, which measures the smoothness of the $Q^*$-estimation error across actions. Given a $n^{-1/2}$-accurate estimator of $Q^*$, we show that the minimax-optimal policy regret convergence rate is \[ \widetildeฮ˜\left( n^{-\min\left\{\frac{p}{2(p-q)},\frac{m+1}{2m}\right\}} \right), \] up to a logarithmic factor at the boundary between the two regimes. The exponent $q$ is crucial: $q>0$ yields faster-than-$n^{-1/2}$ regret. This regime is natural in operations applications. In particular, we verify $q>0$ under mild regularity conditions in dynamic inventory control and service allocation examples, while the mechanism underlying this fast rate regime extends beyond these settings.


Factor Augmented High-Dimensional SGD

arXiv.org Machine Learning

Stochastic gradient descent (SGD) has been a cornerstone of machine learning since the pioneering work of Robbins & Monro (1951). Beyond its algorithmic simplicity and scalability, SGD has also become a central object of theoretical study, with refined analyses linking its dynamics to implicit regularization, generalization performance, and algorithmic stability. For decades, theoretical analyses of SGD have largely resided within the realm of classical stochastic approximation (Polyak & Juditsky, 1992; Lai, 2003; Bottou et al., 2018), where the data dimension is considered fixed while the sample size tends to infinity. While this regime has yielded foundational insights, it no longer fully reflects the characteristics of modern learning systems. Contemporary applications often operate in regimes where data dimension, sample size, and model complexity grow together, calling for new theoretical tools and perspectives that go beyond traditional asymptotic analyses. In this study, we focus on the learning tasks involving high-dimensional predictors. When SGD is applied directly to such data, the dimensionality of the feature space propagates into the optimization process, resulting in a highdimensional (HD) parameter space. Algorithmically, one trending strategy is to approximate the gradient updates using a low-rank representation to reduce memory costs and accelerate computation (Wang et al., 2018; Vogels et al., 2019; Kozak et al., 2019; Kasiviswanathan, 2021; Zhao et al., 2024). Theoretically, despite the vast literature on SGD, convergence guarantees of HD-SGD remain limited (Garrigos & Gower, 2023; Li et al., 2025).


Sample efficient inductive matrix completion with noise and inexact side information

arXiv.org Machine Learning

Low-rank matrix completion is a widely studied problem with many variants. Inductive matrix completion (IMC) incorporates row and column side information to significantly narrow the search space. Prior work falls into two regimes: methods that exploit this structure to achieve reduced sample complexity but only in noiseless settings, and methods that handle noise but require sample complexity matching the ambient matrix dimension, forfeiting the sample efficiency that side information should provide. In this paper, we close this gap by studying noisy IMC with a nonconvex projected gradient descent algorithm with spectral initialization. Our main technical contribution is establishing a regularity condition for the IMC loss function that holds at the reduced sample complexity determined by the effective problem size, scaling with the side information dimension a rather than the ambient dimension n. This directly yields linear convergence and an estimation error that both depend only on the effective problem size rather than the ambient matrix dimension. We further extend our analysis to the inexact side information setting, demonstrating that the reduced sample complexity is maintained and the estimation error is order-optimal with respect to the inexactness of the side information. Extensive simulations and real-world experiments on the MovieLens dataset validate our theoretical findings.


Robust Tensor Regression with Nonconvexity: Algorithmic and Statistical Theory

arXiv.org Machine Learning

Tensor regression is an important tool for tensor data analysis, but existing works have not considered the impact of outliers, making them potentially sensitive to such data points. This paper proposes a low tubal rank robust regression method for analyzing high-dimensional tensor data with heavy-tailed random noise. The proposed method is based on a nonconvex relaxation of the tensor tubal rank within a general optimization framework, which allows for nonconvexity in both the loss and penalty functions. We develop an implementable estimation algorithm and establish its global convergence under some mild assumptions. Furthermore, we provide general statistical theories regarding stationary point, including the rates of convergence and bounds on the prediction error. These theoretical results cover many important models, such as linear models, generalized linear models, and Huber regression, and even encompass some nonconvex losses like correntropy and minimum distance criterion-induced losses. Supportive numerical evidence is provided through simulations and application studies.


High Dimensional Structured Superposition Models

Neural Information Processing Systems

High dimensional superposition models characterize observations using parameters which can be written as a sum of multiple component parameters, each with its own structure, e.g., sum of low rank and sparse matrices, sum of sparse and rotated sparse vectors, etc. In this paper, we consider general superposition models which allow sum of any number of component parameters, and each component structure can be characterized by any norm. We present a simple estimator for such models, give a geometric condition under which the components can be accurately estimated, characterize sample complexity of the estimator, and give high probability nonasymptotic bounds on the componentwise estimation error. We use tools from empirical processes and generic chaining for the statistical analysis, and our results, which substantially generalize prior work on superposition models, are in terms of Gaussian widths of suitable sets.


AUnified Model and Dimension for Interactive Estimation

Neural Information Processing Systems

We study an abstract framework for interactive learning called interactive estimation in which the goal is to estimate a target from its "similarity" to points queried by the learner. We introduce a combinatorial measure called dissimilarity dimension which is used to derive learnability bounds in our model. We present a simple, general, and broadly-applicable algorithm, for which we obtain both regret and PAC generalization bounds that are polynomial in the new dimension. We show that our framework subsumes and thereby unifies two classic learning models: statistical-query learning and structured bandits. We also delineate how the dissimilarity dimension is related to well-known parameters for both frameworks, in some cases yielding significantly improved analyses.


Mixed Membership sub-Gaussian Models

arXiv.org Machine Learning

The Gaussian mixture model is widely used in unsupervised learning, owing to its simplicity and interpretability. However, a fundamental limitation of the classical Gaussian mixture model is that it forces each observation to belong to exactly one component. In many practical applications, such as genetics, social network analysis, and text mining, an observation may naturally belong to multiple components or exhibit partial membership in several latent components. To overcome this limitation, we propose the mixed membership sub-Gaussian model, which extends the classical Gaussian mixture framework by allowing each observation to belong to multiple components. This model inherits the interpretability of the classical Gaussian mixture model while offering greater flexibility for capturing complex overlapping structures. We develop an efficient spectral algorithm to estimate the mixed membership of each individual observation, and under mild separation conditions on the component centres, we prove that the estimation error of the per-individual membership vector can be made arbitrarily small with high probability. To our knowledge, this is the first work to provide a computationally efficient estimator with such a vanishing-error guarantee for a mixed-membership extension of the Gaussian mixture model. Extensive experimental studies demonstrate that our method outperforms existing approaches that ignore mixed memberships.