dual certificate
Gaussian Mixture Model with unknown diagonal covariances via continuous sparse regularization
Giard, Romane, de Castro, Yohann, Marteau, Clément
This paper addresses the statistical estimation of Gaussian Mixture Models (GMMs) with unknown diagonal covariances from independent and identically distributed samples. We employ the Beurling-LASSO (BLASSO), a convex optimization framework that promotes sparsity in the space of measures, to simultaneously estimate the number of components and their parameters. Our main contribution extends the BLASSO methodology to multivariate GMMs with component-specific unknown diagonal covariance matrices, a significantly more flexible setting than previous approaches requiring known and identical covariances. We establish non-asymptotic recovery guarantees with nearly parametric convergence rates for component means, diagonal covariances, and weights, as well as for density prediction. A key theoretical contribution is the identification of an explicit separation condition on mixture components that enables the construction of non-degenerate dual certificates, which are essential tools for establishing statistical guarantees for the BLASSO. Our analysis leverages the Fisher-Rao geometry of the statistical model and introduces a novel semi-distance adapted to our framework, providing new insights into the interplay between component separation, parameter space geometry, and achievable statistical recovery.
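To make the model in this abstract concrete, the following is a minimal sketch of the density of a Gaussian mixture whose components each carry their own diagonal covariance. The function names and the example parameters are illustrative assumptions, not taken from the paper, and the sketch only evaluates the density; it does not implement the BLASSO estimator.

```python
import math


def diag_gaussian_pdf(x, mean, var):
    """Density of a Gaussian with diagonal covariance diag(var) at point x."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        # Each coordinate contributes an independent 1-D Gaussian factor.
        log_p += -0.5 * math.log(2.0 * math.pi * vi) - 0.5 * (xi - mi) ** 2 / vi
    return math.exp(log_p)


def mixture_pdf(x, weights, means, variances):
    """Weighted sum of diagonal-covariance Gaussian densities."""
    return sum(
        w * diag_gaussian_pdf(x, m, v)
        for w, m, v in zip(weights, means, variances)
    )


# Hypothetical two-component mixture in dimension 2 (parameters are made up).
weights = (0.3, 0.7)
means = ((0.0, 0.0), (3.0, -1.0))
variances = ((1.0, 1.0), (0.25, 4.0))  # component-specific diagonal entries

density_at_origin = mixture_pdf((0.0, 0.0), weights, means, variances)
```

Each component is parameterized by a mean and a vector of per-coordinate variances, which is the setting the abstract describes as more flexible than a known covariance shared by all components.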
Efficient Online Large-Margin Classification via Dual Certificates
Ho-Nguyen, Nam, Kılınç-Karzan, Fatma, Nguyen, Ellie, Shen, Lingqing
Online classification is a central problem in optimization, statistical learning and data science. Classical algorithms such as the perceptron offer efficient updates and finite mistake guarantees on linearly separable data, but they do not exploit the underlying geometric structure of the classification problem. We study the offline maximum margin problem through its dual formulation and use the resulting geometric insights to design a principled and efficient algorithm for the online setting. A key feature of our method is its translation invariance, inherited from the offline formulation, which plays a central role in its performance analysis. Our theoretical analysis yields improved mistake and margin bounds that depend only on translation-invariant quantities, offering stronger guarantees than existing algorithms under the same assumptions in favorable settings. In particular, we identify a parameter regime where our algorithm makes at most two mistakes per sequence, whereas the perceptron can be forced to make arbitrarily many mistakes. Our numerical study on real data further demonstrates that our method matches the computational efficiency of existing online algorithms, while significantly outperforming them in accuracy.
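As background for this abstract, here is a minimal sketch of the classical perceptron baseline it refers to, not the algorithm proposed in the paper. It illustrates, on a made-up four-point dataset, that the perceptron's mistake count changes when the data are translated, which is the lack of translation invariance the abstract contrasts against. All function names, the dataset, and the offset are assumptions chosen for illustration.

```python
def perceptron(samples, max_passes=1000):
    """Classical mistake-driven perceptron through the origin.

    samples: list of (features, label) pairs with label in {-1, +1}.
    Returns the final weight vector and the total number of mistakes.
    """
    w = [0.0] * len(samples[0][0])
    mistakes = 0
    for _ in range(max_passes):
        clean_pass = True
        for x, y in samples:
            score = sum(wi * xi for wi, xi in zip(w, x))
            if y * score <= 0:  # misclassified, or exactly on the boundary
                w = [wi + y * xi for wi, xi in zip(w, x)]
                mistakes += 1
                clean_pass = False
        if clean_pass:  # a full pass without mistakes: stop
            break
    return w, mistakes


def with_bias(samples):
    """Append a constant feature so the separator need not pass through 0."""
    return [(x + (1.0,), y) for x, y in samples]


def shift(samples, offset):
    """Translate every feature vector by the same offset."""
    return [(tuple(xi + oi for xi, oi in zip(x, offset)), y) for x, y in samples]


def all_correct(w, samples):
    """True if w classifies every sample with a strictly positive margin."""
    return all(y * sum(wi * xi for wi, xi in zip(w, x)) > 0 for x, y in samples)


# Hypothetical linearly separable data, symmetric around the origin.
centered = [((2.0, 1.0), 1), ((1.0, 2.0), 1), ((-1.0, -2.0), -1), ((-2.0, -1.0), -1)]
shifted = shift(centered, (3.0, 3.0))  # same geometry, translated

w_c, mistakes_centered = perceptron(with_bias(centered))
w_s, mistakes_shifted = perceptron(with_bias(shifted))
print(mistakes_centered, mistakes_shifted)
```

The two datasets have the same shape and the same margin between classes, yet the translated copy costs the perceptron more mistakes, because its update rule depends on the norms of the points rather than on translation-invariant quantities.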
Effective regions and kernels in continuous sparse regularisation, with application to sketched mixtures
De Castro, Yohann, Gribonval, Rémi, Jouvin, Nicolas
The Beurling-LASSO (BLASSO) is a TV-regularized convex program on the space of measures that recovers a sparse measure from a noisy observation made through an appropriate measurement operator. While previous works have uncovered the central role played by this operator and its associated kernel in obtaining estimation error bounds, these bounds require a technical local positive curvature (LPC) assumption that must be verified on a case-by-case basis. In practice, this condition has been proved for only a few kernels. At the heart of our contribution lies the kernel switch, which uncouples the model kernel from the LPC assumption: it makes it possible to leverage any known LPC-kernel as a pivot kernel to prove error bounds, provided embedding conditions are verified between the model and pivot RKHS. We extend the list of LPC-kernels by proving that the "sinc-4" kernel, used for signal recovery and mixture problems, satisfies the LPC assumption. Furthermore, we show that the BLASSO localisation error around the true support decreases with the noise level, leading to effective near regions. This improves on known results where this error is fixed by parameters depending on the model kernel. We illustrate the interest of our results in the case of translation-invariant mixture model estimation, using bandlimiting smoothing and sketching techniques to reduce the computational burden of BLASSO.
How Does Gradient Descent Learn Features -- A Local Analysis for Regularized Two-Layer Neural Networks
Feature learning has long been considered to be a major advantage of neural networks. However, how gradient-based training algorithms can learn useful features is not well understood. In particular, the most widely applied analysis for overparametrized neural networks is the neural tangent kernel (NTK) (Jacot et al., 2018; Du et al., 2019; Allen-Zhu et al., 2019b). In this setting, the neurons do not move far from their initialization and the features are determined by the network architecture and random initialization (Chizat et al., 2019). While there is empirical and theoretical evidence of the limitations of the NTK regime (Chizat et al., 2019; Arora et al., 2019), extending the analysis beyond the NTK regime has been challenging. For 2-layer networks, an alternative framework for analyzing overparametrized neural networks, called mean-field analysis, was introduced. Earlier mean-field analyses (e.g., Chizat and Bach, 2018; Mei et al., 2018) require either infinitely or exponentially many neurons. Later works (e.g., Li et al., 2020; Ge et al., 2021; Bietti et al., 2022; Mahankali et al., 2024) can analyze the training dynamics of mildly overparametrized networks with polynomially many neurons under stronger assumptions on the ground-truth function.
How robust is randomized blind deconvolution via nuclear norm minimization against adversarial noise?
Kostin, Julia, Krahmer, Felix, Stöger, Dominik
In this paper, we study the problem of recovering two unknown signals from their convolution, which is commonly referred to as blind deconvolution. Reformulation of blind deconvolution as a low-rank recovery problem has led to multiple theoretical recovery guarantees in the past decade due to the success of the nuclear norm minimization heuristic. In particular, in the absence of noise, exact recovery has been established for sufficiently incoherent signals contained in lower-dimensional subspaces. However, if the convolution is corrupted by additive bounded noise, the stability of the recovery problem remains much less understood. In particular, existing reconstruction bounds involve large dimension factors and therefore fail to explain the empirical evidence for dimension-independent robustness of nuclear norm minimization. Recently, theoretical evidence has emerged for ill-posed behavior of low-rank matrix recovery for sufficiently small noise levels. In this work, we develop improved recovery guarantees for blind deconvolution with adversarial noise which exhibit square-root scaling in the noise level. Hence, our results are consistent with existing counterexamples which speak against linear scaling in the noise level as demonstrated for related low-rank matrix recovery problems.
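The low-rank reformulation mentioned in this abstract can be written schematically as follows. The symbols ($B$, $C$, $h$, $m$, $\mathcal{A}$, $e$, $\tau$) are assumed notation introduced here for illustration and may differ from the paper's; the block states only the standard lifting and the noise-constrained nuclear norm program, not the paper's specific error bound.

```latex
% Assumed notation: the two unknown signals w = Bh and x = Cm lie in known
% lower-dimensional subspaces; e is additive noise bounded by \tau.
% Lifting: the convolution is linear in the rank-one matrix X_0 = h m^{*}.
y = \mathcal{A}(h m^{*}) + e, \qquad \|e\|_2 \le \tau,

% Nuclear norm minimization as a convex surrogate for rank minimization.
\hat{X} \in \operatorname*{arg\,min}_{X} \; \|X\|_{*}
\quad \text{subject to} \quad \|\mathcal{A}(X) - y\|_2 \le \tau .
```

In this notation, the abstract's "square-root scaling in the noise level" refers to how the reconstruction error of $\hat{X}$ depends on $\tau$.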
Exact nuclear norm, completion and decomposition for random overcomplete tensors via degree-4 SOS
Kivva, Bohdan, Potechin, Aaron
In this paper we show that simple semidefinite programs inspired by degree $4$ SOS can exactly solve the tensor nuclear norm, tensor decomposition, and tensor completion problems on tensors with random asymmetric components. More precisely, for tensor nuclear norm and tensor decomposition, we show that w.h.p. these semidefinite programs can exactly find the nuclear norm and components of an $(n\times n\times n)$-tensor $\mathcal{T}$ with $m\leq n^{3/2}/\mathrm{polylog}(n)$ random asymmetric components. For tensor completion, we show that w.h.p. the semidefinite program introduced by Potechin \& Steurer (2017) can exactly recover an $(n\times n\times n)$-tensor $\mathcal{T}$ with $m$ random asymmetric components from only $n^{3/2}m\, \mathrm{polylog}(n)$ randomly observed entries. This gives the first theoretical guarantees for exact tensor completion in the overcomplete regime. This matches the best known results for approximate versions of these problems given by Barak \& Moitra (2015) for tensor completion, and Ma, Shi \& Steurer (2016) for tensor decomposition.
Sparse Regularization for Mixture Problems
de Castro, Yohann, Gadat, Sébastien, Marteau, Clément, Maugis-Rabusseau, Cathy
This paper investigates the statistical estimation of a discrete mixing measure $\mu^0$ involved in a kernel mixture model. Using some recent advances in $\ell_1$-regularization over the space of measures, we introduce a "data fitting + regularization" convex program for estimating $\mu^0$ in a grid-less manner; this method is referred to as Beurling-LASSO. Our contribution is two-fold: we derive a lower bound on the bandwidth of our data fitting term depending only on the support of $\mu^0$ and its so-called "minimum separation" to ensure quantitative support localization error bounds; and under a so-called "non-degenerate source condition" we derive a non-asymptotic support stability property. The latter shows that for sufficiently large sample size $n$, our estimator has exactly as many weighted Dirac masses as the target $\mu^0$, converging in amplitude and localization towards the true ones. The statistical performance of this estimator is investigated by designing a so-called "dual certificate" appropriate to our setting. Some classical situations, such as Gaussian or ordinary smooth mixtures (e.g., Laplace distributions), are discussed at the end of the paper. We stress in particular that our method is completely adaptive w.r.t. the number of components involved in the mixture.
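The "data fitting + regularization" program and the notion of dual certificate in this abstract can be sketched in generic form as follows. The symbols $\Phi$, $\hat{y}_n$, $\mathbb{H}$, $\kappa$, $a_k$, $t_k$ and $\eta$ are assumed notation for illustration; the paper's exact data-fitting term and function spaces may differ.

```latex
% Assumed notation: \Phi maps a measure \mu to the mixture it generates,
% \hat{y}_n is the empirical data-fitting target built from n samples,
% \kappa > 0 is the regularization parameter.
\hat{\mu}_n \in \operatorname*{arg\,min}_{\mu \in \mathcal{M}(\mathcal{X})}
  \; \frac{1}{2}\,\bigl\| \hat{y}_n - \Phi\mu \bigr\|_{\mathbb{H}}^{2}
  \; + \; \kappa\, \|\mu\|_{\mathrm{TV}},
\qquad
\mu^0 = \sum_{k=1}^{K} a_k\, \delta_{t_k}, \quad a_k > 0.

% Dual certificate for \mu^0: a function in the range of the adjoint \Phi^{*}
% that interpolates the signs of the weights (here +1) on the support and
% stays strictly below 1 in absolute value elsewhere.
\eta = \Phi^{*} p, \qquad
\eta(t_k) = 1 \ \ (k = 1,\dots,K), \qquad
|\eta(t)| < 1 \ \ \text{for } t \notin \{t_1,\dots,t_K\}.
```

Non-degeneracy is usually understood as an additional strict curvature requirement on $\eta$ at each $t_k$, which is what allows the number of recovered Dirac masses to match $K$ once the sample size is large enough.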
A Dictionary Based Generalization of Robust PCA
Rambhatla, Sirisha, Li, Xingguo, Haupt, Jarvis
ABSTRACT We analyze the decomposition of a data matrix, assumed to be a superposition of a low-rank component and a component which is sparse in a known dictionary, using a convex demixing method. We provide a unified analysis, encompassing both undercomplete and overcomplete dictionary cases, and show that the constituent components can be successfully recovered under some relatively mild assumptions up to a certain global sparsity level. Further, we corroborate our theoretical results by presenting empirical evaluations in terms of phase transitions in rank and sparsity for various dictionary sizes. Index Terms: Low-rank, dictionary sparse, Robust PCA. 1. INTRODUCTION Exploiting the inherent structure of data for the recovery of relevant information is at the heart of data analysis. R. A wide range of problems can be expressed in the form described above. Perhaps the most celebrated of these is principal component analysis (PCA) [1], which can be viewed as a special case of eq. (1), with the matrix X, the problem reduces to that of sparse recovery [2-4]; see [5] and references therein for an overview of related works.