 neighborhood


The General Theory of Localization Methods

arXiv.org Machine Learning

This paper proposes a general machine learning framework called the localization method, which is fundamentally built on two core concepts: localization kernels and local means -- key components that underpin the self-attention mechanism. To establish a rigorous theoretical foundation, the framework is formally defined through two essential pillars: the formulation of the local(-ized) model and the localization trick. We systematically investigate the connections between the localization method and a wide range of existing machine learning models/methods, including (but not limited to) kernel methods, lazy learning, the MeanShift algorithm, relaxation labeling, Hopfield networks, local linear embedding (LLE), fuzzy inference, and denoising autoencoders (DAEs). By dissecting these relationships, we clarify the broader theoretical significance of the localization method and demonstrate its practical applicability across diverse machine learning tasks. Furthermore, we explore advanced extensions of the framework, such as adaptive kernels, hierarchical local models, and non-local models. Notably, we show that the Transformer -- a cornerstone of modern sequence modeling -- can be constructed using hierarchical local models, revealing the ability of the localization method to unify and generalize state-of-the-art architectures. This work not only provides a unified theoretical lens to reinterpret existing models but also offers new methodological tools for designing flexible, data-adaptive learning systems.
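To make the two core concepts concrete, here is a minimal sketch of a local mean computed with a Gaussian localization kernel, normalized the way attention weights are. The function name `local_mean`, the scalar inputs, and the `bandwidth` parameter are illustrative assumptions, not the paper's notation.

```python
import math

def local_mean(query, points, values, bandwidth=1.0):
    """Kernel-weighted local mean of `values` around `query` (scalar inputs)."""
    # Gaussian localization kernel: the log-weight decays with squared distance.
    logits = [-((query - p) ** 2) / (2.0 * bandwidth ** 2) for p in points]
    # Normalize with a numerically stable softmax, as self-attention does.
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total
```

With a small `bandwidth` the result approaches the value at the nearest point; with a large one it approaches the global mean of `values`.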


On Stability and Decomposition of Sample Quantiles under Heavy-Tailed Distributions

arXiv.org Machine Learning

We study sample quantiles of distributions indexed by estimated parameters, with a focus on Value-at-Risk related to linear projections of financial returns whose underlying probability law is heavy-tailed. In this setting, the projection direction and the empirical quantile threshold are estimated from the data, so the standard Bahadur representation under a fixed distribution does not separate the distinct sources of instability. A canonical starting point is Bahadur's representation, which expresses the sample quantile through the empirical distribution function plus a remainder term \cite{bahadur1966}. Empirical-process theory provides a usable scaffolding through the mechanics of half-spaces, symmetric differences, and Glivenko--Cantelli uniform convergence. These tools yield stability bounds, but absorb changes in projection direction and changes in quantile threshold into a single symmetric-difference measure. Interestingly, a global uniform-convergence requirement is imposed on what is intrinsically a local quantile-stability problem. This paper introduces a Q-Q orthogonality formulation for separating projection-direction and quantile-threshold effects. The object of interest is the difference between the empirical quantile computed using the estimated projection direction and the population quantile computed at the reference projection direction. We decompose this difference into three terms, $\hat q_\alpha(\hat w)-q_\alpha(w_0)=D_1+D_2+D_3$. Here, $D_1$ measures the population quantile movement induced by perturbing the projection direction, $D_2$ measures the empirical quantile fluctuation with the projection direction held fixed, and $D_3$ is the Bahadur-type remainder.
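As a rough illustration of the object being decomposed, the sketch below computes the empirical quantile $\hat q_\alpha(w)$ of returns projected onto a direction $w$. The function name `projected_quantile` and the choice of the left-continuous inverse of the empirical CDF are assumptions made for this example; the paper may use a different quantile convention.

```python
import math

def projected_quantile(returns, w, alpha):
    """Empirical alpha-quantile of the projections <w, r> over the rows r."""
    proj = sorted(sum(wi * ri for wi, ri in zip(w, r)) for r in returns)
    # Left-continuous inverse of the empirical CDF: the smallest order
    # statistic x_(k) such that k / n >= alpha.
    k = max(1, math.ceil(alpha * len(proj)))
    return proj[k - 1]
```

Evaluating this at an estimated direction and at a reference direction on the same sample gives a quick feel for how much of the change in the quantile comes from the direction alone.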


Leveraging heterogeneity for identifiability: Bayesian order-based learning of multiple DAGs

arXiv.org Machine Learning

We propose a joint order-based scoring framework for causal structure learning of directed acyclic graph (DAG) models under heterogeneous data settings. We show that leveraging heterogeneity improves the accuracy of causal ordering estimation. In the most favorable case, the causal ordering is identifiable up to two permutations. Building on this framework, we propose an order-based Bayesian method for Gaussian DAG models and establish its theoretical properties in the high-dimensional regime. For posterior inference over the space of orderings, we introduce a random-to-random (R2R) proposal neighborhood for the Metropolis-Hastings algorithm, which is theoretically motivated and exhibits efficient mixing behavior. Simulation studies confirm the strong empirical performance of the proposed method, and an application to single-nucleus RNA sequencing data from major depressive disorder demonstrates practical utility.
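One common reading of a random-to-random move on orderings is: remove one uniformly chosen variable and reinsert it at a uniformly chosen position. The sketch below implements that reading only; the name `r2r_proposal` is hypothetical, and the paper's exact neighborhood definition and acceptance step are not reproduced here.

```python
import random

def r2r_proposal(order, rng):
    """Random-to-random move: remove one item and reinsert it at a random slot."""
    proposal = list(order)                  # leave the current ordering untouched
    i = rng.randrange(len(proposal))        # position to remove from
    item = proposal.pop(i)
    j = rng.randrange(len(proposal) + 1)    # position to reinsert at
    proposal.insert(j, item)
    return proposal
```

In a Metropolis-Hastings loop, the returned ordering would be scored against the current one and accepted or rejected accordingly.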


Finite Sample Bounds for Learning with Score Matching

arXiv.org Machine Learning

Learning of continuous exponential family distributions with unbounded support remains an important area of research for both theory and applications in high-dimensional statistics. In recent years, score matching has become a widely used method for learning exponential families with continuous variables due to its computational ease when compared against maximum likelihood estimation. However, theoretical understanding of the statistical properties of score matching is still lacking. In this work, we provide a non-asymptotic sample complexity analysis for learning the structure of exponential families of polynomials with score matching. The derived sample bounds show a polynomial dependence on the model dimension. These bounds are the first of their kind, as all prior work has shown only asymptotic bounds on the sample complexity.
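The computational ease of score matching is easiest to see in the simplest case. The sketch below assumes a zero-mean one-dimensional Gaussian with unknown precision `lam`, a toy model chosen for this example rather than the polynomial families studied in the paper. The empirical score-matching objective is then a quadratic in `lam` with a closed-form minimizer, and no normalizing constant is needed.

```python
def score_matching_objective(samples, lam):
    """Empirical objective mean(0.5 * s(x)**2 + s'(x)) with s(x) = -lam * x."""
    second_moment = sum(x * x for x in samples) / len(samples)
    # s(x)**2 = lam**2 * x**2 and s'(x) = -lam.
    return 0.5 * lam ** 2 * second_moment - lam

def score_matching_precision(samples):
    """Minimizer of the objective above: lam = 1 / mean(x**2)."""
    second_moment = sum(x * x for x in samples) / len(samples)
    return 1.0 / second_moment
```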


Nearest-Neighbor Radii under Dependent Sampling

arXiv.org Machine Learning

Nearest-neighbor methods are fundamental to classical and modern machine learning, yet their geometric properties are typically analyzed under independent sampling. In this paper, we study the nearest-neighbor radii under dependent sampling. We consider strongly mixing observations and ask whether dependence changes the scale of nearest-neighbor neighborhoods. We establish distribution-free almost sure convergence under polynomial mixing and sharp non-asymptotic moment bounds under geometric mixing. The moment bounds depend on the local intrinsic dimension rather than the ambient dimension, making the results applicable to high-dimensional data concentrated near lower-dimensional manifolds. Synthetic experiments and real-world time-series benchmarks support the theory, showing that nearest-neighbor geometry remains informative under dependent sampling.
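A small experiment along these lines can be sketched as follows: draw a dependent sample from a Gaussian AR(1) process (a standard example of a geometrically mixing sequence when the coefficient is below one in absolute value) and measure the distance from a query point to its k-th nearest neighbor. The helper names `knn_radius` and `ar1_series` are assumptions for this illustration, not the paper's experimental code.

```python
import random

def knn_radius(points, query, k):
    """Distance from `query` to its k-th nearest neighbor among `points`."""
    dists = sorted(abs(p - query) for p in points)
    return dists[k - 1]

def ar1_series(n, phi, rng):
    """AR(1) series x_t = phi * x_{t-1} + noise: a simple dependent sample."""
    x, out = 0.0, []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, 1.0)
        out.append(x)
    return out
```

Comparing `knn_radius` on prefixes of increasing length shows how the radius shrinks with the sample size even though the observations are correlated.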


Optimal Asymptotic Rates for (Stochastic) Gradient Descent under the Local PL-Condition: A Geometric Approach

arXiv.org Machine Learning

Stochastic gradient descent (SGD) has been studied extensively over the past decades due to its simplicity and broad applicability in machine learning. In this work, we analyze the local behavior of gradient descent and stochastic gradient descent for minimizing $C^2$-functions that satisfy the Polyak-Lojasiewicz (PL) inequality, under a multiplicative gradient noise model motivated by overparameterized neural networks. Using a geometric interpretation of the PL-condition, we prove a simple yet surprising fact: in this possibly non-convex setting, the asymptotic convergence rate of (S)GD matches the rate obtained for strongly convex quadratics.
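A toy case helps show how a PL function can converge at a quadratic-like rate without being strongly convex. The sketch below assumes $f(x, y) = \tfrac{1}{2}x^2$, which is flat in $y$ (so not strongly convex) yet satisfies $\tfrac{1}{2}\|\nabla f\|^2 = f - f^*$. This example is convex and deterministic, so it illustrates the PL inequality only, not the paper's non-convex or noisy setting.

```python
def f(p):
    x, _y = p
    return 0.5 * x * x

def grad_f(p):
    x, _y = p
    return (x, 0.0)

def gradient_descent(p0, step, iters):
    """Plain gradient descent; returns the final point and the loss history."""
    p = p0
    values = [f(p)]
    for _ in range(iters):
        g = grad_f(p)
        p = (p[0] - step * g[0], p[1] - step * g[1])
        values.append(f(p))
    return p, values
```

With step size `step`, each iteration multiplies the loss by `(1 - step) ** 2`, the same contraction one gets on the strongly convex quadratic $\tfrac{1}{2}x^2$ in one dimension, while the flat coordinate never moves.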


A Barrier-Metric First-Order Method for Linearly Constrained Bilevel Optimization

arXiv.org Machine Learning

We study bilevel optimization with a fixed polyhedral lower feasible set. Such problems are challenging for two reasons: active-set changes can make the upper objective nonsmooth, and existing hypergradient methods typically require lower-Hessian inversions or equivalent linear solves, which are computationally expensive. To address these issues, we adopt a logarithmic barrier smoothing of the lower problem to obtain a differentiable approximation of the constrained bilevel objective, and develop a proxy-gradient algorithm for the resulting barrier-smoothed surrogate. The algorithm uses only gradients of the upper and lower objectives; its only second-order object is the explicit logarithmic barrier Hessian determined by the fixed polyhedral constraints. Barrier smoothing restores differentiability, but Euclidean smoothness constants are not uniformly bounded near the boundary. We therefore develop a local Dikin-geometry analysis in which the barrier-metric provides an oracle-free curvature scale near the moving lower centers. This leads to barrier-aware schedules that keep the iterates inside locally well-behaved regions. For the barrier-smoothed objective, we prove stationarity rates of $\widetilde{O}(K^{-2/3})$ in the deterministic setting and $\widetilde{O}(K^{-2/5})$ under upper-level-only bounded stochastic noise after $K$ outer iterations, together with quantitative bias control as the barrier parameter decreases.
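The one second-order object mentioned above has a simple closed form. For a polyhedron $\{y : Ay \le b\}$, the logarithmic barrier $\phi(y) = -\sum_i \log(b_i - a_i^\top y)$ has Hessian $\sum_i a_i a_i^\top / (b_i - a_i^\top y)^2$. The sketch below computes it with plain lists; the function name `barrier_hessian` and the data layout are assumptions for this example, not the paper's implementation.

```python
def barrier_hessian(A, b, y):
    """Hessian of phi(y) = -sum_i log(b_i - a_i . y) at a strictly feasible y."""
    n = len(y)
    H = [[0.0] * n for _ in range(n)]
    for a_i, b_i in zip(A, b):
        slack = b_i - sum(a * v for a, v in zip(a_i, y))
        if slack <= 0:
            raise ValueError("y must be strictly inside the polyhedron")
        # Each constraint adds the rank-one term a_i a_i^T / slack^2.
        for r in range(n):
            for c in range(n):
                H[r][c] += a_i[r] * a_i[c] / (slack * slack)
    return H
```

The entries grow like the inverse squared slack as `y` approaches a face, which is why Euclidean smoothness constants are not uniformly bounded near the boundary.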


Nonparametric estimation of time-varying network connections by multi-stage smoothing

arXiv.org Machine Learning

Time-varying networks arise in a wide variety of applications, such as functional brain connectivity [Thompson et al., 2017, Zhang et al., 2020], gene and genomic regulatory processes [Zhang and Cao, 2017, Bartlett et al., 2021], and social or economic environments [Snijders et al., 2010, Kolar et al., 2010]. In these contexts, measurements collected at different time points record how observed connections fluctuate, forming a sequence of network snapshots that reflect the temporal evolution of the underlying system. For example, fMRI studies yield time-indexed measurements of activity across brain regions, from which researchers construct connectivity networks that change over the scanning period [Bassett et al., 2011, Rubinov and Sporns, 2010]. Similarly, in political systems such as the U.S. Senate, legislative cosponsorship records give rise to network snapshots that naturally vary across sessions [Fowler, 2006, Kirkland and Gross, 2014]. General reviews of time-varying network analysis, including methodological developments and representative applications, are provided in Holme and Saramäki [2012] and Kim et al. [2018].


Proximal Projection for Doubly Sparse Regularized Models

arXiv.org Machine Learning

Regularization is often used in high-dimensional regression settings to generate a sparse model, which can save tremendous computing resources and identify predictors that are most strongly associated with the response. When the predictors can be represented by a Gaussian graphical model, the structure of the predictor graph can be exploited during regularization. Our proposed model exploits this underlying predictor graph structure by decomposing the estimated coefficient vector into a sum of latent variables that correspond to each node's contribution to the coefficient vector. Regularization is then performed on the latent variables rather than on the coefficient vector directly. We use a penalty function that permits a clear user-defined trade-off between the L1 and L2 penalties and propose a novel proximal projection during optimization. Further, our implementation computes the projection operator for the intersection of selected groups, which conserves more computing resources compared to predictor duplication methods, especially for high-dimensional data. Through simulation, we evaluate the performance of our approach under different graph structures and node counts, and present results on real-world data. Results suggest that our method exhibits stable performance relative to other singly or doubly sparse graphical regression models.
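For orientation, the standard proximal operator for a combined L1 and L2 (group) penalty on a single group can be sketched as elementwise soft-thresholding followed by a radial shrinkage of the whole group, as in the sparse group lasso. This is the textbook operator, not the novel proximal projection proposed by the authors, and the names `soft_threshold`, `prox_l1_l2`, `t1`, and `t2` are assumptions for this example.

```python
import math

def soft_threshold(v, t):
    """Elementwise shrinkage toward zero by t (prox of t * L1 norm)."""
    return [math.copysign(max(abs(x) - t, 0.0), x) for x in v]

def prox_l1_l2(v, t1, t2):
    """Prox of t1 * ||.||_1 + t2 * ||.||_2 for one group of coefficients."""
    u = soft_threshold(v, t1)              # L1 part: sparsity within the group
    norm = math.sqrt(sum(x * x for x in u))
    if norm <= t2:                         # L2 part: the whole group is zeroed
        return [0.0] * len(v)
    scale = 1.0 - t2 / norm                # otherwise shrink the group radially
    return [scale * x for x in u]
```

Raising `t1` relative to `t2` favors sparsity inside a group, while raising `t2` favors dropping entire groups, which is the kind of trade-off the penalty exposes to the user.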


Dynamics of Stochastic Momentum Methods on Large-scale, Quadratic Models Supplementary material

Neural Information Processing Systems

The appendix is organized into five sections as follows:

1. Appendix A derives the Volterra equation and proves the main result for the homogenized SGD (Theorem 1).
2. Appendix B gives a heuristic derivation of the homogenized SGD approximation to the SDA class of algorithms on the least squares problem and shows that SGD and homogenized SGD are close under orthogonal invariance (Theorem 2).
3. Appendix C gives a general overview of the analysis of a convolution Volterra equation of the type that arises in the SDA class.

Unless otherwise stated, all the results hold under Assumptions 1 and 2. We include all statements from the previous sections for clarity.

The results presented in this paper concern the analysis of existing methods and a new method that is a variant of an existing method. The results are theoretical and we do not anticipate any direct ethical and societal issues. We believe the results will be used by machine learning practitioners and we encourage them to use these results to build a more just, prosperous world.

A.1 Homogenized SGD

We recall that the diffusion model is given by dXt = 2 dZt 1 To connect these diffusions to SGD on the least squares problem (2.1), $f(x) = \frac{1}{2}\|Ax - b\|^2$, we will use the singular value decomposition $U \Sigma V^T$ of $A$. We order the singular values $\sigma_1 \ge \sigma_2 \ge \sigma_3 \ge \cdots$ in decreasing order. We then let t = VT(Xt ex), where we recall that b = Aex+ . We may do a similar computation with N and conclude that: J(1) = 2 2 2jJ 2 1 '(t) '(s)d s,j In summary, we may express J in terms of N by J(1) = 2 2 2jJ 1 '2(t) N(1) + 22 dh t,jiwith J(0) = EH When (k,n)= k+n and thus '(t)=(1+ t) with (t)= 1+t, the corresponding ODE is precisely bJ(3) The other case is when (k,n)= n, or '(t)=exp( t). We call this the general SDAHB; one recovers SDAHB when 1 =, 2 =0, and = .
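For readers who want a concrete reference point for the least squares problem $f(x) = \frac{1}{2}\|Ax - b\|^2$, here is a minimal sketch of single-row SGD on it. This is plain SGD only: it is not the SDA class, SDAHB, or the homogenized diffusion analyzed in the appendix, and the function name `sgd_least_squares` and its parameters are assumptions for this example.

```python
import random

def sgd_least_squares(A, b, step, iters, rng):
    """Single-row SGD on f(x) = 0.5 * ||Ax - b||^2, starting from x = 0."""
    x = [0.0] * len(A[0])
    for _ in range(iters):
        i = rng.randrange(len(A))                       # sample one row uniformly
        residual = sum(a * v for a, v in zip(A[i], x)) - b[i]
        # The gradient of 0.5 * (a_i . x - b_i)^2 is residual * a_i.
        x = [v - step * residual * a for v, a in zip(x, A[i])]
    return x
```

On a consistent system the iterates approach a solution; on a noisy one they fluctuate around it at a scale set by the step size.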