Mathematical & Statistical Methods
Greedy and Random Quasi-Newton Methods with Faster Explicit Superlinear Convergence
In this paper, we follow the work of Rodomanov and Nesterov [19] to study quasi-Newton methods. We focus on the common SR1 and BFGS quasi-Newton methods to establish better explicit (local) superlinear convergence rates. First, based on the greedy quasi-Newton update, which greedily selects the direction to maximize a certain measure of progress, we improve the convergence rate to a condition-number-free superlinear convergence rate. Second, based on the random quasi-Newton update, which selects the direction randomly from a spherically symmetric distribution, we establish the same superlinear convergence rate. Our analysis is closely related to the approximation of a given Hessian matrix, unconstrained quadratic objectives, and general strongly convex, smooth, and strongly self-concordant functions.
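As an illustration of the Hessian-approximation setting mentioned above, here is a minimal sketch of SR1 updates with a greedy and a random direction. The selection rule (the coordinate vector with the largest diagonal residual), the initialization G0 = L·I, and all function names are assumptions made for this example and are not taken from the abstract.

```python
import numpy as np

def sr1_step(G, A, u, tol=1e-12):
    """One SR1 update of the approximation G toward the target matrix A
    along direction u. Assumes G - A is positive semidefinite."""
    R = G - A                      # current residual
    Ru = R @ u
    denom = u @ Ru                 # u^T (G - A) u
    if denom <= tol:               # nothing left to correct along u
        return G
    return G - np.outer(Ru, Ru) / denom

def greedy_sr1_step(G, A):
    """Greedy variant (assumed rule): pick the coordinate direction e_i
    that maximizes e_i^T (G - A) e_i, i.e. the largest diagonal residual."""
    i = int(np.argmax(np.diag(G - A)))
    u = np.zeros(A.shape[0])
    u[i] = 1.0
    return sr1_step(G, A, u)

def random_sr1_step(G, A, rng):
    """Random variant: draw the direction from a standard Gaussian,
    which is one example of a spherically symmetric distribution."""
    u = rng.standard_normal(A.shape[0])
    return sr1_step(G, A, u)

def initial_approximation(A):
    """Hypothetical initialization G0 = L * I with L the largest eigenvalue
    of A, so that G0 - A is positive semidefinite."""
    L = np.linalg.eigvalsh(A)[-1]
    return L * np.eye(A.shape[0])
```

Under these assumptions each update lowers the rank of the residual G - A by one whenever the residual is nonzero along the chosen direction, so for an n-by-n matrix the greedy variant recovers A after at most n steps in exact arithmetic.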
The alignment property of SGD noise and how it helps select flat minima: A stability analysis
The phenomenon that stochastic gradient descent (SGD) favors flat minima has played a critical role in understanding the implicit regularization of SGD. In this paper, we provide an explanation of this striking phenomenon by relating the particular noise structure of SGD to its linear stability (Wu et al., 2018). Specifically, we consider training over-parameterized models with square loss. We prove that if a global minimum $\theta$ is linearly stable for SGD, then it must satisfy $\|H(\theta)\|_F \le O(\sqrt{B}/\eta)$, where $\|H(\theta)\|_F$, $B$, and $\eta$ denote the Frobenius norm of the Hessian at $\theta$, the batch size, and the learning rate, respectively. Otherwise, SGD will escape from that minimum exponentially fast. Hence, for minima accessible to SGD, the sharpness, as measured by the Frobenius norm of the Hessian, is bounded independently of the model size and sample size. The key to obtaining these results is exploiting the particular structure of SGD noise: the noise concentrates in sharp directions of the local landscape, and its magnitude is proportional to the loss value. This alignment property of SGD noise provably holds for linear networks and random feature models (RFMs), and is empirically verified for nonlinear networks. Moreover, the validity and practical relevance of our theoretical findings are supported by extensive experiments on the CIFAR-10 dataset.
Learning with little mixing
We study square loss in a realizable time-series framework with martingale difference noise. Our main result is a fast rate excess risk bound which shows that whenever a trajectory hypercontractivity condition holds, the risk of the least-squares estimator on dependent data matches the iid rate order-wise after a burn-in time. In comparison, many existing results in learning from dependent data have rates where the effective sample size is deflated by a factor of the mixing-time of the underlying process, even after the burn-in time. Furthermore, our results allow the covariate process to exhibit long range correlations which are substantially weaker than geometric ergodicity. We call this phenomenon learning with little mixing, and present several examples for when it occurs: bounded function classes for which the $L^2$ and $L^{2+\varepsilon}$ norms are equivalent, ergodic finite state Markov chains, various parametric models, and a broad family of infinite dimensional $\ell^2(\mathbb{N})$ ellipsoids. By instantiating our main result to system identification of nonlinear dynamics with generalized linear model transitions, we obtain a nearly minimax optimal excess risk bound after only a polynomial burn-in time.
Multistage Conditional Compositional Optimization
Şen, Buse, Hu, Yifan, Kuhn, Daniel
We introduce Multistage Conditional Compositional Optimization (MCCO) as a new paradigm for decision-making under uncertainty that combines aspects of multistage stochastic programming and conditional stochastic optimization. MCCO minimizes a nest of conditional expectations and nonlinear cost functions. It has numerous applications and arises, for example, in optimal stopping, linear-quadratic regulator problems, distributionally robust contextual bandits, as well as in problems involving dynamic risk measures. The naïve nested sampling approach for MCCO suffers from the curse of dimensionality familiar from scenario tree-based multistage stochastic programming, that is, its scenario complexity grows exponentially with the number of nests. We develop new multilevel Monte Carlo techniques for MCCO whose scenario complexity grows only polynomially with the desired accuracy.
Biconvex Biclustering
Rosen, Sam, Chi, Eric C., Xu, Jason
This article proposes a biconvex modification to convex biclustering in order to improve its performance in high-dimensional settings. In contrast to heuristics that discard a subset of noisy features a priori, our method jointly learns and accordingly weighs informative features while discovering biclusters. Moreover, the method is adaptive to the data, and is accompanied by an efficient algorithm based on proximal alternating minimization, complete with detailed guidance on hyperparameter tuning and efficient solutions to optimization subproblems. These contributions are theoretically grounded; we establish finite-sample bounds on the objective function under sub-Gaussian errors, and generalize these guarantees to cases where input affinities need not be uniform. Extensive simulation results reveal our method consistently recovers underlying biclusters while weighing and selecting features appropriately, outperforming peer methods. An application to a gene microarray dataset of lymphoma samples recovers biclusters matching an underlying classification, while giving additional interpretation to the mRNA samples via the column groupings and fitted weights.
The Hiremath Early Detection (HED) Score: A Measure-Theoretic Evaluation Standard for Temporal Intelligence
We introduce the Hiremath Early Detection (HED) Score, a principled, measure-theoretic evaluation criterion for quantifying the time-value of information in systems operating over non-stationary stochastic processes subject to abrupt regime transitions. Existing evaluation paradigms, chiefly the ROC/AUC framework and its downstream variants, are temporally agnostic: they assign identical credit to a detection at $t + 1$ and a detection at $t + \tau$ for arbitrarily large $\tau$. This indifference to latency is a fundamental inadequacy in time-critical domains including cyber-physical security, algorithmic surveillance, and epidemiological monitoring. The HED Score resolves this by integrating a baseline-neutral, exponentially decaying kernel over the posterior probability stream of a target regime, beginning precisely at the onset of the regime shift. The resulting scalar simultaneously encodes detection acuity, temporal lead, and pre-transition calibration quality. We prove that the HED Score satisfies three axiomatic requirements: (A1) Temporal Monotonicity, (A2) Invariance to Pre-Attack Bias, and (A3) Sensitivity Decomposability. We further demonstrate that the HED Score admits a natural parametric family indexed by the Hiremath Decay Constant ($\lambda_H$), whose domain-specific calibration constitutes the Hiremath Standard Table. As an empirical vehicle, we present PARD-SSM (Probabilistic Anomaly and Regime Detection via Switching State-Space Models), which couples fractional Stochastic Differential Equations (fSDEs) with a Switching Linear Dynamical System (S-LDS) inference backend. On the NSL-KDD benchmark, PARD-SSM achieves a HED Score of 0.0643, representing a 388.8 percent improvement over a Random Forest baseline (0.0132), with statistical significance confirmed via block-bootstrap resampling (p < 0.001). We propose the HED Score as the successor evaluation standard to ROC/AUC.
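The abstract does not state the HED formula, so the following is only a sketch of one possible reading of "a baseline-neutral, exponentially decaying kernel over the posterior probability stream". The function name, the use of the pre-onset mean as the baseline, the discrete-time sum, and the normalization by the kernel mass are all assumptions for illustration and should not be taken as the authors' definition.

```python
import numpy as np

def decayed_detection_score(posterior, onset, decay):
    """Hypothetical latency-aware score (NOT the published HED definition).

    posterior : sequence of posterior probabilities of the target regime
    onset     : index at which the regime shift begins
    decay     : rate of the exponential kernel (plays the role of a decay constant)
    """
    posterior = np.asarray(posterior, dtype=float)
    # Assumed baseline: mean posterior before the onset, so that a constant
    # pre-transition bias is subtracted out.
    baseline = posterior[:onset].mean() if onset > 0 else 0.0
    lag = np.arange(len(posterior) - onset)     # time elapsed since onset
    kernel = np.exp(-decay * lag)               # early detections weigh more
    excess = posterior[onset:] - baseline       # baseline-neutral signal
    # Assumed normalization by the total kernel mass.
    return float(np.sum(kernel * excess) / np.sum(kernel))
```

With this reading, a detector whose posterior rises immediately at the onset scores higher than one that rises several steps later, and adding a constant offset to the whole posterior stream leaves the score unchanged.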
Identification and Inference in Nonlinear Dynamic Network Models
We study identification and inference in nonlinear dynamic systems defined on unknown interaction networks. The system evolves through an unobserved dependence matrix governing cross-sectional shock propagation via a nonlinear operator. We show that the network structure is not generically identified, and that identification requires sufficient spectral heterogeneity. In particular, identification arises when the network induces non-exchangeable covariance patterns through heterogeneous amplification of eigenmodes. When the spectrum is concentrated, dependence becomes observationally equivalent to common shocks or scalar heterogeneity, leading to non-identification. We provide necessary and sufficient conditions for identification, characterize observational equivalence classes, and propose a semiparametric estimator with asymptotic theory. We also develop tests for network dependence whose power depends on spectral properties of the interaction matrix. The results apply to a broad class of economic models, including production networks, contagion models, and dynamic interaction systems.
Attributed Network Alignment: Statistical Limits and Efficient Algorithm
Huang, Dong, Tian, Chenyang, Yang, Pengkun
This paper studies the problem of recovering a hidden vertex correspondence between two correlated graphs when both edge weights and node features are observed. While most existing work on graph alignment relies primarily on edge information, many real-world applications provide informative node features in addition to graph topology. To capture this setting, we introduce the featured correlated Gaussian Wigner model, where two graphs are coupled through an unknown vertex permutation, and the node features are correlated under the same permutation. We characterize the optimal information-theoretic thresholds for exact recovery and partial recovery of the latent mapping. On the algorithmic side, we propose QPAlign, an algorithm based on a quadratic programming relaxation, and demonstrate its strong empirical performance on both synthetic and real datasets. We also derive theoretical guarantees for the proposed procedure, including convergence guarantees, that support its reliability.
Detection of local geometry in random graphs: information-theoretic and computational limits
Bok, Jinho, Li, Shuangping, Yu, Sophie H.
We study the problem of detecting local geometry in random graphs. We introduce a model $\mathcal{G}(n, p, d, k)$, where a hidden community of average size $k$ has edges drawn as a random geometric graph on $\mathbb{S}^{d-1}$, while all remaining edges follow the Erdős--Rényi model $\mathcal{G}(n, p)$. The random geometric graph is generated by thresholding inner products of latent vectors on $\mathbb{S}^{d-1}$, with each edge having marginal probability equal to $p$. This implies that $\mathcal{G}(n, p, d, k)$ and $\mathcal{G}(n, p)$ are indistinguishable at the level of the marginals, and the signal lies entirely in the edge dependencies induced by the local geometry. We investigate both the information-theoretic and computational limits of detection. On the information-theoretic side, our upper bounds follow from three tests based on signed triangle counts: a global test, a scan test, and a constrained scan test; our lower bounds follow from two complementary methods: truncated second moment via Wishart--GOE comparison, and tensorization of KL divergence. These results together settle the detection threshold at $d = \widetilde{\Theta}(k^2 \vee k^6/n^3)$ for fixed $p$, and extend the state-of-the-art bounds from the full model (i.e., $k = n$) for vanishing $p$. On the computational side, we identify a computational--statistical gap and provide evidence via the low-degree polynomial framework, as well as the suboptimality of signed cycle counts of length $\ell \geq 4$.
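To make the model and the global signed triangle statistic concrete, here is a minimal sketch restricted to the special case $p = 1/2$, where thresholding inner products at zero gives each geometric edge marginal probability $1/2$. The function names, the independent community membership with probability $k/n$, and the restriction to $p = 1/2$ are assumptions made for this example; the scan tests and the actual test thresholds are not shown.

```python
import numpy as np

def sample_local_geometry_graph(n, d, k, rng):
    """Sample an adjacency matrix in the p = 1/2 special case (assumed setup).

    Each vertex joins the hidden community independently with probability k/n,
    so the community has average size k. Edges inside the community come from
    thresholding inner products of uniform points on the sphere S^{d-1} at 0;
    all other edges are independent Bernoulli(1/2).
    """
    in_community = rng.random(n) < k / n
    background = np.triu(rng.random((n, n)) < 0.5, 1)     # Erdős–Rényi part
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)         # points on the sphere
    geometric = np.triu(X @ X.T > 0.0, 1)                 # marginal probability 1/2
    both_inside = np.outer(in_community, in_community)
    upper = np.where(both_inside, geometric, background)  # strictly upper triangle
    A = (upper | upper.T).astype(float)                   # symmetric, zero diagonal
    return A, in_community

def signed_triangle_count(A, p):
    """Sum over vertex triples i<j<l of (A_ij - p)(A_jl - p)(A_il - p)."""
    B = A - p
    np.fill_diagonal(B, 0.0)
    # Each triangle is counted 6 times in trace(B^3).
    return np.trace(B @ B @ B) / 6.0
```

The statistic has mean zero under the Erdős–Rényi model because the centered edge indicators are independent there; any systematic deviation comes from the edge dependencies inside the community.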