Optimal Approximation


The Polar Express: Optimal Matrix Sign Methods and Their Application to the Muon Algorithm

Amsel, Noah, Persson, David, Musco, Christopher, Gower, Robert M.

arXiv.org Artificial Intelligence

Computing the polar decomposition and the related matrix sign function has been a well-studied problem in numerical analysis for decades. Recently, it has emerged as an important subroutine within the Muon algorithm for training deep neural networks. However, the requirements of this application differ sharply from classical settings: deep learning demands GPU-friendly algorithms that prioritize high throughput over high precision. We introduce Polar Express, a new method for computing the polar decomposition. Like Newton-Schulz and other classical polynomial methods, our approach uses only matrix-matrix multiplications, making it very efficient on GPUs. Inspired by earlier work of Chen & Chow and Nakatsukasa & Freund, Polar Express adapts the update rule at each iteration by solving a minimax optimization problem. We prove that this strategy minimizes error in a worst-case sense, allowing Polar Express to converge as rapidly as possible both in the early iterations and asymptotically. We also address finite-precision issues, making it practical to use in bfloat16. When integrated into the Muon training framework, our method leads to consistent improvements in validation loss when training a GPT-2 model on one billion tokens from the FineWeb dataset, outperforming recent alternatives across a range of learning rates.
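
As a rough illustration of the family of methods the abstract refers to, the sketch below implements the classical fixed-coefficient Newton-Schulz iteration in NumPy. Like Polar Express, it uses only matrix-matrix products (which is what makes this family GPU friendly), but it does not use the adaptive minimax coefficients that the paper derives.

```python
import numpy as np

def newton_schulz_polar(A, num_iters=30):
    """Approximate the orthogonal polar factor of A with the classical
    Newton-Schulz iteration X <- 0.5 * X (3 I - X^T X).

    Uses only matrix-matrix products; coefficients are the fixed classical
    ones, NOT the adaptive minimax coefficients of Polar Express.
    """
    # Scale so all singular values lie in (0, 1]; the Frobenius norm is a cheap
    # upper bound on the spectral norm and keeps the iteration in its
    # convergence region.
    X = A / np.linalg.norm(A, "fro")
    I = np.eye(A.shape[1])
    for _ in range(num_iters):
        X = 0.5 * X @ (3.0 * I - X.T @ X)
    return X

# Sanity check against the polar factor from the SVD: A = U diag(s) V^T
# has polar factor U V^T.
rng = np.random.default_rng(0)
A = rng.standard_normal((64, 64))
U_approx = newton_schulz_polar(A)
U, s, Vt = np.linalg.svd(A)
print(np.linalg.norm(U_approx.T @ U_approx - np.eye(64)))  # close to 0 (orthogonality)
print(np.linalg.norm(U_approx - U @ Vt))                   # small residual
```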


Nearly Optimal Approximation of Matrix Functions by the Lanczos Method

Neural Information Processing Systems

Approximating the action of a matrix function f(\vec{A}) on a vector \vec{b} is an increasingly important primitive in machine learning, data science, and statistics, with applications such as sampling high dimensional Gaussians, Gaussian process regression and Bayesian inference, principal component analysis, and approximating Hessian spectral densities. Over the past decade, a number of algorithms enjoying strong theoretical guarantees have been proposed for this task. Many of the most successful belong to a family of algorithms called Krylov subspace methods. Remarkably, a classic Krylov subspace method, called the Lanczos method for matrix functions (Lanczos-FA), frequently outperforms newer methods in practice. Our main result is a theoretical justification for this finding: we show that, for a natural class of rational functions, Lanczos-FA matches the error of the best possible Krylov subspace method up to a multiplicative approximation factor. The approximation factor depends on the degree of f(x)'s denominator and the condition number of \vec{A}, but not on the number of iterations k. Our result provides a strong justification for the excellent performance of Lanczos-FA, especially on functions that are well approximated by rationals, such as the matrix square root.
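
The following sketch is a textbook implementation of Lanczos-FA for a symmetric matrix, with full reorthogonalization for simplicity: it approximates f(A)b as ||b|| Q_k f(T_k) e_1 and is checked here on the matrix square root. The test matrix and iteration count are illustrative choices, not taken from the paper.

```python
import numpy as np
from scipy.linalg import eigh_tridiagonal

def lanczos_fa(A, b, f, k):
    """Lanczos method for matrix functions (Lanczos-FA) for symmetric A.

    Runs k Lanczos iterations to build an orthonormal Krylov basis Q and a
    tridiagonal matrix T, then returns ||b|| * Q f(T) e_1. Only matrix-vector
    products with A are required.
    """
    n = len(b)
    Q = np.zeros((n, k))
    alpha = np.zeros(k)
    beta = np.zeros(k - 1)
    q = b / np.linalg.norm(b)
    q_prev = np.zeros(n)
    for j in range(k):
        Q[:, j] = q
        w = A @ q
        alpha[j] = q @ w
        w = w - alpha[j] * q - (beta[j - 1] * q_prev if j > 0 else 0)
        # Full reorthogonalization for numerical stability (fine for small k).
        w = w - Q[:, : j + 1] @ (Q[:, : j + 1].T @ w)
        if j < k - 1:
            beta[j] = np.linalg.norm(w)
            q_prev, q = q, w / beta[j]
    # f(T) e_1 via the eigendecomposition of the k x k tridiagonal T.
    evals, evecs = eigh_tridiagonal(alpha, beta)
    fT_e1 = evecs @ (f(evals) * evecs[0, :])
    return np.linalg.norm(b) * (Q @ fT_e1)

# Example: approximate A^{1/2} b for a random symmetric positive definite matrix.
rng = np.random.default_rng(0)
M = rng.standard_normal((200, 200))
A = M @ M.T + 200 * np.eye(200)
b = rng.standard_normal(200)
approx = lanczos_fa(A, b, np.sqrt, k=20)
evals, evecs = np.linalg.eigh(A)
exact = evecs @ (np.sqrt(evals) * (evecs.T @ b))
print(np.linalg.norm(approx - exact) / np.linalg.norm(exact))  # small relative error
```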


Review for NeurIPS paper: Optimal Approximation - Smoothness Tradeoffs for Soft-Max Functions

Neural Information Processing Systems

Summary and Contributions: A soft-max function has two main efficiency measures, approximation and smoothness. The authors' goal is to identify the optimal approximation-smoothness tradeoffs for different measures of approximation and smoothness. They introduce a soft-max function, called piecewise-linear soft-max, with an optimal tradeoff between approximation, measured as worst-case additive approximation, and smoothness, measured with respect to the L_p norm. The worst-case approximation guarantee of the piecewise-linear mechanism enforces sparsity in the output of the soft-max function, a property that is known to be important in machine learning applications and is not satisfied by the exponential mechanism. Finally, they investigate another soft-max function, called the power mechanism, with an optimal tradeoff between expected multiplicative approximation and smoothness with respect to the Rényi divergence, which provides improved theoretical and practical results in differentially private submodular optimization.
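
For intuition about the approximation-smoothness tradeoff discussed above, the snippet below probes the standard exponential soft-max (not the paper's piecewise-linear or power mechanisms): lowering the temperature shrinks the additive gap to the true maximum but increases sensitivity to input perturbations. The dimensions, temperatures, and perturbation size are illustrative choices.

```python
import numpy as np

def exp_softmax(x, temp):
    """Standard exponential soft-max with temperature `temp`."""
    z = (x - x.max()) / temp          # shift for numerical stability
    w = np.exp(z)
    return w / w.sum()

rng = np.random.default_rng(0)
x = rng.standard_normal(50)           # scores to "soft-maximize"
tie = np.zeros(50)                    # worst case for smoothness: all scores tied
delta = 1e-3

for temp in [1.0, 0.1, 0.01]:
    p = exp_softmax(x, temp)
    gap = x.max() - p @ x             # additive approximation error vs. the true max
    tie2 = tie.copy(); tie2[0] += delta
    sens = np.abs(exp_softmax(tie2, temp) - exp_softmax(tie, temp)).sum() / delta
    print(f"temp={temp:5.2f}   additive gap={gap:.4f}   l1 sensitivity~{sens:.3f}")
```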


Review for NeurIPS paper: Optimal Approximation - Smoothness Tradeoffs for Soft-Max Functions

Neural Information Processing Systems

This paper studies trade-offs between the approximation quality and smoothness of "soft"-max functions. The natural exponential soft-max has an optimal tradeoff between expected additive approximation and smoothness measured with respect to the Rényi divergence, but is suboptimal when smoothness is measured via L_p norms. The authors present a new piecewise-linear soft-max function which is optimal for L_p norms. The reviewers found this to be a well-written and thorough paper on an important problem of broad interest in machine learning. I recommend this paper for acceptance.


Optimal approximation using complex-valued neural networks

Neural Information Processing Systems

Complex-valued neural networks (CVNNs) have recently shown promising empirical success, for instance for increasing the stability of recurrent neural networks and for improving the performance in tasks with complex-valued inputs, such as MRI fingerprinting. While the overwhelming success of Deep Learning in the real-valued case is supported by a growing mathematical foundation, such a foundation is still largely lacking in the complex-valued case. We thus analyze the expressivity of CVNNs by studying their approximation properties. Our results yield the first quantitative approximation bounds for CVNNs that apply to a wide class of activation functions including the popular modReLU and complex cardioid activation functions. Precisely, our results apply to any activation function that is smooth but not polyharmonic on some non-empty open set; this is the natural generalization of the class of smooth and non-polynomial activation functions to the complex setting.
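
For reference, the commonly cited definitions of the two activation functions named in the abstract can be written in a few lines of NumPy; the bias b in modReLU and the test inputs below are illustrative choices, not taken from the paper.

```python
import numpy as np

def modrelu(z, b):
    """modReLU activation for complex inputs (Arjovsky et al., 2016):
    passes the modulus through a ReLU with bias b while preserving the phase.
    """
    mag = np.abs(z)
    phase = z / np.where(mag > 0, mag, 1.0)      # avoid division by zero at z = 0
    return np.where(mag + b > 0, (mag + b) * phase, 0.0)

def complex_cardioid(z):
    """Complex cardioid activation (Virtue et al., 2017): attenuates z based on
    its phase, acting like ReLU on the real axis but phase-sensitive in general.
    """
    return 0.5 * (1.0 + np.cos(np.angle(z))) * z

z = np.array([1 + 1j, -2 + 0.5j, 0.1 - 0.3j, -0.05 + 0.0j])
print(modrelu(z, b=-0.5))
print(complex_cardioid(z))
```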


Polynomial-time derivation of optimal k-tree topology from Markov networks

Dastjerdi, Fereshteh R., Cai, Liming

arXiv.org Machine Learning

Characterization of the joint probability distribution for large networks of random variables remains a challenging task in data science. Probabilistic graph approximation with simple topologies has been the practical recourse; typically, a tree topology makes joint probability computation much simpler and can be effective for statistical inference on insufficient data. However, to characterize network components where multiple variables cooperate closely to influence others, model topologies beyond a tree are needed, which unfortunately are infeasible to acquire. In particular, our previous work has closely related optimal approximation of Markov networks of tree-width k >= 2 to the graph-theoretic problem of finding a maximum spanning k-tree (MSkT), which is a provably intractable task. This paper investigates optimal approximation of Markov networks with k-tree topology that retains some designated underlying subgraph. Such a subgraph may encode certain background information that arises in scientific applications, for example, about a known significant pathway in gene networks or the indispensable backbone connectivity in the residue interaction graphs for a biomolecule's 3D structure. In particular, it is proved that the \beta-retaining MSkT problem, for a number of classes \beta of graphs, admits O(n^{k+1})-time algorithms for every fixed k >= 1. These \beta-retaining MSkT algorithms offer efficient solutions for approximating Markov networks with k-tree topology in situations where certain persistent information needs to be retained.
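
The k = 1 case of this approximation problem is the classical Chow-Liu algorithm, which the sketch below implements from scratch (empirical mutual information plus a maximum spanning tree). It is meant only to illustrate the tree-topology baseline, not the paper's \beta-retaining MSkT algorithms; the toy data are illustrative.

```python
import numpy as np
from itertools import combinations

def mutual_information(x, y):
    """Empirical mutual information (in nats) between discrete samples x, y."""
    mi = 0.0
    for a in np.unique(x):
        for b in np.unique(y):
            pxy = np.mean((x == a) & (y == b))
            if pxy > 0:
                mi += pxy * np.log(pxy / (np.mean(x == a) * np.mean(y == b)))
    return mi

def chow_liu_tree(data):
    """Chow-Liu algorithm: the k = 1 (tree) case of optimal k-tree approximation.

    Builds a maximum spanning tree over pairwise mutual information, which gives
    the KL-optimal tree-structured approximation of the joint distribution.
    `data` is an (n_samples, n_variables) array of discrete observations.
    """
    d = data.shape[1]
    weight = np.zeros((d, d))
    for i, j in combinations(range(d), 2):
        weight[i, j] = weight[j, i] = mutual_information(data[:, i], data[:, j])
    # Prim's algorithm for a maximum spanning tree.
    in_tree, edges = {0}, []
    while len(in_tree) < d:
        i, j = max(((i, j) for i in in_tree for j in range(d) if j not in in_tree),
                   key=lambda e: weight[e])
        edges.append((i, j))
        in_tree.add(j)
    return edges

# Toy data: X2 is a noisy copy of X0, X1 is independent noise.
rng = np.random.default_rng(0)
x0 = rng.integers(0, 2, 2000)
x1 = rng.integers(0, 2, 2000)
x2 = np.where(rng.random(2000) < 0.9, x0, 1 - x0)
print(chow_liu_tree(np.column_stack([x0, x1, x2])))  # expect the edge (0, 2) in the tree
```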


Clustering what Matters: Optimal Approximation for Clustering with Outliers

Agrawal, Akanksha (Indian Institute of Technology, Madras) | Inamdar, Tanmay (University of Bergen) | Saurabh, Saket (Institute of Mathematical Sciences, Chennai, India) | Xue, Jie (NYU Shanghai)

Journal of Artificial Intelligence Research

Clustering with outliers is one of the most fundamental problems in Computer Science. Given a set X of n points and two numbers k, m, clustering with outliers aims to exclude m points from X and partition the remaining points into k clusters so as to minimize a certain cost function. In this paper, we give a general approach for solving clustering with outliers, which results in a fixed-parameter tractable (FPT) algorithm in k and m—i.e., an algorithm with running time of the form f(k, m) · n^{O(1)} for some function f—that almost matches the approximation ratio for its outlier-free counterpart. As a corollary, we obtain FPT approximation algorithms with optimal approximation ratios for k-Median and k-Means with outliers in general and Euclidean metrics. We also exhibit more applications of our approach to other variants of the problem that impose additional constraints on the clustering, such as fairness or matroid constraints.
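
To make the problem setup concrete, the sketch below implements a simple Lloyd-style heuristic in the spirit of "k-means--" (Chawla & Gionis), which drops the m farthest points at each step. It is an illustrative baseline only, not the FPT approximation algorithm of the paper; the synthetic data are illustrative.

```python
import numpy as np

def kmeans_minus_minus(X, k, m, iters=50, seed=0):
    """Heuristic for k-means with m outliers: at each step, ignore the m points
    farthest from their nearest center, then update centers from the rest.
    NOT the paper's FPT scheme; just a baseline illustrating the objective.
    """
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # squared distances
        nearest = d2.argmin(1)
        nearest_d2 = d2[np.arange(len(X)), nearest]
        inliers = np.argsort(nearest_d2)[: len(X) - m]              # exclude m outliers
        for c in range(k):
            pts = X[inliers[nearest[inliers] == c]]
            if len(pts) > 0:
                centers[c] = pts.mean(0)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    outliers = np.argsort(d2.min(1))[len(X) - m:]
    return centers, outliers

# Two Gaussian clusters plus a few far-away outlier points.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(8, 1, (100, 2)),
               rng.uniform(30, 40, (5, 2))])
centers, outliers = kmeans_minus_minus(X, k=2, m=5)
print(centers)           # typically close to the two cluster means
print(sorted(outliers))  # the far-away points are typically the ones excluded
```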


Conjugate Natural Selection

Raab, Reilly, de Alfaro, Luca, Liu, Yang

arXiv.org Artificial Intelligence

Evolution describes how distributions change. Specifically, evolution provides a model for how a population's distribution of traits or strategies (hereafter hypotheses) changes over time as an environment modulates reproduction rates (i.e., of individuals or of hypotheses; Lloyd, 2020): Hypotheses that have higher fitness are "selected" by the environment and, in expectation, become more popular with time. The replicator equation is a formal, analytic model of evolution and is indispensable to biology (Sinervo and Calsbeek, 2006; Queller, 2017). In the replicator equation, the absolute fitness (in this paper, the negative loss L) of a hypothesis h ∈ H is identified with its rate of replication: exponential growth (or decline) in a population where different hypotheses compete for relative frequency ρ(h) ∈ [0, 1]. For probability distributions over the hypothesis space H, this equation induces replicator dynamics, selecting hypotheses with lower-than-average loss.
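
A discretized version of the replicator dynamics described here is easy to simulate; the sketch below takes Euler steps with fitness equal to the negative loss, with the step size and loss values chosen purely for illustration.

```python
import numpy as np

def replicator_step(rho, loss, dt=0.01):
    """One Euler step of the replicator equation with fitness f = -loss:
    d rho(h) / dt = rho(h) * ( f(h) - sum_h' rho(h') f(h') ).
    Hypotheses with lower-than-average loss grow in relative frequency.
    """
    fitness = -loss
    avg_fitness = rho @ fitness
    rho = rho + dt * rho * (fitness - avg_fitness)
    return rho / rho.sum()   # renormalize to keep a probability distribution

# Three hypotheses with losses 1.0, 0.5, 0.2; start from the uniform distribution.
rho = np.ones(3) / 3
loss = np.array([1.0, 0.5, 0.2])
for _ in range(2000):
    rho = replicator_step(rho, loss)
print(rho)   # mass concentrates on the lowest-loss hypothesis
```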


Optimal Approximation with Sparse Neural Networks and Applications

Hong, Khay Boon

arXiv.org Artificial Intelligence

We use deep sparsely connected neural networks to measure the complexity of a function class in $L^2(\mathbb R^d)$ by restricting the connectivity and the memory required to store the networks. We also introduce the notion of a representation system (a countable collection of functions) to guide neural networks, since approximation theory with representation systems is well developed in mathematics. We then prove the fundamental bound theorem, implying that a quantity intrinsic to the function class itself gives information about the approximation ability of neural networks and representation systems. We also provide a method for transferring existing theories about approximation by representation systems to neural networks, greatly amplifying the practical value of neural networks. Finally, we use neural networks to approximate B-spline functions, which are used to generate B-spline curves. We then analyse the complexity of a class called $\beta$ cartoon-like functions using rate-distortion theory and wedgelet constructions.
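
The B-spline basis functions mentioned at the end of the abstract are defined by the Cox-de Boor recursion; the sketch below evaluates them and checks the partition-of-unity property on a uniform knot vector. The degree and knot choices are illustrative, not taken from the paper.

```python
import numpy as np

def bspline_basis(i, p, t, knots):
    """Cox-de Boor recursion for the i-th B-spline basis function of degree p
    evaluated at the points t, given a non-decreasing knot vector `knots`.
    """
    if p == 0:
        return np.where((knots[i] <= t) & (t < knots[i + 1]), 1.0, 0.0)
    left_den = knots[i + p] - knots[i]
    right_den = knots[i + p + 1] - knots[i + 1]
    left = 0.0 if left_den == 0 else (t - knots[i]) / left_den * bspline_basis(i, p - 1, t, knots)
    right = 0.0 if right_den == 0 else (knots[i + p + 1] - t) / right_den * bspline_basis(i + 1, p - 1, t, knots)
    return left + right

# Quadratic B-spline basis functions on a uniform knot vector; on the interior
# of the knot span they sum to 1 (partition of unity).
knots = np.arange(8, dtype=float)            # 0, 1, ..., 7
t = np.linspace(2.0, 5.0, 7, endpoint=False)
B = np.array([bspline_basis(i, 2, t, knots) for i in range(5)])
print(B.sum(axis=0))                         # all (close to) 1 inside [2, 5)
```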


A Simple Proof of Optimal Approximations

Csikós, Mónika, Mustafa, Nabil H.

arXiv.org Machine Learning

The fundamental result of Li, Long, and Srinivasan on approximations of set systems has become a key tool across several communities such as learning theory, algorithms, combinatorics, and data analysis (described as 'the pinnacle of a long sequence of papers'). The goal of this paper is to give a simpler, self-contained, modular proof of this result for finite set systems. The only ingredient we assume is the standard Chernoff concentration bound. This makes the proof accessible to a wider audience, including readers not familiar with techniques from statistical learning theory, and makes it possible to cover it in a single self-contained lecture in an algorithms course.
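
The flavor of the result can be seen in the simpler Chernoff-plus-union-bound statement: a uniform sample of size O(log(m/delta)/eps^2) is an eps-approximation of a finite set system with m sets, with probability at least 1 - delta. (The Li-Long-Srinivasan result sharpens this dependence for set systems of bounded VC dimension.) The snippet below checks this empirically; the set system (random intervals) and parameters are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# A finite set system: ground set {0, ..., n-1}, sets = m random intervals.
n, m = 10_000, 200
ground = np.arange(n)
sets = []
for _ in range(m):
    a, b = sorted(rng.integers(0, n, 2))
    sets.append((a, b))                       # the set {a, a+1, ..., b-1}

# Sample size from Chernoff/Hoeffding + a union bound over the m sets.
eps, delta = 0.05, 0.01
sample_size = int(np.ceil(np.log(2 * m / delta) / (2 * eps ** 2)))
sample = rng.choice(ground, size=sample_size, replace=True)

worst = 0.0
for a, b in sets:
    true_frac = (b - a) / n                   # measure of the set in the ground set
    sample_frac = np.mean((sample >= a) & (sample < b))
    worst = max(worst, abs(true_frac - sample_frac))
print(f"sample size = {sample_size}, worst discrepancy = {worst:.4f} (eps = {eps})")
```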