approximant
Muon is Not That Special: Random or Inverted Spectra Work Just as Well
Shumaylov, Zakhar, Da Costa, Nathaël, Zaika, Peter, Mucsányi, Bálint, Massucco, Alex, Gelberg, Yoav, Schönlieb, Carola-Bibiane, Gal, Yarin, Hennig, Philipp
The recent empirical success of the Muon optimizer has renewed interest in non-Euclidean optimization, typically justified by similarities with second-order methods, and linear minimization oracle (LMO) theory. In this paper, we challenge this geometric narrative through three contributions, demonstrating that precise geometric structure is not the key factor affecting optimization performance. First, we introduce Freon, a family of optimizers based on Schatten (quasi-)norms, powered by a novel, provably optimal QDWH-based iterative approximation. Freon naturally interpolates between SGD and Muon, while smoothly extrapolating into the quasi-norm regime. Empirically, the best-performing Schatten parameters for GPT-2 lie strictly within the quasi-norm regime, and thus cannot be represented by any unitarily invariant LMO. Second, noting that Freon performs well across a wide range of exponents, we introduce Kaon, an absurd optimizer that replaces singular values with random noise. Despite lacking any coherent geometric structure, Kaon matches Muon's performance and retains classical convergence guarantees, proving that strict adherence to a precise geometry is practically irrelevant. Third, having shown that geometry is not the primary driver of performance, we demonstrate it is instead controlled by two local quantities: alignment and descent potential. Ultimately, each optimizer must tune its step size around these two quantities. While their dynamics are difficult to predict a-priori, evaluating them within a stochastic random feature model yields a precise insight: Muon succeeds not by tracking an ideal global geometry, but by guaranteeing step-size optimality.
Why quasicrystals shouldn't exist but are turning up in strange places
Why quasicrystals shouldn't exist but are turning up in strange places Matter with "forbidden" symmetries was once thought to be confined to lab experiments, but is now being found in some of the world's most extreme environments In autumn 1945, Lincoln LaPaz crouched over a patch of scorched ground in the Jornada del Muerto desert of New Mexico. LaPaz, an astronomer, was out hunting for meteorites. He had spotted something in the dust: a strange, glittering crust of blood-red glass. This was no meteorite, but it was striking enough that he held onto it. It wasn't until decades later that anyone would realise quite how special LaPaz's chance find was.
On Uniform Weighted Deep Polynomial approximation
Yeon, Kingsley, Damelin, Steven B.
It is a classical result in rational approximation theory that certain non-smooth or singular functions, such as $|x|$ and $x^{1/p}$, can be efficiently approximated using rational functions with root-exponential convergence in terms of degrees of freedom \cite{Sta, GN}. In contrast, polynomial approximations admit only algebraic convergence by Jackson's theorem \cite{Lub2}. Recent work shows that composite polynomial architectures can recover exponential approximation rates even without smoothness \cite{KY}. In this work, we introduce and analyze a class of weighted deep polynomial approximants tailored for functions with asymmetric behavior-growing unbounded on one side and decaying on the other. By multiplying a learnable deep polynomial with a one-sided weight, we capture both local non-smoothness and global growth. We show numerically that this framework outperforms Taylor, Chebyshev, and standard deep polynomial approximants, even when all use the same number of parameters. To optimize these approximants in practice, we propose a stable graph-based parameterization strategy building on \cite{Jar}.
Approximation Fixpoint Theory with Refined Approximation Spaces
Vanbesien, Linde, Bogaerts, Bart, Denecker, Marc
Approximation Fixpoint Theory (AFT) is a powerful theory covering various semantics of non-monotonic reasoning formalisms in knowledge representation such as Logic Programming and Answer Set Programming. Many semantics of such non-monotonic formalisms can be characterized as suitable fixpoints of a non-monotonic operator on a suitable lattice. Instead of working on the original lattice, AFT operates on intervals in such lattice to approximate or construct the fixpoints of interest. While AFT has been applied successfully across a broad range of non-monotonic reasoning formalisms, it is confronted by its limitations in other, relatively simple, examples. In this paper, we overcome those limitations by extending consistent AFT to deal with approximations that are more refined than intervals. Therefore, we introduce a more general notion of approximation spaces, showcase the improved expressiveness and investigate relations between different approximation spaces.
Converting MLPs into Polynomials in Closed Form
Recent work has shown that purely quadratic functions can replace MLPs in transformers with no significant loss in performance, while enabling new methods of interpretability based on linear algebra. In this work, we theoretically derive closed-form least-squares optimal approximations of feedforward networks (multilayer perceptrons and gated linear units) using polynomial functions of arbitrary degree. When the $R^2$ is high, this allows us to interpret MLPs and GLUs by visualizing the eigendecomposition of the coefficients of their linear and quadratic approximants. We also show that these approximants can be used to create SVD-based adversarial examples. By tracing the $R^2$ of linear and quadratic approximants across training time, we find new evidence that networks start out simple, and get progressively more complex. Even at the end of training, however, our quadratic approximants explain over 95% of the variance in network outputs.
NRSurNN3dq4: A Deep Learning Powered Numerical Relativity Surrogate for Binary Black Hole Waveforms
Freitas, Osvaldo Gramaxo, Theodoropoulos, Anastasios, Villanueva, Nino, Fernandes, Tiago, Nunes, Solange, Font, José A., Onofre, Antonio, Torres-Forné, Alejandro, Martin-Guerrero, José D.
Gravitational wave approximants are widely used tools in gravitational-wave astronomy. They allow for dense coverage of the parameter space of binary black hole (BBH) mergers for purposes of parameter inference, or, more generally, match filtering tasks, while avoiding the computationally expensive full evolution of numerical relativity simulations. However, this comes at a slight cost in terms of accuracy when compared to numerical relativity waveforms, depending on the approach. One way to minimize this is by constructing so-called~\textit{surrogate models} which, instead of using approximate physics or phenomenological formulae, rather interpolate within the space of numerical relativity waveforms. In this work, we introduce~\texttt{NRSurNN3dq4}, a surrogate model for non-precessing BBH merger waveforms powered by neural networks. By relying on the power of deep learning, this approximant is remarkably fast and competitively accurate, as it can generate millions of waveforms in a tenth of a second, while mismatches with numerical relativity waveforms are restrained below $10^{-3}$. We implement this approximant within the~\textsc{bilby} framework for gravitational-wave parameter inference, and show that it it is suitable for parameter estimation tasks.
Taming the Tail: Leveraging Asymmetric Loss and Pade Approximation to Overcome Medical Image Long-Tailed Class Imbalance
Kashyap, Pankhi, Tandon, Pavni, Gupta, Sunny, Tiwari, Abhishek, Kulkarni, Ritwik, Jadhav, Kshitij Sharad
Long-tailed problems in healthcare emerge from data imbalance due to variability in the prevalence and representation of different medical conditions, warranting the requirement of precise and dependable classification methods. Traditional loss functions such as cross-entropy and binary cross-entropy are often inadequate due to their inability to address the imbalances between the classes with high representation and the classes with low representation found in medical image datasets. We introduce a novel polynomial loss function based on Pade approximation, designed specifically to overcome the challenges associated with long-tailed classification. This approach incorporates asymmetric sampling techniques to better classify under-represented classes. We conducted extensive evaluations on three publicly available medical datasets and a proprietary medical dataset. Our implementation of the proposed loss function is open-sourced in the public repository:https://github.com/ipankhi/ALPA.
Attentional Ptycho-Tomography (APT) for three-dimensional nanoscale X-ray imaging with minimal data acquisition and computation time
Kang, Iksung, Wu, Ziling, Jiang, Yi, Yao, Yudong, Deng, Junjing, Klug, Jeffrey, Vogt, Stefan, Barbastathis, George
Noninvasive X-ray imaging of nanoscale three-dimensional objects, e.g. integrated circuits (ICs), generally requires two types of scanning: ptychographic, which is translational and returns estimates of complex electromagnetic field through ICs; and tomographic scanning, which collects complex field projections from multiple angles. Here, we present Attentional Ptycho-Tomography (APT), an approach trained to provide accurate reconstructions of ICs despite incomplete measurements, using a dramatically reduced amount of angular scanning. Training process includes regularizing priors based on typical IC patterns and the physics of X-ray propagation. We demonstrate that APT with 12-time reduced angles achieves fidelity comparable to the gold standard with the original set of angles. With the same set of reduced angles, APT also outperforms baseline reconstruction methods. In our experiments, APT achieves 108-time aggregate reduction in data acquisition and computation without compromising quality. We expect our physics-assisted machine learning framework could also be applied to other branches of nanoscale imaging.
Fast Differentiable Matrix Square Root and Inverse Square Root
Song, Yue, Sebe, Nicu, Wang, Wei
Computing the matrix square root and its inverse in a differentiable manner is important in a variety of computer vision tasks. Previous methods either adopt the Singular Value Decomposition (SVD) to explicitly factorize the matrix or use the Newton-Schulz iteration (NS iteration) to derive the approximate solution. However, both methods are not computationally efficient enough in either the forward pass or the backward pass. In this paper, we propose two more efficient variants to compute the differentiable matrix square root and the inverse square root. For the forward propagation, one method is to use Matrix Taylor Polynomial (MTP), and the other method is to use Matrix Pad\'e Approximants (MPA). The backward gradient is computed by iteratively solving the continuous-time Lyapunov equation using the matrix sign function. A series of numerical tests show that both methods yield considerable speed-up compared with the SVD or the NS iteration. Moreover, we validate the effectiveness of our methods in several real-world applications, including de-correlated batch normalization, second-order vision transformer, global covariance pooling for large-scale and fine-grained recognition, attentive covariance pooling for video recognition, and neural style transfer. The experimental results demonstrate that our methods can also achieve competitive and even slightly better performances. The Pytorch implementation is available at https://github.com/KingJamesSong/FastDifferentiableMatSqrt
A stabilizing reinforcement learning approach for sampled systems with partially unknown models
Beckenbach, Lukas, Osinenko, Pavel, Streif, Stefan
Reinforcement learning is commonly associated with training of reward-maximizing (or cost-minimizing) agents, in other words, controllers. It can be applied in model-free or model-based fashion, using a priori or online collected system data to train involved parametric architectures. In general, online reinforcement learning does not guarantee closed loop stability unless special measures are taken, for instance, through learning constraints or tailored training rules. Particularly promising are hybrids of reinforcement learning with "classical" control approaches. In this work, we suggest a method to guarantee practical stability of the system-controller closed loop in a purely online learning setting, i.e., without offline training. Moreover, we assume only partial knowledge of the system model. To achieve the claimed results, we employ techniques of classical adaptive control. The implementation of the overall control scheme is provided explicitly in a digital, sampled setting. That is, the controller receives the state of the system and computes the control action at discrete, specifically, equidistant moments in time. The method is tested in adaptive traction control and cruise control where it proved to significantly reduce the cost.