 diagonal block




Multiresolution Kernel Approximation for Gaussian Process Regression

Yi Ding, Risi Kondor, Jonathan Eskreis-Winkler

Neural Information Processing Systems

Gaussian process regression generally does not scale beyond a few thousand data points without some form of kernel approximation. Most approximations focus on the high-eigenvalue part of the spectrum of the kernel matrix, K, which leads to poor performance when the length scale of the kernel is small. In this paper we introduce Multiresolution Kernel Approximation (MKA), the first true broad-bandwidth kernel approximation algorithm.
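
For context, here is a minimal sketch (ours, not MKA) of a standard low-rank Nyström approximation, the kind of high-eigenvalue-focused method the abstract contrasts MKA against; with a small length scale the kernel matrix is nearly full-rank, so a fixed-rank approximation misses much of the spectrum.

```python
# Illustration of a high-eigenvalue-focused approximation (Nystrom); our sketch, not MKA.
import numpy as np

def rbf_kernel(X, Y, length_scale=0.1):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * length_scale ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 1))
K = rbf_kernel(X, X)

# Nystrom: pick m landmark points, approximate K ~= K_nm K_mm^{-1} K_mn
m = 50
idx = rng.choice(len(X), m, replace=False)
K_nm = rbf_kernel(X, X[idx])
K_mm = rbf_kernel(X[idx], X[idx])
K_ny = K_nm @ np.linalg.solve(K_mm + 1e-8 * np.eye(m), K_nm.T)

# With a small length scale, the rank-m approximation misses much of the spectrum,
# so the relative error stays large.
err = np.linalg.norm(K - K_ny) / np.linalg.norm(K)
print(f"relative Frobenius error of rank-{m} Nystrom approximation: {err:.3f}")
```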



On Structured State-Space Duality

Hu, Jerry Yao-Chieh, Zhang, Xiwen, Wu, Weimin, Liu, Han

arXiv.org Machine Learning

Structured State-Space Duality (SSD) [Dao & Gu, ICML 2024] is an equivalence between a simple Structured State-Space Model (SSM) and a masked attention mechanism. In particular, a state-space model with a scalar-times-identity state matrix is equivalent to masked self-attention with a $1$-semiseparable causal mask. Consequently, the same sequence transformation (model) has two algorithmic realizations: as a linear-time $O(T)$ recurrence or as a quadratic-time $O(T^2)$ attention. In this note, we formalize and generalize this duality: (i) we extend SSD from the scalar-identity case to general diagonal SSMs (diagonal state matrices); (ii) we show that these diagonal SSMs match the scalar case's training complexity lower bounds while supporting richer dynamics; (iii) we establish a necessary and sufficient condition under which an SSM is equivalent to $1$-semiseparable masked attention; and (iv) we show that such duality fails to extend to standard softmax attention due to rank explosion. Together, these results tighten the bridge between recurrent SSMs and Transformers, and widen the design space for expressive yet efficient sequence models.
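
A minimal numerical sketch of the stated duality for the scalar-times-identity case (our illustration, with toy dimensions and variable names of our choosing, not the paper's code): the same sequence transformation computed once as a linear-time recurrence and once as quadratic-time masked attention with a $1$-semiseparable mask.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 6, 4                      # sequence length, state size
x = rng.standard_normal(T)       # scalar input channel
a = rng.uniform(0.5, 1.0, T)     # per-step scalar decay (scalar-times-identity A_t)
B = rng.standard_normal((T, N))  # input projections B_t
C = rng.standard_normal((T, N))  # output projections C_t

# (1) Recurrent form, O(T): h_t = a_t * h_{t-1} + B_t * x_t,  y_t = C_t . h_t
h = np.zeros(N)
y_rec = np.empty(T)
for t in range(T):
    h = a[t] * h + B[t] * x[t]
    y_rec[t] = C[t] @ h

# (2) Attention form, O(T^2): y = (L * (C @ B.T)) @ x, where the causal mask
#     L[t, s] = a_{s+1} * ... * a_t for s <= t (1-semiseparable), else 0.
L = np.zeros((T, T))
for t in range(T):
    for s in range(t + 1):
        L[t, s] = np.prod(a[s + 1:t + 1])  # empty product = 1 when s == t
y_att = (L * (C @ B.T)) @ x

assert np.allclose(y_rec, y_att)  # the two realizations agree
```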


We thank all three reviewers for their thorough reviews and constructive feedback

Neural Information Processing Systems

We thank all three reviewers for their thorough reviews and constructive feedback. Otherwise, including additional second order information can make the results worse.

"...CGD still requires that the step-size is bounded by one over the max diagonal entry of the Hessian...":

Concern 1: Why not use full second order? See also our answer to Reviewer #7.

Concern 3: Is CGD scalable?


Fast-VAT: Accelerating Cluster Tendency Visualization using Cython and Numba

Avinash, MSR, Lachheb, Ismael

arXiv.org Artificial Intelligence

Visual Assessment of Cluster Tendency (VAT) is a widely used unsupervised technique to assess the presence of cluster structure in unlabeled datasets. However, its standard implementation suffers from significant performance limitations due to its O(n^2) time complexity and inefficient memory usage. In this work, we present Fast-VAT, a high-performance reimplementation of the VAT algorithm in Python, augmented with Numba's Just-In-Time (JIT) compilation and Cython's static typing and low-level memory optimizations. Our approach achieves up to a 50x speedup over the baseline implementation while preserving the output fidelity of the original method. We validate Fast-VAT on a suite of real and synthetic datasets -- including Iris, Mall Customers, and Spotify subsets -- and verify cluster tendency using the Hopkins statistic, PCA, and t-SNE. Additionally, we compare VAT's structural insights with clustering results from DBSCAN and K-Means to confirm its reliability.
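
A minimal sketch of the classic VAT reordering (a Prim-style minimum-spanning-tree traversal of the dissimilarity matrix) with a Numba JIT decorator, illustrating the kind of hot loop Fast-VAT accelerates; this is our illustration, not the Fast-VAT code, and the function name is ours (assumes numba is installed).

```python
import numpy as np
from numba import njit

@njit(cache=True)
def vat_order(D):
    """Return the VAT (Prim-style) ordering of a symmetric dissimilarity matrix D."""
    n = D.shape[0]
    order = np.empty(n, dtype=np.int64)
    visited = np.zeros(n, dtype=np.bool_)
    start = np.argmax(D) // n            # an endpoint of the largest dissimilarity
    order[0] = start
    visited[start] = True
    d = D[start].copy()                  # d[j]: min dissimilarity from j to the ordered set
    for r in range(1, n):
        best = np.inf
        nxt = -1
        for j in range(n):
            if not visited[j] and d[j] < best:
                best = d[j]
                nxt = j
        order[r] = nxt
        visited[nxt] = True
        for j in range(n):
            if not visited[j] and D[nxt, j] < d[j]:
                d[j] = D[nxt, j]
    return order

# usage: order = vat_order(D); D[order][:, order] is the ordered dissimilarity image
```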


DiaBlo: Diagonal Blocks Are Sufficient For Finetuning

Gurses, Selcuk, Zhang, Aozhong, Deng, Yanxia, Dong, Xun, Li, Xin, Wang, Naigang, Yin, Penghang, Yang, Zi

arXiv.org Artificial Intelligence

Finetuning is a critical step for adapting large language models (LLMs) to domain-specific downstream tasks. To mitigate the substantial computational and memory costs of full-model finetuning, Parameter-Efficient Finetuning (PEFT) methods have been proposed to update only a small subset of model parameters. However, performance gaps between PEFT approaches and full-model finetuning still exist. In this work, we present DiaBlo, a simple yet effective PEFT approach that updates only the diagonal blocks of selected model weight matrices. Unlike Low-Rank Adaptation (LoRA) and its variants, DiaBlo eliminates the need for low-rank matrix products, thereby avoiding the reliance on auxiliary initialization schemes or customized optimization strategies to improve convergence. This design leads to stable and robust convergence while maintaining memory efficiency and training speed comparable to LoRA. We conduct extensive experiments across a range of tasks, including commonsense reasoning, arithmetic reasoning, code generation, and safety alignment, to evaluate the effectiveness and efficiency of DiaBlo. Across these benchmarks, DiaBlo demonstrates strong and consistent performance while maintaining high memory efficiency and fast finetuning speed. Code is available at https://github.com/ziyangjoy/DiaBlo.
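
A hypothetical PyTorch sketch of the idea as described in the abstract, not the authors' released code (see the linked repository): the pretrained weight is frozen and only the diagonal blocks of an additive update are trained; the module and parameter names are ours.

```python
import torch
import torch.nn as nn

class DiagonalBlockAdapter(nn.Module):
    """Sketch: freeze a linear layer and train only the diagonal blocks
    of an additive update to its weight matrix."""

    def __init__(self, base: nn.Linear, num_blocks: int):
        super().__init__()
        out_f, in_f = base.weight.shape
        assert out_f % num_blocks == 0 and in_f % num_blocks == 0
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)          # pretrained weights stay frozen
        self.num_blocks = num_blocks
        self.block_in = in_f // num_blocks
        # one trainable (block_out x block_in) block per diagonal position,
        # initialized to zero so training starts exactly from the base model
        self.blocks = nn.Parameter(
            torch.zeros(num_blocks, out_f // num_blocks, self.block_in))

    def forward(self, x):
        y = self.base(x)
        # apply block b to the b-th input slice and add it to the b-th output
        # slice of the frozen layer (a block-diagonal weight update)
        xs = x.reshape(*x.shape[:-1], self.num_blocks, self.block_in)
        delta = torch.einsum('...ki,koi->...ko', xs, self.blocks)
        return y + delta.reshape(*x.shape[:-1], -1)
```

Because the update is zero-initialized and block-diagonal, this sketch adds only out_f * in_f / num_blocks trainable parameters per adapted matrix.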


Reviews: Exact natural gradient in deep linear networks and its application to the nonlinear case

Neural Information Processing Systems

The main result is that the natural gradient completely removes the pathological curvature introduced by depth, yielding exponential convergence in the total weights (as though it were a shallow network). The paper traces connections to a variety of previous methods for approximating the Fisher information matrix, and shows a preliminary application of the method to nonlinear networks (for which it is no longer exact), where it appears to speed up convergence. Major comments: This paper presents an elegant analysis of learning dynamics under the natural gradient. Even though the results are obtained for deep linear networks, they are decisive for this case and strongly suggest that future work in this direction could bring principled benefits for the nonlinear case (as shown at small scale in the nonlinear autoencoder experiment). The analysis provides solid intuitions for prior work on approximate second-order methods, including an interesting observation on the structure of the Hessian: it is far from block-diagonal, a common assumption in prior work, yet the off-diagonal blocks are repeats of the diagonal blocks, yielding similar results.
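
As a worked scalar illustration of the pathology the review alludes to (our example, not taken from the paper): for a depth-$N$ scalar linear network $f(x) = w_N \cdots w_1 x$, the loss depends on the weights only through the product $p = \prod_i w_i$, yet the ordinary gradient $\partial L/\partial w_i = (\partial L/\partial p)\prod_{j \neq i} w_j$ scales each coordinate by the product of all the other weights, so the effective curvature in $w$-space is badly conditioned and worsens with depth. The natural gradient corrects for this metric distortion, which is why the dynamics of the total weight $p$ can match those of an equivalent shallow (depth-1) model, as the review's summary states.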

