Goto

Collaborating Authors

 low rank



Multiresolution Kernel Approximation for Gaussian Process Regression

Neural Information Processing Systems

Gaussian process regression generally does not scale to beyond a few thousands data points without applying some sort of kernel approximation method. Most approximations focus on the high eigenvalue part of the spectrum of the kernel matrix, K, which leads to bad performance when the length scale of the kernel is small. In this paper we introduce Multiresolution Kernel Approximation (MKA), the first true broad bandwidth kernel approximation algorithm.


Compressing Large Language Models using Low Rank and Low Precision Decomposition

Neural Information Processing Systems

This work introduces \rm CALDERA -- a new post-training LLM compression algorithm that harnesses the inherent low-rank structure of a weight matrix \mathbf{W} by approximating it via a low-rank, low-precision decomposition as \mathbf{W} \approx \mathbf{Q} \mathbf{L}\mathbf{R} . Here, \mathbf{L} and \mathbf{R} are low rank factors, and the entries of \mathbf{Q}, \mathbf{L} and \mathbf{R} are quantized. The model is compressed by substituting each layer with its \mathbf{Q} \mathbf{L}\mathbf{R} decomposition, and the zero-shot performance of the compressed model is evaluated. Additionally, \mathbf{L} and \mathbf{R} are readily amenable to low-rank adaptation, consequently enhancing the zero-shot performance. Theoretical upper bounds on the approximation error of \rm CALDERA are established using a rank-constrained regression framework, and the tradeoff between compression ratio and model performance is studied by analyzing the impact of target rank and quantization bit budget.


Export Reviews, Discussions, Author Feedback and Meta-Reviews

Neural Information Processing Systems

We thank all reviewers for comments. The following is the response to each reviewer. To Reviewer_1: The setting of lambda, lambda2 is implicitly stated in Thm 1, as a particular setting of {\mathcal M} and {\mathcal N} in (3) corresponds to a setting of lambda, lambda2 in (2). However, admittedly, since the setting of \mathcal M (and \mathcal N) are related to feature quality (i.e. To Reviewer_2: In our formulation, we break the target matrix into two parts, R XMY T N. As noted, there are infinite solutions if we don't constrain on the solution space of M and N, as for any M we can let N R-XMT T. However, since R is low rank (says rank k), it is natural to seek a simple and explanatory solution where some of R's subspace (says rank r) is spanned by feature part XMY T and the remaining subspace (rank k-r) is spanned by N. And since XMY T is low rank, it is reasonable to assume M is also low rank, i.e.


Reviews: Learning Nonsymmetric Determinantal Point Processes

Neural Information Processing Systems

This paper studies determinantal point processes (DPP) with non-symmetric kernels. Most of the machine learning literature on DPP assumes symmetric kernels, and the prior work that studies non-symmetric kernels have assumed a quite restricted class of non-symmetric kernels. The novelty of this paper is in proposing the learning algorithm for a fairly general class of non-symmetric kernels. The proposed approach assumes a particular representation of non-symmetric kernels. This representation follows from two known results in a rather straightforward manner, as I also summarize in "1.


Reviews: Practical Methods for Graph Two-Sample Testing

Neural Information Processing Systems

This paper studies the problem of two-sample testing of large graphs under the inhomogeneous Erdos Renyi model. This model is pretty generic, and assumes that an undirected edge (ij) is in the graph with probability P_{ij} independently of all other edges. Most generically the parameter matrix P could be anything symmetric (zero diagonal), but common models are stochastic block model or mixed membership stochastic block model, which both result in P being low rank. Suppose there were two random graph distributions, parameterized by matrices P and Q, and the goal is to test whether P Q or not (the null hypothesis being that they are equal). They assume that the graphs are vertex-aligned, which helps as it reduces the problem of searching over permutations to align the graphs.


Learning Low-Dimensional Metrics

Neural Information Processing Systems

This paper investigates the theoretical foundations of metric learning, focused on three key questions that are not fully addressed in prior work: 1) we consider learning general low-dimensional (low-rank) metrics as well as sparse metrics; 2) we develop upper and lower (minimax) bounds on the generalization error; 3) we quantify the sample complexity of metric learning in terms of the dimension of the feature space and the dimension/rank of the underlying metric; 4) we also bound the accuracy of the learned metric relative to the underlying true generative metric. All the results involve novel mathematical approaches to the metric learning problem, and also shed new light on the special case of ordinal embedding (aka non-metric multidimensional scaling).


Multiresolution Kernel Approximation for Gaussian Process Regression

Neural Information Processing Systems

Gaussian process regression generally does not scale to beyond a few thousands data points without applying some sort of kernel approximation method. Most approximations focus on the high eigenvalue part of the spectrum of the kernel matrix, K, which leads to bad performance when the length scale of the kernel is small. In this paper we introduce Multiresolution Kernel Approximation (MKA), the first true broad bandwidth kernel approximation algorithm.


Matrix Low-Rank Trust Region Policy Optimization

arXiv.org Artificial Intelligence

Most methods in reinforcement learning use a Policy Gradient (PG) approach to learn a parametric stochastic policy that maps states to actions. The standard approach is to implement such a mapping via a neural network (NN) whose parameters are optimized using stochastic gradient descent. However, PG methods are prone to large policy updates that can render learning inefficient. Trust region algorithms, like Trust Region Policy Optimization (TRPO), constrain the policy update step, ensuring monotonic improvements. This paper introduces low-rank matrix-based models as an efficient alternative for estimating the parameters of TRPO algorithms. By gathering the stochastic policy's parameters into a matrix and applying matrix-completion techniques, we promote and enforce low rank. Our numerical studies demonstrate that low-rank matrix-based policy models effectively reduce both computational and sample complexities compared to NN models, while maintaining comparable aggregated rewards.


ColA: Collaborative Adaptation with Gradient Learning

arXiv.org Artificial Intelligence

A primary function of back-propagation is to compute both the gradient of hidden representations and parameters for optimization with gradient descent. Training large models requires high computational costs due to their vast parameter sizes. While Parameter-Efficient Fine-Tuning (PEFT) methods aim to train smaller auxiliary models to save computational space, they still present computational overheads, especially in Fine-Tuning as a Service (FTaaS) for numerous users. We introduce Collaborative Adaptation (ColA) with Gradient Learning (GL), a parameter-free, model-agnostic fine-tuning approach that decouples the computation of the gradient of hidden representations and parameters. In comparison to PEFT methods, ColA facilitates more cost-effective FTaaS by offloading the computation of the gradient to low-cost devices. We also provide a theoretical analysis of ColA and experimentally demonstrate that ColA can perform on par or better than existing PEFT methods on various benchmarks.