kernel model
Kernel Model Validation: How To Do It, And Why You Should Care
Graziani, Carlo, Ngom, Marieme
Gaussian Process (GP) models are popular tools in uncertainty quantification (UQ) because they purport to furnish functional uncertainty estimates that can be used to represent model uncertainty . It is often difficult to state with precision what probabilistic interpretation attaches to such an uncertainty, and in what way is it calibrated. Without such a calibration statement, the value of such uncertainty estimates is quite limited and qualitative. We motivate the importance of proper probabilistic calibration of GP predictions by describing how GP predictive calibration failures can cause degraded convergence properties in a target optimization algorithm called T argeted Adaptive Design (T AD). We discuss the interpretation of GP-generated uncertainty intervals in UQ, and how one may learn to trust them, through a formal procedure for covariance kernel validation that exploits the multivariate normal nature of GP predictions. We give simple examples of GP regression misspecified 1-dimensional models, and discuss the situation with respect to higher-dimensional models.
Joker: Joint Optimization Framework for Lightweight Kernel Machines
Kernel methods are powerful tools for nonlinear learning with well-established theory. The scalability issue has been their long-standing challenge. Despite the existing success, there are two limitations in large-scale kernel methods: (i) The memory overhead is too high for users to afford; (ii) existing efforts mainly focus on kernel ridge regression (KRR), while other models lack study. In this paper, we propose Joker, a joint optimization framework for diverse kernel models, including KRR, logistic regression, and support vector machines. We design a dual block coordinate descent method with trust region (DBCD-TR) and adopt kernel approximation with randomized features, leading to low memory costs and high efficiency in large-scale learning. Experiments show that Joker saves up to 90\% memory but achieves comparable training time and performance (or even better) than the state-of-the-art methods.
Fast training of large kernel models with delayed projections
Abedsoltan, Amirhesam, Ma, Siyuan, Pandit, Parthe, Belkin, Mikhail
Classical kernel machines have historically faced significant challenges in scaling to large datasets and model sizes--a key ingredient that has driven the success of neural networks. In this paper, we present a new methodology for building kernel machines that can scale efficiently with both data size and model size. Our algorithm introduces delayed projections to Preconditioned Stochastic Gradient Descent (PSGD) allowing the training of much larger models than was previously feasible, pushing the practical limits of kernel-based learning. They have also served as the foundation 2024) leverage the Nystrรถm Approximation (NA) in combination for understanding many significant phenomena in with other strategies to enhance performance. Despite these advantages, ASkotch combines it with block coordinate descent, the scalability of kernel methods has remained a persistent whereas Falkon combines it with the Conjugate Gradient challenge, particularly when applied to large datasets. However, this limitation is critical for expanding the utility these strategies are limited by model size due to memory of kernel-based techniques in modern machine learning applications.
When does compositional structure yield compositional generalization? A kernel theory
Lippl, Samuel, Stachenfeld, Kim
Compositional generalization (the ability to respond correctly to novel combinations of familiar components) is thought to be a cornerstone of intelligent behavior. Compositionally structured (e.g. disentangled) representations are essential for this; however, the conditions under which they yield compositional generalization remain unclear. To address this gap, we present a general theory of compositional generalization in kernel models with fixed, potentially nonlinear representations (which also applies to neural networks in the "lazy regime"). We prove that these models are functionally limited to adding up values assigned to conjunctions/combinations of components that have been seen during training ("conjunction-wise additivity"), and identify novel compositionality failure modes that arise from the data and model structure, even for disentangled inputs. For models in the representation learning (or "rich") regime, we show that networks can generalize on an important non-additive task (associative inference), and give a mechanistic explanation for why. Finally, we validate our theory empirically, showing that it captures the behavior of deep neural networks trained on a set of compositional tasks. In sum, our theory characterizes the principles giving rise to compositional generalization in kernel models and shows how representation learning can overcome their limitations. We further provide a formally grounded, novel generalization class for compositional tasks that highlights fundamental differences in the required learning mechanisms (conjunction-wise additivity).
Finetuning greedy kernel models by exchange algorithms
Kernel based approximation offers versatile tools for high-dimensional approximation, which can especially be leveraged for surrogate modeling. For this purpose, both "knot insertion" and "knot removal" approaches aim at choosing a suitable subset of the data, in order to obtain a sparse but nevertheless accurate kernel model. In the present work, focussing on kernel based interpolation, we aim at combining these two approaches to further improve the accuracy of kernel models, without increasing the computational complexity of the final kernel model. For this, we introduce a class of kernel exchange algorithms (KEA). The resulting KEA algorithm can be used for finetuning greedy kernel surrogate models, allowing for an reduction of the error up to 86.4% (17.2% on average) in our experiments.
On the Nystrom Approximation for Preconditioning in Kernel Machines
Abedsoltan, Amirhesam, Belkin, Mikhail, Pandit, Parthe, Rademacher, Luis
Kernel methods are a popular class of nonlinear predictive models in machine learning. Scalable algorithms for learning kernel models need to be iterative in nature, but convergence can be slow due to poor conditioning. Spectral preconditioning is an important tool to speed-up the convergence of such iterative algorithms for training kernel models. However computing and storing a spectral preconditioner can be expensive which can lead to large computational and storage overheads, precluding the application of kernel methods to problems with large datasets. A Nystrom approximation of the spectral preconditioner is often cheaper to compute and store, and has demonstrated success in practical applications. In this paper we analyze the trade-offs of using such an approximated preconditioner. Specifically, we show that a sample of logarithmic size (as a function of the size of the dataset) enables the Nystrom-based approximated preconditioner to accelerate gradient descent nearly as well as the exact preconditioner, while also reducing the computational and storage overheads.
Toward Large Kernel Models
Abedsoltan, Amirhesam, Belkin, Mikhail, Pandit, Parthe
Recent studies indicate that kernel machines can often perform similarly or better than deep neural networks (DNNs) on small datasets. The interest in kernel machines has been additionally bolstered by the discovery of their equivalence to wide neural networks in certain regimes. However, a key feature of DNNs is their ability to scale the model size and training data size independently, whereas in traditional kernel machines model size is tied to data size. Because of this coupling, scaling kernel machines to large data has been computationally challenging. In this paper, we provide a way forward for constructing large-scale general kernel models, which are a generalization of kernel machines that decouples the model and data, allowing training on large datasets. Specifically, we introduce EigenPro 3.0, an algorithm based on projected dual preconditioned SGD and show scaling to model and data sizes which have not been possible with existing kernel methods.
Geometric Graph Learning with Extended Atom-Types Features for Protein-Ligand Binding Affinity Prediction
Rana, Md Masud, Nguyen, Duc Duy
In recent years, graph theories have been widely used in chemical, biological, physical, social, and computer sciences. This is because graphs are useful for representing and analyzing a wide range of practical problems. In molecular modeling, graph representation is widely used since it is a natural way to model their structures, in which graph vertices represent atoms and graph edges represent possible interactions between them. In general, graph theories can be divided into three categories: geometric graph theory, algebraic graph theory, and topological graph theory. Geometric graph theory studies a graph's geometric connectivity, which refers to the pairwise relations among graph nodes or vertices [1]. Algebraic graph theory concerns the algebraic connectivity via the characteristic polynomial, eigenvalues, and eigenvectors of matrices associated with the graph, such as the adjacency matrix or the Laplacian matrix [2, 3]. In topological graph theory, embedding and immersion of graphs are studied along with their association with topological spaces, such as abstract simplicial complexes [4, 5]. There are numerous applications of graphs in chemical analysis and biomolecular modeling [6, 7, 8, 9], such as normal-mode analysis (NMA) [10, 11, 12, 13] and elastic network model (ENM) [14, 15, 16, 17, 18, 19] used to study protein B-factor prediction. Algebraic graph theory has been utilized in some of the most popular elastic network models (ENMs) such as the Gaussian network model (GNM) and the anisotropic network model (ANM).
Kernel Methods and Multi-layer Perceptrons Learn Linear Models in High Dimensions
Sahraee-Ardakan, Mojtaba, Emami, Melikasadat, Pandit, Parthe, Rangan, Sundeep, Fletcher, Alyson K.
Empirical observation of high dimensional phenomena, such as the double descent behaviour, has attracted a lot of interest in understanding classical techniques such as kernel methods, and their implications to explain generalization properties of neural networks. Many recent works analyze such models in a certain high-dimensional regime where the covariates are independent and the number of samples and the number of covariates grow at a fixed ratio (i.e. proportional asymptotics). In this work we show that for a large class of kernels, including the neural tangent kernel of fully connected networks, kernel methods can only perform as well as linear models in this regime. More surprisingly, when the data is generated by a kernel model where the relationship between input and the response could be very nonlinear, we show that linear models are in fact optimal, i.e. linear models achieve the minimum risk among all models, linear or nonlinear. These results suggest that more complex models for the data other than independent features are needed for high-dimensional analysis.
Sparse Least Squares Low Rank Kernel Machines
Fang, Manjing, Xu, Di, Hong, Xia, Gao, Junbin
Abstract--A general framework of least squares support vector machine with low rank kernels, referred to as LR-LSSVM, is introduced in this paper . The special structure of low rank kernels with a controlled model size brings sparsity as well as computational efficiency to the proposed model. Meanwhile, a two-step optimization algorithm with three different criteria is proposed and various experiments are carried out using the example of the so-call robust RBF kernel to validate the model. The experiment results show that the performance of the proposed algorithm is comparable or superior to several existing kernel machines. With the proliferation of big data in scientific and business research, in practical nonlinear modeling approaches, one wishes to build sparse models with more efficient algorithms. Kernel machines (KMs) have attracted great attention since the support vector machines (SVM), a well linear binary classification model under the principle of risk minimization, was introduced in earlier 1990s [1]. In fact, KMs have extended SVM by implementing the linearity in the so-called high dimensional feature space under a feature mapping implicitly determined by a Mercer kernel function.