base function
Supplementary Material for "Multi-task Causal Learning with Gaussian Processes"
Eq. (4) gives the causal operator. The set C represents the smallest set for which Eq. (2) holds. The conditions in Theorem 3.1 allow for full transfer across all intervention functions in This is equivalent to sampling from the mutilated graph. We compute the integrals in Eqs. Finally, we fix the variance in the likelihood of Eq.
Online Gradient Boosting
Alina Beygelzimer, Elad Hazan, Satyen Kale, Haipeng Luo
We extend the theory of boosting for regression problems to the online learning setting. Generalizing from the batch setting for boosting, the notion of a weak learning algorithm is modeled as an online learning algorithm with linear loss functions that competes with a base class of regression functions, while a strong learning algorithm is an online learning algorithm with smooth convex loss functions that competes with a larger class of regression functions. Our main result is an online gradient boosting algorithm that converts a weak online learning algorithm into a strong one where the larger class of functions is the linear span of the base class. We also give a simpler boosting algorithm that converts a weak online learning algorithm into a strong one where the larger class of functions is the convex hull of the base class, and prove its optimality.
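The reduction described in this abstract can be illustrated with a small sketch. It is a simplified stagewise scheme written for this listing, not the authors' algorithm: the base class (clipped linear functions), the squared loss, the step sizes, and the names `WeakLearner`, `OnlineBooster`, and `run` are all assumptions. Each weak learner only ever sees a linear loss whose slope is the derivative of the smooth loss at the partial prediction built from the learners before it.

```python
import random


class WeakLearner:
    """Online gradient descent over the base class {x -> w * x : |w| <= 1}."""

    def __init__(self, lr):
        self.w = 0.0
        self.lr = lr

    def predict(self, x):
        return self.w * x

    def update(self, x, g):
        # The learner is charged the linear loss f -> g * f(x).
        # Its gradient with respect to w is g * x; clip to stay in the class.
        self.w = max(-1.0, min(1.0, self.w - self.lr * g * x))


class OnlineBooster:
    """Combines weak learners into a predictor in the span of the base class."""

    def __init__(self, n_learners=20, eta=0.2, lr=1.0):
        self.learners = [WeakLearner(lr) for _ in range(n_learners)]
        self.eta = eta

    def predict(self, x):
        return sum(self.eta * h.predict(x) for h in self.learners)

    def update(self, x, y):
        partial = 0.0
        for h in self.learners:
            # Derivative of the squared loss (p - y)^2 at the partial sum.
            g = 2.0 * (partial - y)
            contribution = self.eta * h.predict(x)  # value used this round
            h.update(x, g)
            partial += contribution


def run(rounds=10000, seed=0):
    """Stream y = 1.5 * x, a target outside the base class but inside its span."""
    rng = random.Random(seed)
    booster = OnlineBooster()
    losses = []
    for _ in range(rounds):
        x = rng.uniform(-1.0, 1.0)
        y = 1.5 * x
        losses.append((booster.predict(x) - y) ** 2)
        booster.update(x, y)
    return booster, losses
```

No single base function can reach slope 1.5 here, since each is limited to slopes in [-1, 1], but the weighted sum of twenty of them can get within one step `eta` of it. The sketch omits the shrinkage and step-size analysis that a regret guarantee would need.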
Causal rule ensemble approach for multi-arm data
Wan, Ke, Tanioka, Kensuke, Shimokawa, Toshio
Heterogeneous treatment effect (HTE) estimation is critical in medical research. It provides insights into how treatment effects vary among individuals, which can provide statistical evidence for precision medicine. While most existing methods focus on binary treatment situations, real-world applications often involve multiple interventions. However, current HTE estimation methods are primarily designed for binary comparisons and often rely on black-box models, which limit their applicability and interpretability in multi-arm settings. To address these challenges, we propose an interpretable machine learning framework for HTE estimation in multi-arm trials. Our method employs a rule-based ensemble approach consisting of rule generation, rule ensemble, and HTE estimation, ensuring both predictive accuracy and interpretability. Through extensive simulation studies and real data applications, the performance of our method was evaluated against state-of-the-art multi-arm HTE estimation approaches. The results indicate that our approach achieved lower bias and higher estimation accuracy compared with those of existing methods. Furthermore, the interpretability of our framework allows clearer insights into how covariates influence treatment effects, facilitating clinical decision making. By bridging the gap between accuracy and interpretability, our study contributes a valuable tool for multi-arm HTE estimation, supporting precision medicine.
Understanding the Generalization of In-Context Learning in Transformers: An Empirical Study
Zhang, Xingxuan, Wang, Haoran, Li, Jiansheng, Xue, Yuan, Guan, Shikai, Xu, Renzhe, Zou, Hao, Yu, Han, Cui, Peng
Large language models (LLMs) like GPT-4 and LLaMA-3 utilize the powerful in-context learning (ICL) capability of the Transformer architecture to learn on the fly from limited examples. While ICL underpins many LLM applications, its full potential remains hindered by a limited understanding of its generalization boundaries and vulnerabilities. We present a systematic investigation of transformers' generalization capability with ICL relative to training data coverage by defining a task-centric framework along three dimensions: inter-problem, intra-problem, and intra-task generalization. Through extensive simulation and real-world experiments, encompassing tasks such as function fitting, API calling, and translation, we find that transformers lack inter-problem generalization with ICL, but excel in intra-task and intra-problem generalization. When the training data includes a greater variety of mixed tasks, the generalization ability of ICL improves significantly on unseen tasks and even on known simple tasks. This guides us in designing training data to maximize the diversity of tasks covered and to combine different tasks whenever possible, rather than solely focusing on the target task for testing.
Kolmogorov-Arnold Transformer
Transformers stand as the cornerstone of modern deep learning. Traditionally, these models rely on multi-layer perceptron (MLP) layers to mix the information between channels. In this paper, we introduce the Kolmogorov-Arnold Transformer (KAT), a novel architecture that replaces MLP layers with Kolmogorov-Arnold Network (KAN) layers to enhance the expressiveness and performance of the model. Integrating KANs into transformers, however, is no easy feat, especially when scaled up. Specifically, we identify three key challenges: (C1) Base function. The standard B-spline function used in KANs is not optimized for parallel computing on modern hardware, resulting in slower inference speeds. (C2) Parameter and computation inefficiency. KAN requires a unique function for each input-output pair, making the computation extremely large. (C3) Weight initialization. The initialization of weights in KANs is particularly challenging due to their learnable activation functions, which are critical for achieving convergence in deep neural networks. To overcome these challenges, we propose three key solutions: (S1) Rational basis. We replace B-spline functions with rational functions to improve compatibility with modern GPUs. By implementing this in CUDA, we achieve faster computations. (S2) Group KAN. We share the activation weights through a group of neurons, to reduce the computational load without sacrificing performance. (S3) Variance-preserving initialization. We carefully initialize the activation weights to make sure that the activation variance is maintained across layers. With these designs, KAT scales effectively and readily outperforms traditional MLP-based transformers.
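Solutions (S1) and (S2) can be made concrete with a short sketch in plain Python. The abstract does not give the exact form of the rational function, so the denominator 1 + |b1 x + ... + bn x^n| (chosen here so that it can never reach zero), the coefficient layout, and the names `rational` and `group_rational` are assumptions for illustration, not the paper's CUDA implementation.

```python
def rational(x, a, b):
    """Evaluate P(x) / (1 + |Q(x)|) with Horner's rule.

    a = [a0, a1, ..., am] are numerator coefficients.
    b = [b1, ..., bn] are denominator coefficients (no constant term).
    """
    num = 0.0
    for coef in reversed(a):
        num = num * x + coef
    den = 0.0
    for coef in reversed(b):
        den = den * x + coef
    den *= x  # b starts at the x^1 term, so shift by one power
    return num / (1.0 + abs(den))


def group_rational(channels, group_coeffs):
    """Apply one shared rational activation per group of channels.

    channels: list of scalar activations, one per channel.
    group_coeffs: list of (a, b) pairs, one pair per group.
    """
    n_groups = len(group_coeffs)
    if n_groups == 0 or len(channels) % n_groups != 0:
        raise ValueError("channel count must be a multiple of the group count")
    size = len(channels) // n_groups
    return [rational(x, *group_coeffs[i // size]) for i, x in enumerate(channels)]
```

With four channels and two groups, only two coefficient sets are stored instead of four, which is the parameter saving that group sharing is meant to provide.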
Kolmogorov-Arnold Networks (KAN) for Time Series Classification and Robust Analysis
Dong, Chang, Zheng, Liangwei, Chen, Weitong
Kolmogorov-Arnold Networks (KAN) have recently attracted significant attention as a promising alternative to traditional Multi-Layer Perceptrons (MLP). Despite their theoretical appeal, KAN require validation on large-scale benchmark datasets. Time series data, which have become increasingly prevalent in recent years, and univariate time series in particular, are naturally suited for validating KAN. Therefore, we conducted a fair comparison among KAN, MLP, and mixed structures. The results indicate that KAN can achieve performance comparable to, or even slightly better than, MLP across 128 time series datasets. We also performed an ablation study on KAN, revealing that the output is primarily determined by the base component instead of the B-spline function. Furthermore, we assessed the robustness of these models and found that KAN and the hybrid structure MLP_KAN exhibit significant robustness advantages, attributed to their lower Lipschitz constants. This suggests that KAN and KAN layers hold strong potential to be robust models or to improve the adversarial robustness of other models.
A Machine Learning-based Approach for Solving Recurrence Relations and its use in Cost Analysis of Logic Programs
Rustenholz, Louis, Klemen, Maximiliano, Carreira-Perpiñán, Miguel Ángel, López-García, Pedro
Automatic static cost analysis infers information about the resources used by programs without actually running them with concrete data, and presents such information as functions of input data sizes. Most of the analysis tools for logic programs (and many for other languages), such as CiaoPP, are based on setting up recurrence relations representing (bounds on) the computational cost of predicates, and solving them to find closed-form functions. Such recurrence solving is a bottleneck in current tools: many of the recurrences that arise during the analysis cannot be solved with state-of-the-art solvers, including Computer Algebra Systems (CASs), so that specific methods for different classes of recurrences need to be developed. We address this challenge by developing a novel, general approach for solving arbitrary, constrained recurrence relations, which uses machine-learning (sparse-linear and symbolic) regression techniques to guess a candidate closed-form function, and a combination of an SMT-solver and a CAS to check whether it is actually a solution of the recurrence. Our prototype implementation and its experimental evaluation within the context of the CiaoPP system show quite promising results. Overall, for the considered benchmarks, our approach outperforms state-of-the-art cost analyzers and recurrence solvers, and solves recurrences that cannot be solved by them.
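The guess-and-check idea in this abstract can be sketched on a toy recurrence, f(0) = 0 and f(n) = f(n-1) + n. The sketch makes two loud substitutions: an exact rational fit over a small hand-picked set of base functions {1, n, n^2} stands in for the sparse-linear regression, and a finite substitution test stands in for the SMT-solver and CAS check, so passing it is evidence rather than proof. All names (`BASIS`, `guess`, `check`, and so on) are hypothetical.

```python
from fractions import Fraction


def recurrence_values(n_max):
    """Unroll f(0) = 0, f(n) = f(n-1) + n up to n_max."""
    vals = [Fraction(0)]
    for n in range(1, n_max + 1):
        vals.append(vals[-1] + n)
    return vals


# Candidate base functions for the closed form (an assumption of this sketch).
BASIS = [lambda n: Fraction(1), lambda n: Fraction(n), lambda n: Fraction(n) ** 2]


def solve_exact(rows, rhs):
    """Gauss-Jordan elimination over Fractions for a square nonsingular system."""
    size = len(rows)
    aug = [list(row) + [value] for row, value in zip(rows, rhs)]
    for col in range(size):
        pivot = next(r for r in range(col, size) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(size):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][size] for i in range(size)]


def guess(vals):
    """Fit coefficients of the base functions on the first len(BASIS) points."""
    points = range(len(BASIS))
    rows = [[phi(n) for phi in BASIS] for n in points]
    return solve_exact(rows, [vals[n] for n in points])


def candidate(coeffs, n):
    return sum(c * phi(n) for c, phi in zip(coeffs, BASIS))


def check(coeffs, n_max):
    """Substitute the candidate into the recurrence for n = 0..n_max."""
    if candidate(coeffs, 0) != 0:
        return False
    return all(
        candidate(coeffs, n) == candidate(coeffs, n - 1) + n
        for n in range(1, n_max + 1)
    )
```

For this recurrence the fit recovers the coefficients 0, 1/2, 1/2, that is, the closed form n(n+1)/2, and the substitution test accepts it on every point tried.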
ParFam -- Symbolic Regression Based on Continuous Global Optimization
Scholl, Philipp, Bieker, Katharina, Hauger, Hillary, Kutyniok, Gitta
Symbolic regression (SR) describes the task of finding a symbolic function that accurately represents the connection between given input and output data. At the same time, the function should be as simple as possible to ensure robustness against noise and interpretability. This is of particular interest for applications where the aim is to (mathematically) analyze the resulting function afterward or get further insights into the process to ensure trustworthiness, for instance, in physical or chemical sciences (Quade et al., 2016; Angelis et al., 2023; Wang et al., 2019). The range of possible applications of SR is therefore vast, from predicting the dynamics of ecosystems (Chen et al., 2019), forecasting the solar power for energy production (Quade et al., 2016), estimating the development of financial markets (Liu and Guo, 2023), and analyzing the stability of certain materials (He and Zhang, 2021) to planning optimal trajectories for robots (Oplatkova and Zelinka, 2007), to name but a few. Moreover, as Angelis et al. (2023) point out, the number of papers on SR has increased significantly in recent years, highlighting the relevance and research interest in this area. SR is a specific regression task in machine learning that aims to find an accurate model without any assumption by the user related to the specific data set.