Kaski, Samuel
PABBO: Preferential Amortized Black-Box Optimization
Zhang, Xinyu, Huang, Daolang, Kaski, Samuel, Martinelli, Julien
Preferential Bayesian Optimization (PBO) is a sample-efficient method for learning latent user utilities from preferential feedback over pairs of designs. It relies on a statistical surrogate model for the latent function, usually a Gaussian process, and an acquisition strategy to select the next candidate pair on which to elicit user feedback. Due to the non-conjugacy of the associated likelihood, every PBO step requires a significant amount of computation with various approximate inference techniques. This computational overhead is incompatible with the way humans interact with computers, hindering the use of PBO in real-world cases. Building on recent advances in amortized BO, we propose to circumvent this issue by fully amortizing PBO, meta-learning both the surrogate and the acquisition function. Our method comprises a novel transformer neural process architecture, trained using reinforcement learning and tailored auxiliary losses. On a benchmark composed of synthetic and real-world datasets, our method is several orders of magnitude faster than the usual Gaussian-process-based strategies and often outperforms them in accuracy.
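For readers unfamiliar with why each PBO step is expensive, the sketch below shows the standard probit preferential likelihood used with Gaussian-process utilities; it is non-Gaussian in the latent function, hence non-conjugate with a GP prior. This is a generic illustration of the baseline setting, not PABBO's amortized model, and the function names and probit choice are assumptions.

    # Minimal sketch of the probit preference likelihood that GP-based PBO
    # must approximate at every step; names and the probit choice are
    # illustrative assumptions, not PABBO's code.
    from math import erf, sqrt

    def probit(z):
        # Standard normal CDF via the error function.
        return 0.5 * (1.0 + erf(z / sqrt(2.0)))

    def preference_likelihood(f_winner, f_loser, noise=1.0):
        # P(winner preferred | latent utilities f): non-Gaussian in f,
        # hence non-conjugate with a GP prior, forcing approximate inference.
        return probit((f_winner - f_loser) / (sqrt(2.0) * noise))

    print(preference_likelihood(1.2, 0.3))  # ~0.74 for utilities 1.2 vs 0.3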
Memento No More: Coaching AI Agents to Master Multiple Tasks via Hints Internalization
Alakuijala, Minttu, Gao, Ya, Ananov, Georgy, Kaski, Samuel, Marttinen, Pekka, Ilin, Alexander, Valpola, Harri
As the general capabilities of artificial intelligence (AI) agents continue to evolve, their ability to learn to master multiple complex tasks through experience remains a key challenge. Current LLM agents, particularly those based on proprietary language models, typically rely on prompts to incorporate knowledge about the target tasks. This approach does not allow the agent to internalize this information and instead relies on ever-expanding prompts to sustain its functionality in diverse scenarios. This resembles a system of notes used by a person affected by anterograde amnesia, the inability to form new memories. In this paper, we propose a novel method to train AI agents to incorporate knowledge and skills for multiple tasks without the need for either cumbersome note systems or prior high-quality demonstration data. Our approach employs an iterative process in which the agent collects new experiences, receives corrective feedback from humans in the form of hints, and integrates this feedback into its weights via a context distillation training procedure. We demonstrate the efficacy of our approach by implementing it in a Llama-3-based agent which, after only a few rounds of feedback, outperforms the advanced models GPT-4o and DeepSeek-V3 on a task set requiring correct sequencing of information retrieval, tool use, and question answering.
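The context-distillation step lends itself to a small illustration: predictions made with the hint in context become soft targets for the hint-free model, so the hint ends up in the weights. The numpy sketch below uses dummy logits in place of an LLM; all names are illustrative assumptions rather than the paper's implementation.

    # Minimal numpy sketch of context distillation: match the hint-free
    # predictions to the hint-conditioned ones. Dummy logits stand in for
    # an LLM; names are illustrative assumptions.
    import numpy as np

    def softmax(logits):
        z = logits - logits.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    def context_distillation_loss(logits_with_hint, logits_without_hint):
        # KL(teacher || student): teacher saw prompt + hint, student saw
        # prompt only, so minimizing this internalizes the hint.
        p = softmax(logits_with_hint)
        q = softmax(logits_without_hint)
        return np.sum(p * (np.log(p) - np.log(q)), axis=-1).mean()

    rng = np.random.default_rng(0)
    teacher = rng.normal(size=(4, 10))  # 4 token positions, vocab of 10
    student = rng.normal(size=(4, 10))
    print(context_distillation_loss(teacher, student))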
Amortized Bayesian Experimental Design for Decision-Making
Huang, Daolang, Guo, Yujia, Acerbi, Luigi, Kaski, Samuel
Many critical decisions, such as personalized medical diagnoses and product pricing, are made based on insights gained from designing, observing, and analyzing a series of experiments. This highlights the crucial role of experimental design, which not only collects information on system parameters, as in traditional Bayesian experimental design (BED), but also plays a key part in facilitating downstream decision-making. Most recent BED methods use an amortized policy network to rapidly design experiments. However, the information gathered through these methods is suboptimal for downstream decision-making, as the experiments are not inherently designed with downstream objectives in mind. In this paper, we present an amortized decision-aware BED framework that prioritizes maximizing downstream decision utility. We introduce a novel architecture, the Transformer Neural Decision Process (TNDP), capable of instantly proposing the next experimental design while inferring the downstream decision, thus effectively amortizing both tasks within a unified workflow. We demonstrate the performance of our method across several tasks, showing that it can deliver informative designs and facilitate accurate decision-making.
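As a toy illustration of the decision-aware objective, the sketch below scores a candidate design by the expected utility of the best downstream decision under the posterior that the design induces, rather than by information gain alone. The utility matrix and posterior samples are invented for illustration.

    # Toy numpy sketch of scoring designs by downstream decision utility;
    # the utility matrix and posterior samples are illustrative assumptions.
    import numpy as np

    def expected_decision_utility(posterior_samples, utility):
        # utility[d, k]: payoff of decision d when the latent parameter
        # falls in outcome bin k; posterior_samples: bin indices drawn from
        # the posterior implied by a candidate design's simulated data.
        mean_utility = utility[:, posterior_samples].mean(axis=1)
        return mean_utility.max()  # utility of the best downstream decision

    rng = np.random.default_rng(1)
    utility = np.array([[1.0, -1.0], [-0.5, 0.5]])  # 2 decisions x 2 outcomes
    design_a = rng.integers(0, 2, size=500)  # nearly uninformative posterior
    design_b = np.zeros(500, dtype=int)      # posterior concentrated on bin 0
    print(expected_decision_utility(design_a, utility),
          expected_decision_utility(design_b, utility))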
Towards modeling evolving longitudinal health trajectories with a transformer-based deep learning model
Moen, Hans, Raj, Vishnu, Vabalas, Andrius, Perola, Markus, Kaski, Samuel, Ganna, Andrea, Marttinen, Pekka
Health registers contain rich information about individuals' health histories. Here our interest lies in understanding how individuals' health trajectories evolve in a nationwide longitudinal dataset with coded features, such as clinical codes, procedures, and drug purchases. We introduce a straightforward approach for training a Transformer-based deep learning model in a way that lets us analyze how individuals' trajectories change over time. This is achieved by modifying the training objective and by applying a causal attention mask. We focus here on the general task of predicting the onset of a range of common diseases in a given future forecast interval. However, instead of providing a single prediction about diagnoses that could occur in this forecast interval, our approach enables the model to provide continuous predictions at every time point up until, and conditioned on, the time of the forecast interval. We find that this model performs comparably to other models, including a bi-directional transformer model, in terms of basic prediction performance, while at the same time offering promising trajectory modeling properties. We explore a couple of ways to use this model for analyzing health trajectories and aiding in the early detection of events that forecast possible later disease onsets. We hypothesize that this method may be helpful for continuously monitoring people's health trajectories and enabling interventions in ongoing trajectories, as well as for retrospective analyses.
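The causal attention mask mentioned above is standard decoder-style masking: position t may attend only to positions up to t, which is what makes a valid prediction available at every time point of the trajectory. A minimal numpy sketch, with illustrative shapes and names:

    # Minimal sketch of a causal attention mask; shapes and names are
    # illustrative, not the paper's implementation.
    import numpy as np

    def causal_mask(seq_len):
        # mask[i, j] is True where attention from position i to j is allowed.
        return np.tril(np.ones((seq_len, seq_len), dtype=bool))

    def masked_attention_scores(scores, mask):
        # Disallowed positions get -inf before the softmax, as in a
        # decoder-style transformer.
        return np.where(mask, scores, -np.inf)

    scores = np.zeros((4, 4))
    print(masked_attention_scores(scores, causal_mask(4)))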
Proxy-informed Bayesian transfer learning with unknown sources
Sloman, Sabina J., Martinelli, Julien, Kaski, Samuel
Generalization outside the scope of one's training data requires leveraging prior knowledge about the effects that transfer, and the effects that don't, between different data sources. Bayesian transfer learning is a principled paradigm for specifying this knowledge, and refining it on the basis of data from the source (training) and target (prediction) tasks. We address the challenging transfer learning setting where the learner (i) cannot fine-tune in the target task, and (ii) does not know which source data points correspond to the same task (i.e., the data sources are unknown). We propose a proxy-informed robust method for probabilistic transfer learning (PROMPT), which provides a posterior predictive estimate tailored to the structure of the target task, without requiring the learner to have access to any outcome information from the target task. Instead, PROMPT relies on the availability of proxy information. PROMPT uses the same proxy information for two purposes: (i) estimation of effects specific to the target task, and (ii) construction of a robust reweighting of the source data for estimation of effects that transfer between tasks. We provide theoretical results on the effect of this reweighting on the risk of negative transfer, and demonstrate the application of PROMPT in two synthetic settings.
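The abstract does not spell out PROMPT's reweighting, so the following is only a heavily hedged, generic sketch of the underlying idea: upweight source data whose proxy information resembles the target's. The Gaussian kernel and all names are assumptions, not the paper's method.

    # Generic sketch of proxy-based reweighting of source data; the kernel
    # choice and all names are illustrative assumptions.
    import numpy as np

    def proxy_weights(source_proxies, target_proxy, bandwidth=1.0):
        # Upweight source points whose proxy features resemble the target's;
        # the normalized weights can then form a weighted posterior predictive.
        sq_dist = np.sum((source_proxies - target_proxy) ** 2, axis=1)
        w = np.exp(-sq_dist / (2.0 * bandwidth ** 2))
        return w / w.sum()

    rng = np.random.default_rng(2)
    source = rng.normal(size=(100, 3))  # proxy features of unknown-source data
    target = np.zeros(3)                # proxy information for the target task
    print(proxy_weights(source, target)[:5])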
Amortized Probabilistic Conditioning for Optimization, Simulation and Inference
Chang, Paul E., Loka, Nasrulloh, Huang, Daolang, Remes, Ulpu, Kaski, Samuel, Acerbi, Luigi
Amortized meta-learning methods based on pre-training have propelled fields like natural language processing and vision. Transformer-based neural processes and their variants are leading models for probabilistic meta-learning with a tractable objective. Often trained on synthetic data, these models implicitly capture essential latent information in the data-generation process. However, existing methods do not allow users to flexibly inject (condition on) and extract (predict) this probabilistic latent information at runtime, which is key to many tasks. We introduce the Amortized Conditioning Engine (ACE), a new transformer-based meta-learning model that explicitly represents latent variables of interest. ACE affords conditioning on both observed data and interpretable latent variables, allows the inclusion of priors at runtime, and outputs predictive distributions for discrete and continuous data and latents. We demonstrate ACE's modeling flexibility and performance on diverse tasks such as image completion and classification, Bayesian optimization, and simulation-based inference.
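As a schematic picture of the interface ACE describes, one can think of latent variables as explicit tokens placed alongside data tokens, so that either can be conditioned on or predicted. The flat token encoding below is an illustrative assumption, not the actual architecture.

    # Schematic sketch of mixing data tokens and latent tokens in one
    # context set; the encoding is an illustrative assumption.
    import numpy as np

    def build_context(x_obs, y_obs, latent_values, latent_mask):
        # Each row is one token: [is_latent, location/id, value, observed?].
        data_tokens = np.column_stack([
            np.zeros(len(x_obs)), x_obs, y_obs, np.ones(len(x_obs))])
        latent_tokens = np.column_stack([
            np.ones(len(latent_values)), np.arange(len(latent_values)),
            latent_values, latent_mask])  # mask=1: condition on; 0: predict
        return np.vstack([data_tokens, latent_tokens])

    ctx = build_context(np.array([0.1, 0.5]), np.array([1.0, 2.0]),
                        latent_values=np.array([0.0, 3.0]),
                        latent_mask=np.array([0.0, 1.0]))
    print(ctx)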
LoKO: Low-Rank Kalman Optimizer for Online Fine-Tuning of Large Models
Abdi, Hossein, Sun, Mingfei, Zhang, Andi, Kaski, Samuel, Pan, Wei
Training large models with millions or even billions of parameters from scratch incurs substantial computational costs. Parameter-Efficient Fine-Tuning (PEFT) methods, particularly Low-Rank Adaptation (LoRA), address this challenge by adapting only a reduced number of parameters to specific tasks with gradient-based optimizers. In this paper, we cast PEFT as an optimal filtering/state estimation problem and present the Low-Rank Kalman Optimizer (LoKO) to estimate the optimal trainable parameters in an online manner. We leverage the low-rank decomposition in LoRA to significantly reduce matrix sizes in the Kalman iterations, and further capitalize on a diagonal approximation of the covariance matrix to effectively decrease computational complexity from quadratic to linear in the number of trainable parameters. Moreover, we discovered that the initialization of the covariance matrix within the Kalman algorithm and the accurate estimation of the observation noise covariance are key to this formulation, and we propose robust approaches that work well across a vast range of well-established computer vision and language models. Our results show that LoKO converges with fewer iterations and yields better-performing models than commonly used optimizers with LoRA on both image classification and language tasks. Our study opens up the possibility of leveraging the Kalman filter as an effective optimizer for the online fine-tuning of large models.
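To make the filtering view concrete, the sketch below shows a diagonal-covariance Kalman update applied to a parameter vector, treating the model output as a noisy observation of the target; this is the standard Kalman-as-optimizer formulation under a diagonal approximation, with illustrative names rather than LoKO's implementation.

    # Diagonal-covariance Kalman update for a parameter vector theta;
    # a generic sketch of the formulation, not LoKO's code.
    import numpy as np

    def diagonal_kalman_update(theta, P_diag, grad_out, residual, r_obs):
        # grad_out: d(model output)/d(theta), the observation row H;
        # P_diag: diagonal approximation of the parameter covariance;
        # r_obs: estimated observation-noise covariance (scalar here).
        s = np.sum(grad_out * P_diag * grad_out) + r_obs  # innovation variance
        k = P_diag * grad_out / s                         # Kalman gain
        theta_new = theta + k * residual                  # correct parameters
        P_new = P_diag - k * grad_out * P_diag            # diag of (I - KH)P
        return theta_new, P_new

    theta = np.zeros(4)
    P = np.ones(4) * 0.1  # covariance initialization matters, per the paper
    H = np.array([1.0, 0.5, 0.0, -0.5])
    theta, P = diagonal_kalman_update(theta, P, H, residual=0.8, r_obs=1.0)
    print(theta, P)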
Identifying latent disease factors differently expressed in patient subgroups using group factor analysis
Ferreira, Fabio S., Ashburner, John, Bouzigues, Arabella, Suksasilp, Chatrin, Russell, Lucy L., Foster, Phoebe H., Ferry-Bolder, Eve, van Swieten, John C., Jiskoot, Lize C., Seelaar, Harro, Sanchez-Valle, Raquel, Laforce, Robert, Graff, Caroline, Galimberti, Daniela, Vandenberghe, Rik, de Mendonca, Alexandre, Tiraboschi, Pietro, Santana, Isabel, Gerhard, Alexander, Levin, Johannes, Sorbi, Sandro, Otto, Markus, Pasquier, Florence, Ducharme, Simon, Butler, Chris R., Ber, Isabelle Le, Finger, Elizabeth, Tartaglia, Maria C., Masellis, Mario, Rowe, James B., Synofzik, Matthis, Moreno, Fermin, Borroni, Barbara, Kaski, Samuel, Rohrer, Jonathan D., Mourao-Miranda, Janaina
The heterogeneity of neurological and mental health disorders has been a key confound to disease understanding, treatment development and outcome prediction, as patient populations are thought to include multiple disease pathways that selectively respond to treatment (Kapur et al., 2012). These challenges are reflected in poor treatment outcomes; for instance, in depression, only approximately 40% of patients remit after first-line antidepressant treatment or psychotherapy (Amick et al., 2015; Cuijpers et al., 2014; Fava and Davidson, 1996; Trivedi et al., 2006). Diagnostic categories in psychiatry have historically been defined based on signs and symptoms, prioritising diagnostic agreement between clinicians rather than underlying biological mechanisms (Freedman et al., 2013; Robins and Guze, 1970). As a result, the usefulness of supervised machine learning methods as diagnostic tools for mental health disorders (i.e., classifying patients vs. healthy controls) is questionable, as they may simply inherit the flaws of current diagnostic categories. Additional challenges in neurological and mental health disorders are comorbidity (i.e., individuals with one disorder often develop another disorder during their lifespan) and the fact that different disorders can share similar symptoms (Kessler et al., 2005). To address the limitations of current diagnostic categories in psychiatry, the National Institute of Mental Health launched the Research Domain Criteria (RDoC) framework in 2009 (https://www.nimh.nih.gov/research/research-funded-by-nimh/rdoc) as an attempt to move beyond diagnostic categories and ground psychiatry within neurobiological constructs that combine multiple levels of measures or sources of information (Insel et al., 2010). Multivariate methods that do not rely on diagnostic categories, such as Canonical Correlation Analysis (CCA) and related methods, have been widely used to uncover latent disease dimensions capturing associations between brain imaging and non-imaging data (e.g., self-report questionnaires, cognitive tests and genetics). The identified latent dimensions provide information on how a set of non-imaging features (e.g.
Cost-aware Simulation-based Inference
Bharti, Ayush, Huang, Daolang, Kaski, Samuel, Briol, François-Xavier
Simulation-based inference (SBI) is the preferred framework for estimating parameters of intractable models in science and engineering. A significant challenge in this context is the large computational cost of simulating data from complex models, and the fact that this cost often depends on the parameter values. We therefore propose cost-aware SBI methods, which can significantly reduce the cost of existing sampling-based SBI methods, such as neural SBI and approximate Bayesian computation. This is achieved through a combination of rejection and self-normalised importance sampling, which significantly reduces the number of expensive simulations needed. Our approach is studied extensively on models from epidemiology to telecommunications engineering, where we obtain significant reductions in the overall cost of inference.
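Self-normalised importance sampling, which the method builds on, admits a compact illustration: draw parameters from a cheaper proposal q instead of the prior p and reweight by p/q. The densities below are illustrative assumptions, not the paper's models.

    # Self-normalised importance sampling (SNIS) sketch; the target,
    # proposal, and integrand are illustrative assumptions.
    import numpy as np

    def snis_estimate(theta, f_values, log_p, log_q):
        # E_p[f] ~ sum(w_i f_i) / sum(w_i), with w_i = p(theta_i)/q(theta_i);
        # sampling from q lets us avoid expensive-to-simulate regions.
        log_w = log_p(theta) - log_q(theta)
        w = np.exp(log_w - log_w.max())  # stabilised weights
        return np.sum(w * f_values) / np.sum(w)

    rng = np.random.default_rng(3)
    theta = rng.normal(loc=1.0, scale=1.0, size=5000)  # draws from proposal q
    log_q = lambda t: -0.5 * (t - 1.0) ** 2            # N(1, 1), up to a constant
    log_p = lambda t: -0.5 * t ** 2                    # N(0, 1), up to a constant
    print(snis_estimate(theta, theta ** 2, log_p, log_q))  # E_p[theta^2] = 1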
Open Ad Hoc Teamwork with Cooperative Game Theory
Wang, Jianhong, Li, Yang, Zhang, Yuan, Pan, Wei, Kaski, Samuel
Ad hoc teamwork poses a challenging problem, requiring the design of an agent that can collaborate with teammates without prior coordination or joint training. Open ad hoc teamwork (OAHT) further complicates this challenge by considering environments with a changing number of teammates, referred to as open teams. One promising practical solution to this problem leverages the generalizability of graph neural networks to handle an unrestricted number of agents with various agent types, an approach known as graph-based policy learning (GPL). However, GPL's joint Q-value representation over a coordination graph lacks a convincing explanation. In this paper, we establish a new theory to understand the representation of the joint Q-value for OAHT and its learning paradigm through the lens of cooperative game theory. Building on our theory, we propose a novel algorithm named CIAO, based on GPL's framework, with additional provable implementation tricks that can facilitate learning. Demos of the experimental results are available at https://sites.google.com/view/ciao2024, and the experiment code is published at https://github.com/hsvgbkhgbv/CIAO.
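The coordination-graph joint Q-value that GPL uses, and that the paper reinterprets through cooperative game theory, is a sum of individual and pairwise utilities over the current team; openness amounts to these arrays growing or shrinking as agents join or leave. A minimal numpy sketch with random stand-ins for the learned networks:

    # Coordination-graph joint Q-value sketch; random utilities stand in
    # for learned networks, and all names are illustrative.
    import numpy as np

    def joint_q_value(q_indiv, q_pair, actions):
        # q_indiv[i, a]: individual utility of agent i taking action a;
        # q_pair[i, j, a, b]: pairwise utility for agents i and j.
        n = len(actions)
        total = sum(q_indiv[i, actions[i]] for i in range(n))
        total += sum(q_pair[i, j, actions[i], actions[j]]
                     for i in range(n) for j in range(i + 1, n))
        return total

    rng = np.random.default_rng(4)
    n_agents, n_actions = 3, 2
    q_i = rng.normal(size=(n_agents, n_actions))
    q_ij = rng.normal(size=(n_agents, n_agents, n_actions, n_actions))
    print(joint_q_value(q_i, q_ij, actions=[0, 1, 1]))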