pfn
Transformers Can Learn Posterior Predictive Distributions In-Context
Kang, Gyeonghun, Lee, Changwoo J., Cheng, Xiang
Prior-data fitted networks (PFNs) have recently emerged as a powerful approach for Bayesian prediction tasks, approximating the posterior predictive distribution (PPD) through in-context learning. Despite their strong empirical performance and ability to go beyond point predictions, theoretical understandings of the algorithmic capability of transformers to learn distributions in context are still lacking. Focusing on Gaussian process regression problems, we show by construction that transformers can implement a gradient descent algorithm targeting the posterior predictive mean and variance, followed by nonlinear mappings that yield binned probabilities of PPD. We study the error bounds of the approximated PPD in terms of attention depth and bin resolution. Based on these results, we further demonstrate the key role of normalization and the choice of attention depth in enabling the extrapolation abilities of transformers beyond the pretraining sample size range. We conduct simulations that corroborate our findings, providing insight into the expressivity of PFNs targeting PPDs and how architectural choices may influence generalization capabilities.
Decoupled PFNs: Identifiable Epistemic-Aleatoric Decomposition via Structured Synthetic Priors
Bergna, Richard, Depeweg, Stefan, Hernández-Lobato, José Miguel
Prior-Fitted Networks (PFNs) amortize Bayesian prediction by meta-learning over a synthetic task prior, but their standard output is a posterior predictive distribution over noisy observations. For sequential decision-making, such as active learning and Bayesian optimization, acquisition should prioritize epistemic uncertainty about the latent signal rather than irreducible aleatoric observation noise. We show that this epistemic--aleatoric split is not identifiable in general from the posterior predictive distribution alone, even when that distribution is known exactly. We then exploit a distinctive advantage of PFNs: because the synthetic data-generating process is under our control, each task can contain an explicit latent signal and noise function, and the generator can provide query-level labels for both the noiseless target and the observation-noise variance. We use these labels to train a decoupled PFN with separate latent-signal and aleatoric heads. The observation-level predictive is induced by convolving the latent signal distribution with the learned noise model. Empirically, epistemic-only acquisition mitigates the failure mode of total-variance exploration in noisy and heteroscedastic settings. In matched comparisons, decoupled models usually improve over tuned observation-level baselines, with the clearest gains in HPO; in broader sweeps, a decoupled model obtains the best average rank in both HPO and synthetic BO.
TuneTables: Context Optimization for Scalable Prior-Data Fitted Networks
While tabular classification has traditionally relied on from-scratch training, a recent breakthrough called prior-data fitted networks (PFNs) challenges this approach. Similar to large language models, PFNs make use of pretraining and in-context learning to achieve strong performance on new tasks in a single forward pass. However, current PFNs have limitations that prohibit their widespread adoption. Notably, TabPFN achieves very strong performance on small tabular datasets but is not designed to make predictions for datasets of size larger than 1000. In this work, we overcome these limitations and substantially improve the performance of PFNs via context optimization. We introduce TuneTables, a parameter-efficient fine-tuning strategy for PFNs that compresses large datasets into a smaller learned context. We conduct extensive experiments on nineteen algorithms over 98 datasets and find that TuneTables achieves the best performance on average, outperforming boosted trees such as CatBoost, while optimizing fewer than 5\% of TabPFN's parameters. Furthermore, we show that TuneTables can be used as an interpretability tool and can even be used to mitigate biases by optimizing a fairness objective.
Decoupled-Value Attention for Prior-Data Fitted Networks: GP Inference for Physical Equations
Sharma, Kaustubh, Singh, Simardeep, Pareek, Parikshit
Prior-data fitted networks (PFNs) are a promising alternative to time-consuming Gaussian process (GP) inference for creating fast surrogates of physical systems. PFN reduces the computational burden of GP-training by replacing Bayesian inference in GP with a single forward pass of a learned prediction model. We introduce Decoupled-V alue Attention (DV A)- motivated by the GP property that the function space is fully characterized by the kernel over inputs and the predictive mean is a weighted sum of training targets. DV A computes similarities from inputs only and propagates labels solely through values. Thus, the proposed DV A mirrors the GP update while remaining kernel-free. We demonstrate that PFNs are backbone architecture invariant and the crucial factor for scaling PFNs is the attention rule rather than the architecture itself. Specifically, our results demonstrate that (a) localized attention consistently reduces out-of-sample validation loss in PFNs across different dimensional settings, with validation loss reduced by more than 50% in five-and ten-dimensional cases, and (b) the role of attention is more decisive than the choice of backbone architecture, showing that CNN, RNN and LSTM-based PFNs can perform at par with their Transformer-based counterparts. Bayesian inference provides a powerful framework for reasoning under uncertainty, with methods like Gaussian processes (GPs) offering well-calibrated predictions and principled uncertainty estimates (Williams & Rasmussen, 2006). However, the practical application of these methods is often hindered by the heavy computational burden of learning kernel hyperparameters. For example, exact GP inference scales cubically with the number of data points, making its deployment infeasible for large datasets or problems requiring repeated training. Consider a physical system where a surrogate GP is chosen due to its uncertainty estimates and differentiable closed-form expressions. However, the underlying input dataset and configuration changes frequently, and the surrogate is supposed to work for these new, previously unseen variations. In such conditions, GP needs to be trained repeatedly, incurring significant computing cost, each time the dataset changes.
Foundation Models for Causal Inference via Prior-Data Fitted Networks
Ma, Yuchen, Frauen, Dennis, Javurek, Emil, Feuerriegel, Stefan
Prior-data fitted networks (PFNs) have recently been proposed as a promising way to train tabular foundation models. PFNs are transformers that are pre-trained on synthetic data generated from a prespecified prior distribution and that enable Bayesian inference through in-context learning. In this paper, we introduce CausalFM, a comprehensive framework for training PFN-based foundation models in various causal inference settings. First, we formalize the construction of Bayesian priors for causal inference based on structural causal models (SCMs) in a principled way and derive necessary criteria for the validity of such priors. Building on this, we propose a novel family of prior distributions using causality-inspired Bayesian neural networks that enable CausalFM to perform Bayesian causal inference in various settings, including for back-door, front-door, and instrumental variable adjustment. Finally, we instantiate CausalFM and explicitly train models to perform in-context learning in these settings. We show that CausalFM achieves competitive in-context learning performance even when compared to baselines that are specifically trained for the task at hand. In sum, our framework can be used as a general recipe to train foundation models for various causal inference settings. In contrast to the current state-of-the-art in causal inference, CausalFM offers a novel paradigm with the potential to fundamentally change how practitioners perform causal inference in medicine, economics, and other disciplines.
In-Context Decision Making for Optimizing Complex AutoML Pipelines
Balef, Amir Rezaei, Eggensperger, Katharina
Combined Algorithm Selection and Hyperparameter Optimization (CASH) has been fundamental to traditional AutoML systems. However, with the advancements of pre-trained models, modern ML workflows go beyond hyperparameter optimization and often require fine-tuning, ensembling, and other adaptation techniques. While the core challenge of identifying the best-performing model for a downstream task remains, the increasing heterogeneity of ML pipelines demands novel AutoML approaches. This work extends the CASH framework to select and adapt modern ML pipelines. We propose PS-PFN to efficiently explore and exploit adapting ML pipelines by extending Posterior Sampling (PS) to the max k -armed bandit problem setup. PS-PFN leverages prior-data fitted networks (PFNs) to efficiently estimate the posterior distribution of the maximal value via in-context learning. We show how to extend this method to consider varying costs of pulling arms and to use different PFNs to model reward distributions individually per arm. Experimental results on one novel and two existing standard benchmark tasks demonstrate the superior performance of PS-PFN compared to other bandit and AutoML strategies.
Clustering by Attention: Leveraging Prior Fitted Transformers for Data Partitioning
Shokry, Ahmed, Khalafallah, Ayman
Clustering is a core task in machine learning with wide-ranging applications in data mining and pattern recognition. However, its unsupervised nature makes it inherently challenging. Many existing clustering algorithms suffer from critical limitations: they often require careful parameter tuning, exhibit high computational complexity, lack interpretability, or yield suboptimal accuracy, especially when applied to large-scale datasets. In this paper, we introduce a novel clustering approach based on meta-learning. Our approach eliminates the need for parameter optimization while achieving accuracy that outperforms state-of-the-art clustering techniques. The proposed technique leverages a few pre-clustered samples to guide the clustering process for the entire dataset in a single forward pass. Specifically, we employ a pre-trained Prior-Data Fitted Transformer Network (PFN) to perform clustering. The algorithm computes attention between the pre-clustered samples and the unclustered samples, allowing it to infer cluster assignments for the entire dataset based on the learned relation. We theoretically and empirically demonstrate that, given just a few pre-clustered examples, the model can generalize to accurately cluster the rest of the dataset. Experiments on challenging benchmark datasets show that our approach can successfully cluster well-separated data without any pre-clustered samples, and significantly improves performance when a few clustered samples are provided. We show that our approach is superior to the state-of-the-art techniques. These results highlight the effectiveness and scalability of our approach, positioning it as a promising alternative to existing clustering techniques.
Position: The Future of Bayesian Prediction Is Prior-Fitted
Müller, Samuel, Reuter, Arik, Hollmann, Noah, Rügamer, David, Hutter, Frank
Training neural networks on randomly generated artificial datasets yields Bayesian models that capture the prior defined by the dataset-generating distribution. Prior-data Fitted Networks (PFNs) are a class of methods designed to leverage this insight. In an era of rapidly increasing computational resources for pre-training and a near stagnation in the generation of new real-world data in many applications, PFNs are poised to play a more important role across a wide range of applications. They enable the efficient allocation of pre-training compute to low-data scenarios. Originally applied to small Bayesian modeling tasks, the field of PFNs has significantly expanded to address more complex domains and larger datasets. This position paper argues that PFNs and other amortized inference approaches represent the future of Bayesian inference, leveraging amortized learning to tackle data-scarce problems. We thus believe they are a fruitful area of research. In this position paper, we explore their potential and directions to address their current limitations.
TuneTables: Context Optimization for Scalable Prior-Data Fitted Networks
While tabular classification has traditionally relied on from-scratch training, a recent breakthrough called prior-data fitted networks (PFNs) challenges this approach. Similar to large language models, PFNs make use of pretraining and in-context learning to achieve strong performance on new tasks in a single forward pass. However, current PFNs have limitations that prohibit their widespread adoption. Notably, TabPFN achieves very strong performance on small tabular datasets but is not designed to make predictions for datasets of size larger than 1000. In this work, we overcome these limitations and substantially improve the performance of PFNs via context optimization.
Uncertainty Quantification for Prior-Data Fitted Networks using Martingale Posteriors
Nagler, Thomas, Rügamer, David
Prior-data fitted networks (PFNs) have emerged as promising foundation models for prediction from tabular data sets, achieving state-of-the-art performance on small to moderate data sizes without tuning. While PFNs are motivated by Bayesian ideas, they do not provide any uncertainty quantification for predictive means, quantiles, or similar quantities. We propose a principled and efficient sampling procedure to construct Bayesian posteriors for such estimates based on Martingale posteriors, and prove its convergence. Several simulated and real-world data examples showcase the uncertainty quantification of our method in inference applications.