Cranmer, Miles
The Well: a Large-Scale Collection of Diverse Physics Simulations for Machine Learning
Ohana, Ruben, McCabe, Michael, Meyer, Lucas, Morel, Rudy, Agocs, Fruzsina J., Beneitez, Miguel, Berger, Marsha, Burkhart, Blakesley, Dalziel, Stuart B., Fielding, Drummond B., Fortunato, Daniel, Goldberg, Jared A., Hirashima, Keiya, Jiang, Yan-Fei, Kerswell, Rich R., Maddu, Suryanarayana, Miller, Jonah, Mukhopadhyay, Payel, Nixon, Stefan S., Shen, Jeff, Watteaux, Romain, Blancard, Bruno Régaldo-Saint, Rozet, François, Parker, Liam H., Cranmer, Miles, Ho, Shirley
Machine learning based surrogate models offer researchers powerful tools for accelerating simulation-based workflows. However, as standard datasets in this space often cover small classes of physical behavior, it can be difficult to evaluate the efficacy of new approaches. To address this gap, we introduce the Well: a large-scale collection of datasets containing numerical simulations of a wide variety of spatiotemporal physical systems. The Well draws from domain experts and numerical software developers to provide 15 TB of data across 16 datasets covering diverse domains such as biological systems, fluid dynamics, acoustic scattering, as well as magneto-hydrodynamic simulations of extra-galactic fluids or supernova explosions. These datasets can be used individually or as part of a broader benchmark suite. To facilitate usage of the Well, we provide a unified PyTorch interface for training and evaluating models. We demonstrate the function of this library by introducing example baselines that highlight the new challenges posed by the complex dynamics of the Well.
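A minimal sketch of loading one of these datasets through the unified PyTorch interface, assuming class and argument names along the lines of the project's public documentation (the exact API may differ by version):

    # Hedged sketch: loading one Well dataset through the unified PyTorch
    # interface. Names follow the project's documentation at the time of
    # writing and are assumptions, not guarantees.
    from torch.utils.data import DataLoader
    from the_well.data import WellDataset  # assumed import path

    train_set = WellDataset(
        well_base_path="hf://datasets/polymathic-ai/",  # stream from Hugging Face
        well_dataset_name="active_matter",              # one of the 16 datasets
        well_split_name="train",
    )
    train_loader = DataLoader(train_set, batch_size=4, shuffle=True)

    for batch in train_loader:
        # Each batch bundles input/target spatiotemporal fields as tensors;
        # the exact keys depend on the library version.
        break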
SymbolFit: Automatic Parametric Modeling with Symbolic Regression
Tsoi, Ho Fung, Rankin, Dylan, Caillol, Cecile, Cranmer, Miles, Dasu, Sridhara, Duarte, Javier, Harris, Philip, Lipeles, Elliot, Loncar, Vladimir
We introduce SymbolFit, a framework that automates parametric modeling by using symbolic regression to perform a machine-driven search for functions that fit the data, while simultaneously providing uncertainty estimates in a single run. Traditionally, constructing a parametric model to accurately describe binned data has been a manual and iterative process, requiring an adequate functional form to be determined before the fit can be performed. The main challenge arises when the appropriate functional forms cannot be derived from first principles, especially when there is no underlying true closed-form function for the distribution. In this work, we address this problem by utilizing symbolic regression, a machine learning technique that explores a vast space of candidate functions without needing a predefined functional form, treating the functional form itself as a trainable parameter. Our approach is demonstrated in data analysis applications in high-energy physics experiments at the CERN Large Hadron Collider (LHC). We demonstrate its effectiveness and efficiency using five real proton-proton collision datasets from new physics searches at the LHC, namely the background modeling in resonance searches for high-mass dijet, trijet, paired-dijet, diphoton, and dimuon events. We also validate the framework using several toy datasets with one or more variables.
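SymbolFit's own API is not reproduced here, but the underlying idea can be sketched with PySR (one of its symbolic regression backends): search for a functional form over binned data while weighting each bin by its inverse variance, approximating a chi-square fit. All data below are toy values:

    import numpy as np
    from pysr import PySRRegressor

    # Toy binned spectrum: bin centers x, counts y, Poisson-like uncertainties.
    rng = np.random.default_rng(0)
    x = np.linspace(1.0, 10.0, 40).reshape(-1, 1)
    y = 500.0 * np.exp(-0.6 * x[:, 0]) + rng.normal(0.0, 2.0, 40)
    sigma = np.sqrt(np.abs(y)) + 1.0

    model = PySRRegressor(
        niterations=40,
        binary_operators=["+", "-", "*", "/", "^"],
        unary_operators=["exp", "log"],
    )
    # PySR accepts per-sample weights; weighting by 1/sigma^2 approximates
    # a chi-square objective over the bins.
    model.fit(x, y, weights=1.0 / sigma**2)
    print(model.equations_[["complexity", "loss", "equation"]])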
Contextual Counting: A Mechanistic Study of Transformers on a Quantitative Task
Golkar, Siavash, Bietti, Alberto, Pettee, Mariel, Eickenberg, Michael, Cranmer, Miles, Hirashima, Keiya, Krawezik, Geraud, Lourie, Nicholas, McCabe, Michael, Morel, Rudy, Ohana, Ruben, Parker, Liam Holden, Blancard, Bruno Régaldo-Saint, Cho, Kyunghyun, Ho, Shirley
Transformers have revolutionized machine learning across diverse domains, yet understanding their behavior remains crucial, particularly in high-stakes applications. This paper introduces the contextual counting task, a novel toy problem aimed at enhancing our understanding of Transformers in quantitative and scientific contexts. This task requires precise localization and computation within datasets, akin to object detection or region-based scientific analysis. We present theoretical and empirical analysis using both causal and non-causal Transformer architectures, investigating the influence of various positional encodings on performance and interpretability. In particular, we find that causal attention is much better suited for the task, and that no positional embeddings lead to the best accuracy, though rotary embeddings are competitive and easier to train. We also show that out-of-distribution performance is tightly linked to which tokens the model uses as a bias term.
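The abstract does not spell out the exact token vocabulary or region structure, so the sketch below is one plausible instantiation of a contextual counting example, not the paper's specification: a binary sequence with a single delimited region whose 1s must be counted.

    import numpy as np

    # Illustrative construction (assumed, not the paper's exact task spec):
    # tokens are 0, 1, and region delimiters "[" / "]"; the target is the
    # number of 1s inside the delimited region.
    rng = np.random.default_rng(0)

    def make_example(length=30):
        seq = rng.integers(0, 2, size=length).tolist()
        i, j = sorted(rng.choice(length, size=2, replace=False))
        target = sum(seq[i:j])   # count of 1s inside the region
        seq.insert(j, "]")       # insert the closing delimiter first
        seq.insert(i, "[")
        return seq, target

    seq, target = make_example()
    print(seq, "->", target)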
Reusability report: Prostate cancer stratification with diverse biologically-informed neural architectures
Pedersen, Christian, Tesileanu, Tiberiu, Wu, Tinghui, Golkar, Siavash, Cranmer, Miles, Zhang, Zijun, Ho, Shirley
In Elmarakeby et al., "Biologically informed deep neural network for prostate cancer discovery", a feedforward neural network with biologically informed, sparse connections (P-NET) was presented to model the state of prostate cancer. We verified the reproducibility of the study conducted by Elmarakeby et al., using both their original codebase and our own re-implementation using more up-to-date libraries. We quantified the contribution of network sparsification by Reactome biological pathways, and confirmed its importance to P-NET's superior performance. Furthermore, we explored alternative neural architectures and approaches to incorporating biological information into the networks. We experimented with three types of graph neural networks on the same training data, and investigated the clinical prediction agreement between different models. Our analyses demonstrated that deep neural networks with distinct architectures make incorrect predictions for individual patients that persist across different initializations of a specific neural architecture. This suggests that different neural architectures are sensitive to different aspects of the data, an important yet under-explored challenge for clinical prediction tasks.
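As a sketch of the core mechanism behind P-NET-style architectures (an illustration of the idea, not the original code), a linear layer can be sparsified by a fixed binary mask encoding, e.g., gene-to-pathway membership:

    import torch
    import torch.nn as nn

    class MaskedLinear(nn.Module):
        """Linear layer whose connections are constrained by a fixed binary
        mask, e.g. a gene-to-pathway membership matrix (sketch of the sparse,
        biologically informed connectivity idea; not P-NET's exact code)."""
        def __init__(self, mask: torch.Tensor):
            super().__init__()
            out_features, in_features = mask.shape
            self.register_buffer("mask", mask.float())
            self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
            self.bias = nn.Parameter(torch.zeros(out_features))

        def forward(self, x):
            # Zero out connections absent from the pathway mask.
            return nn.functional.linear(x, self.weight * self.mask, self.bias)

    # Example: 100 genes feeding 10 Reactome-style pathways (mask is illustrative).
    mask = (torch.rand(10, 100) < 0.05).float()
    layer = MaskedLinear(mask)
    out = layer(torch.randn(4, 100))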
AstroCLIP: Cross-Modal Pre-Training for Astronomical Foundation Models
Lanusse, Francois, Parker, Liam, Golkar, Siavash, Cranmer, Miles, Bietti, Alberto, Eickenberg, Michael, Krawezik, Geraud, McCabe, Michael, Ohana, Ruben, Pettee, Mariel, Blancard, Bruno Régaldo-Saint, Tesileanu, Tiberiu, Cho, Kyunghyun, Ho, Shirley
We present AstroCLIP, a strategy to facilitate the construction of astronomical foundation models that bridge the gap between diverse observational modalities. We demonstrate that a cross-modal contrastive learning approach between images and optical spectra of galaxies yields highly informative embeddings of both modalities. In particular, we apply our method on multi-band images and optical spectra from the Dark Energy Spectroscopic Instrument (DESI), and show that: (1) these embeddings are well-aligned between modalities and can be used for accurate cross-modal searches, and (2) these embeddings encode valuable physical information about the galaxies -- in particular redshift and stellar mass -- that can be used to achieve competitive zero- and few-shot predictions without further finetuning. Additionally, in the process of developing our approach, we also construct a novel, transformer-based model and pretraining approach for processing galaxy spectra.
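The training objective is a standard cross-modal contrastive one; a generic CLIP-style symmetric InfoNCE loss over paired image and spectrum embeddings can be sketched as follows (the temperature value is illustrative, not taken from the paper):

    import torch
    import torch.nn.functional as F

    def clip_loss(image_emb, spectrum_emb, temperature=0.07):
        """Symmetric InfoNCE loss over a batch of paired galaxy images and
        spectra (generic CLIP-style objective; hyperparameters illustrative)."""
        image_emb = F.normalize(image_emb, dim=-1)
        spectrum_emb = F.normalize(spectrum_emb, dim=-1)
        logits = image_emb @ spectrum_emb.t() / temperature
        # Matched image/spectrum pairs lie on the diagonal.
        labels = torch.arange(logits.size(0), device=logits.device)
        return 0.5 * (F.cross_entropy(logits, labels)
                      + F.cross_entropy(logits.t(), labels))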
xVal: A Continuous Number Encoding for Large Language Models
Golkar, Siavash, Pettee, Mariel, Eickenberg, Michael, Bietti, Alberto, Cranmer, Miles, Krawezik, Geraud, Lanusse, Francois, McCabe, Michael, Ohana, Ruben, Parker, Liam, Blancard, Bruno Régaldo-Saint, Tesileanu, Tiberiu, Cho, Kyunghyun, Ho, Shirley
Large Language Models have not yet been broadly adapted for the analysis of scientific datasets due in part to the unique difficulties of tokenizing numbers. We propose xVal, a numerical encoding scheme that represents any real number using just a single token. xVal represents a given real number by scaling a dedicated embedding vector by the number value. Combined with a modified number-inference approach, this strategy renders the model end-to-end continuous when considered as a map from the numbers of the input string to those of the output string. This leads to an inductive bias that is generally more suitable for applications in scientific domains. We empirically evaluate our proposal on a number of synthetic and real-world datasets. Compared with existing number encoding schemes, we find that xVal is more token-efficient and demonstrates improved generalization.
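The encoding itself is compact enough to sketch directly from this description: every number maps to a shared [NUM] token whose embedding is multiplied by the number's value (value normalization and the modified number-inference head from the paper are omitted here):

    import torch
    import torch.nn as nn

    class XValEmbedding(nn.Module):
        """Sketch of the xVal idea: numbers share a single [NUM] token whose
        embedding vector is scaled by the numeric value."""
        def __init__(self, vocab_size, d_model, num_token_id):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d_model)
            self.num_token_id = num_token_id

        def forward(self, token_ids, values):
            # values holds the real number at [NUM] positions, 1.0 elsewhere.
            h = self.embed(token_ids)
            scale = torch.where(token_ids == self.num_token_id,
                                values, torch.ones_like(values))
            return h * scale.unsqueeze(-1)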
Multiple Physics Pretraining for Physical Surrogate Models
McCabe, Michael, Blancard, Bruno Régaldo-Saint, Parker, Liam Holden, Ohana, Ruben, Cranmer, Miles, Bietti, Alberto, Eickenberg, Michael, Golkar, Siavash, Krawezik, Geraud, Lanusse, Francois, Pettee, Mariel, Tesileanu, Tiberiu, Cho, Kyunghyun, Ho, Shirley
We introduce multiple physics pretraining (MPP), an autoregressive task-agnostic pretraining approach for physical surrogate modeling. MPP involves training large surrogate models to predict the dynamics of multiple heterogeneous physical systems simultaneously by learning features that are broadly useful across diverse physical tasks. In order to learn effectively in this setting, we introduce a shared embedding and normalization strategy that projects the fields of multiple systems into a single shared embedding space. We validate the efficacy of our approach on both pretraining and downstream tasks over a broad fluid mechanics-oriented benchmark. We show that a single MPP-pretrained transformer is able to match or outperform task-specific baselines on all pretraining sub-tasks without the need for finetuning. For downstream tasks, we demonstrate that finetuning MPP-trained models results in more accurate predictions across multiple time-steps on new physics compared to training from scratch or finetuning pretrained video foundation models. We open-source our code and model weights trained at multiple scales for reproducibility and community experimentation.
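One way to picture the shared embedding strategy (an illustration of the idea, not the released implementation) is to give each named physical field its own projection into a common latent channel space, so that systems with different state variables can feed the same backbone:

    import torch
    import torch.nn as nn

    class SharedFieldEmbedding(nn.Module):
        """Sketch: each named field (pressure, density, velocity components,
        ...) owns a 1x1 projection into a shared latent channel space."""
        def __init__(self, field_names, d_latent):
            super().__init__()
            self.proj = nn.ModuleDict(
                {name: nn.Conv2d(1, d_latent, kernel_size=1) for name in field_names}
            )

        def forward(self, fields):
            # fields: dict of name -> (batch, 1, H, W); only the fields present
            # in a given system contribute to the shared representation.
            return sum(self.proj[name](x) for name, x in fields.items())

    embed = SharedFieldEmbedding(["pressure", "density", "vx", "vy"], d_latent=64)
    z = embed({"pressure": torch.randn(2, 1, 32, 32),
               "density": torch.randn(2, 1, 32, 32)})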
Symbolic Regression on FPGAs for Fast Machine Learning Inference
Tsoi, Ho Fung, Pol, Adrian Alan, Loncar, Vladimir, Govorkova, Ekaterina, Cranmer, Miles, Dasu, Sridhara, Elmer, Peter, Harris, Philip, Ojalvo, Isobel, Pierini, Maurizio
The high-energy physics community is investigating the feasibility of deploying machine-learning-based solutions on Field-Programmable Gate Arrays (FPGAs) to improve physics sensitivity while meeting data processing latency limitations. In this contribution, we introduce a novel end-to-end procedure that utilizes a machine learning technique called symbolic regression (SR). It searches equation space to discover algebraic relations approximating a dataset. We use PySR (software for uncovering these expressions based on an evolutionary algorithm) and extend the functionality of hls4ml (a package for machine learning inference in FPGAs) to support PySR-generated expressions for resource-constrained production environments. Deep learning models often optimize the top metric by pinning the network size, because the vast hyperparameter space prevents an extensive neural architecture search. Conversely, SR selects a set of models on the Pareto front, which allows for optimizing the performance-resource tradeoff directly. By embedding symbolic forms, our implementation can dramatically reduce the computational resources needed to perform critical tasks. We validate our procedure on a physics benchmark: multiclass classification of jets produced in simulated proton-proton collisions at the CERN Large Hadron Collider. We show that we can approximate a 3-layer neural network with an inference model that has an execution time as low as 5 ns (a reduction by a factor of 13) and over 90% approximation accuracy.
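The Pareto-front selection step can be sketched with PySR directly; complexity serves as a rough proxy for FPGA resource usage, and the chosen expression is lowered to a plain callable (the hls4ml conversion to firmware is beyond this sketch, and the data are toy values):

    import numpy as np
    import sympy
    from pysr import PySRRegressor

    X = np.random.randn(200, 3)
    y = 2.5 * X[:, 0] * X[:, 1] - X[:, 2]

    model = PySRRegressor(niterations=30, binary_operators=["+", "-", "*"])
    model.fit(X, y)

    # The Pareto front of discovered expressions: complexity is a rough proxy
    # for FPGA resources, loss for approximation accuracy.
    print(model.equations_[["complexity", "loss", "equation"]])

    # Lower the selected expression to a plain numpy callable; PySR names
    # input features x0, x1, ... by default.
    f = sympy.lambdify(sympy.symbols("x0 x1 x2"), model.sympy(), "numpy")
    y_hat = f(X[:, 0], X[:, 1], X[:, 2])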
Interpretable Machine Learning for Science with PySR and SymbolicRegression.jl
Cranmer, Miles
PySR is an open-source library for practical symbolic regression, a type of machine learning which aims to discover human-interpretable symbolic models. PySR was developed to democratize and popularize symbolic regression for the sciences, and is built on a high-performance distributed back-end, a flexible search algorithm, and interfaces with several deep learning packages. PySR's internal search algorithm is a multi-population evolutionary algorithm, which consists of a unique evolve-simplify-optimize loop, designed for optimization of unknown scalar constants in newly-discovered empirical expressions. PySR's backend is the extremely optimized Julia library SymbolicRegression.jl, which can be used directly from Julia. It is capable of fusing user-defined operators into SIMD kernels at runtime, performing automatic differentiation, and distributing populations of expressions to thousands of cores across a cluster. In describing this software, we also introduce a new benchmark, "EmpiricalBench," to quantify the applicability of symbolic regression algorithms in science. This benchmark measures recovery of historical empirical equations from original and synthetic datasets.
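A minimal usage example, closely following PySR's documentation, including a user-defined operator that the Julia backend fuses into its runtime kernels:

    import numpy as np
    from pysr import PySRRegressor

    # Toy data with a known ground-truth expression.
    X = 2 * np.random.randn(100, 5)
    y = 2.5382 * np.cos(X[:, 3]) + X[:, 0] ** 2 - 0.5

    model = PySRRegressor(
        niterations=40,
        binary_operators=["+", "*"],
        unary_operators=[
            "cos",
            "inv(x) = 1/x",  # custom operator, written in Julia syntax
        ],
        # Tell the Python side how to interpret the custom operator.
        extra_sympy_mappings={"inv": lambda x: 1 / x},
    )
    model.fit(X, y)
    print(model)  # prints the discovered Pareto front of expressions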
The SZ flux-mass ($Y$-$M$) relation at low halo masses: improvements with symbolic regression and strong constraints on baryonic feedback
Wadekar, Digvijay, Thiele, Leander, Hill, J. Colin, Pandey, Shivam, Villaescusa-Navarro, Francisco, Spergel, David N., Cranmer, Miles, Nagai, Daisuke, Anglés-Alcázar, Daniel, Ho, Shirley, Hernquist, Lars
Feedback from active galactic nuclei (AGN) and supernovae can affect measurements of the integrated SZ flux of halos ($Y_\mathrm{SZ}$) from CMB surveys, and cause its relation with the halo mass ($Y_\mathrm{SZ}-M$) to deviate from the self-similar power-law prediction of the virial theorem. We perform a comprehensive study of such deviations using CAMELS, a suite of hydrodynamic simulations with extensive variations in feedback prescriptions. We use a combination of two machine learning tools (random forest and symbolic regression) to search for analogues of the $Y-M$ relation which are more robust to feedback processes for low masses ($M\lesssim 10^{14}\, h^{-1} \, M_\odot$); we find that simply replacing $Y\rightarrow Y(1+M_*/M_\mathrm{gas})$ in the relation makes it remarkably self-similar. This could serve as a robust multiwavelength mass proxy for low-mass clusters and galaxy groups. Our methodology can also be generally useful for improving the domain of validity of other astrophysical scaling relations. We also forecast that measurements of the $Y-M$ relation could provide percent-level constraints on certain combinations of feedback parameters and/or rule out a major part of the parameter space of supernova and AGN feedback models used in current state-of-the-art hydrodynamic simulations. Our results can be useful for using upcoming SZ surveys (e.g., SO, CMB-S4) and galaxy surveys (e.g., DESI and Rubin) to constrain the nature of baryonic feedback. Finally, we find that an alternative relation, $Y-M_*$, provides information on feedback complementary to that from $Y-M$.
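The corrected proxy is explicit in the abstract and easy to illustrate: replace $Y$ with $Y(1+M_*/M_\mathrm{gas})$ and fit the power-law slope in log-log space (all values below are toy numbers, not CAMELS data):

    import numpy as np

    # Toy illustration of the corrected proxy: apply Y -> Y * (1 + M_star/M_gas),
    # then fit Y' = A * M^alpha by linear regression in log-log space.
    rng = np.random.default_rng(0)
    M = 10 ** rng.uniform(12, 14, 500)             # halo masses, toy values
    M_star, M_gas = 0.02 * M, 0.10 * M             # illustrative baryon fractions
    Y = 1e-28 * M ** (5 / 3) * rng.lognormal(0, 0.1, 500)  # self-similar slope 5/3

    Y_corr = Y * (1 + M_star / M_gas)
    alpha, logA = np.polyfit(np.log10(M), np.log10(Y_corr), 1)
    print(f"fitted slope alpha = {alpha:.3f} (self-similar prediction: 5/3)")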