symbolic expression
specifications
This section contains additional details on the object specifications. As mentioned in Section 3, we rely on the PB language to define the structure for each object type that we would like to handle with our model. Our framework supports all basic constructions of the language, including nested messages and oneof clauses. For example, in Listing 1b, we can see that a generic Object can be either an entity or a constraint. We also use oneof for objects that may appear in several mutually exclusive configurations (e.g., CircleArcEntity represents both arcs and closed circles, and for the latter it does not make sense to specify end points). We handle such constructions by injecting an additional token with the discrete value set to the index of the active field.
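The oneof handling described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Arc` and `Circle` classes, the `ONEOF_FIELDS` ordering, the `oneof_index` token name, and the `(name, value)` token layout are all assumptions made for illustration. It only shows the idea of emitting one extra token that carries the index of the active alternative before the tokens of that alternative's fields.

```python
from dataclasses import dataclass
from typing import List, Tuple


# Hypothetical alternatives of a oneof, loosely modelled on the arc/circle
# example in the text. Field names and integer values are assumptions.
@dataclass
class Arc:
    start: int
    end: int


@dataclass
class Circle:
    radius: int


# The position in this tuple plays the role of the oneof field index.
ONEOF_FIELDS = (Arc, Circle)

Token = Tuple[str, int]


def tokenize_oneof(value) -> List[Token]:
    """Emit an index token for the active alternative, then its own fields."""
    index = ONEOF_FIELDS.index(type(value))
    tokens: List[Token] = [("oneof_index", index)]
    # vars() preserves the declaration order of dataclass fields.
    for name, field_value in vars(value).items():
        tokens.append((name, field_value))
    return tokens


if __name__ == "__main__":
    print(tokenize_oneof(Arc(start=2, end=7)))
    print(tokenize_oneof(Circle(radius=5)))
```

In this sketch a closed circle produces no end-point tokens at all, because the index token already tells a decoder which set of fields to expect next.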
EGG-SR: Embedding Symbolic Equivalence into Symbolic Regression via Equality Graph
Jiang, Nan, Wang, Ziyi, Xue, Yexiang
Symbolic regression seeks to uncover physical laws from experimental data by searching for closed-form expressions, which is an important task in AI-driven scientific discovery. Yet the exponential growth of the search space of expressions renders the task computationally challenging. A promising yet underexplored direction for reducing the effective search space and accelerating training lies in symbolic equivalence: many expressions, although syntactically different, define the same function -- for example, $\log(x_1^2x_2^3)$, $\log(x_1^2)+\log(x_2^3)$, and $2\log(x_1)+3\log(x_2)$. Existing algorithms treat such variants as distinct outputs, leading to redundant exploration and slow learning. We introduce EGG-SR, a unified framework that integrates equality graphs (e-graphs) into diverse symbolic regression algorithms, including Monte Carlo Tree Search (MCTS), deep reinforcement learning (DRL), and large language models (LLMs). EGG-SR compactly represents equivalent expressions through the proposed EGG module, enabling more efficient learning by: (1) pruning redundant subtree exploration in EGG-MCTS, (2) aggregating rewards across equivalence classes in EGG-DRL, and (3) enriching feedback prompts in EGG-LLM. Under mild assumptions, we show that embedding e-graphs tightens the regret bound of MCTS and reduces the variance of the DRL gradient estimator. Empirically, EGG-SR consistently enhances multiple baselines across challenging benchmarks, discovering equations with lower normalized mean squared error than state-of-the-art methods. Code implementation is available at: https://www.github.com/jiangnanhugo/egg-sr.
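The equivalence example in this abstract can be checked numerically with a small sketch. This is not the EGG module and does not build an e-graph: it only evaluates the three expressions at random points and compares the results. The function names, sampling range, and tolerance are assumptions, and the identity is tested only for positive $x_1$ and $x_2$, where all three logarithms are defined.

```python
import math
import random


# The three syntactically different forms quoted in the abstract.
def f1(x1, x2):
    return math.log(x1**2 * x2**3)


def f2(x1, x2):
    return math.log(x1**2) + math.log(x2**3)


def f3(x1, x2):
    return 2 * math.log(x1) + 3 * math.log(x2)


def agree_on_samples(funcs, n=100, seed=0, tol=1e-9):
    """Return True if all functions agree at n random positive points."""
    rng = random.Random(seed)
    for _ in range(n):
        # Sample positive inputs only, so every logarithm is defined.
        x1, x2 = rng.uniform(0.1, 10.0), rng.uniform(0.1, 10.0)
        values = [f(x1, x2) for f in funcs]
        for v in values[1:]:
            if not math.isclose(v, values[0], rel_tol=tol, abs_tol=tol):
                return False
    return True


if __name__ == "__main__":
    print(agree_on_samples([f1, f2, f3]))
```

A sampling check of this kind can only refute equivalence or fail to refute it. Rewrite-based representations such as e-graphs instead record equivalences derived from algebraic rules.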
Uncovering Singularities in Feynman Integrals via Machine Learning
Liu, Yuanche, Xu, Yingxuan, Zhang, Yang
High-precision scattering amplitudes are crucial for testing the Standard Model at colliders and modeling gravitational waves from compact binaries. Upcoming experiments such as the HL-LHC, CEPC, FCC-ee, and third-generation gravitational-wave detectors will achieve unprecedented precision, demanding theoretical predictions of comparable accuracy, particularly in the form of accurate multi-loop scattering amplitudes. Around a decade ago, obtaining precise predictions for two-to-three particle collider processes beyond next-to-leading order was widely considered infeasible. This changed with advances in evaluating complicated two-loop Feynman integrals and interpreting them in terms of Chen's iterated integrals. Key steps include deriving and solving differential equations for master integrals and assembling full amplitudes, often with finite-field techniques. In this context, the concept of the symbol alphabet and associated function spaces has become central for multi-loop studies [1, 2]. These tools capture the algebraic structure of iterated integrals, first explored by Chen in the 1970s [3], which naturally arise in canonical-form differential equations [4] and can be expressed as nested d-log integrals.
Symbolic Regression and Differentiable Fits in Beyond the Standard Model Physics
AbdusSalam, Shehu, Abel, Steven, Bartlett, Deaglan, Romão, Miguel Crispim
We demonstrate the efficacy of symbolic regression (SR) to probe models of particle physics Beyond the Standard Model (BSM), by considering the so-called Constrained Minimal Supersymmetric Standard Model (CMSSM). Like many incarnations of BSM physics this model has a number (four) of arbitrary parameters, which determine the experimental signals, and cosmological observables such as the dark matter relic density. We show that analysis of the phenomenology can be greatly accelerated by using symbolic expressions derived for the observables in terms of the input parameters. Here we focus on the Higgs mass, the cold dark matter relic density, and the contribution to the anomalous magnetic moment of the muon. We find that SR can produce remarkably accurate expressions. Using them we make global fits to derive the posterior probability densities of the CMSSM input parameters which are in good agreement with those performed using conventional methods. Moreover, we demonstrate a major advantage of SR which is the ability to make fits using differentiable methods rather than sampling methods. We also compare the method with neural network (NN) regression. SR produces more globally robust results, while NNs require data that is focussed on the promising regions in order to be equally performant.
Evaluating NLP Embedding Models for Handling Science-Specific Symbolic Expressions in Student Texts
Bleckmann, Tom, Tschisgale, Paul
In recent years, natural language processing (NLP) has become integral to educational data mining, particularly in the analysis of student-generated language products. For research and assessment purposes, so-called embedding models are typically employed to generate numeric representations of text that capture its semantic content for use in subsequent quantitative analyses. Yet when it comes to science-related language, symbolic expressions such as equations and formulas introduce challenges that current embedding models struggle to address. Existing research studies and practical applications often either overlook these challenges or remove symbolic expressions altogether, potentially leading to biased research findings and diminished performance of practical applications. This study therefore explores how contemporary embedding models differ in their capability to process and interpret science-related symbolic expressions. To this end, various embedding models are evaluated using physics-specific symbolic expressions drawn from authentic student responses, with performance assessed via two approaches: 1) similarity-based analyses and 2) integration into a machine learning pipeline. Our findings reveal significant differences in model performance, with OpenAI's GPT-text-embedding-3-large outperforming all other examined models, though its advantage over other models was moderate rather than decisive. Overall, this study underscores the importance for educational data mining researchers and practitioners of carefully selecting NLP embedding models when working with science-related language products that include symbolic expressions. The code and (partial) data are available at https://doi.org/10.17605/OSF.IO/6XQVG.
Synthetic Series-Symbol Data Generation for Time Series Foundation Models
Wang, Wenxuan, Wu, Kai, Li, Yujian Betterest, Wang, Dan, Zhang, Xiaoyu
Foundation models for time series analysis (TSA) have attracted significant attention. However, challenges such as training data scarcity and imbalance continue to hinder their development. Inspired by complex dynamic system theories, we design a series-symbol data generation mechanism, enabling the unrestricted creation of high-quality time series data paired with corresponding symbolic expressions. To leverage series-symbol data pairs with strong correlations, we develop SymTime, a pre-trained foundation model for enhancing time series representation using symbolic information. SymTime demonstrates competitive performance across five major TSA tasks when fine-tuned on downstream tasks, rivaling foundation models pre-trained on real-world datasets. This approach underscores the potential of series-symbol data generation and pretraining mechanisms in overcoming data scarcity and enhancing task performance. The code is available at https://github.com/wwhenxuan/SymTime.