relative error
Solving and Learning Partial Differential Equations with Variational Q-Exponential Processes
Solving and learning partial differential equations (PDEs) lies at the core of physicsinformed machine learning. Traditional numerical methods, such as finite difference and finite element approaches, are rooted in domain-specific techniques and often lack scalability. Recent advances have introduced neural networks and Gaussian processes (GPs) as flexible tools for automating PDE solving and incorporating physical knowledge into learning frameworks. While GPs offer tractable predictive distributions and a principled probabilistic foundation, they may be suboptimal in capturing complex behaviors such as sharp transitions or non-smooth dynamics. To address this limitation, we propose the use of the q-exponential process (Q-EP), a recently developed generalization of GPs designed to better handle data with abrupt changes and to more accurately model derivative information. We advocate for Q-EP as a superior alternative to GPs in solving PDEs and associated inverse problems. Leveraging sparse variational inference, our method enables principled uncertainty quantification - a capability not naturally afforded by neural network-based approaches. Through a series of experiments, including the Eikonal equation, Burgers' equation, and an inverse Darcy flow problem, we demonstrate that the variational Q-EP method consistently yields more accurate solutions while providing meaningful uncertainty estimates.
Predictable Scale (Part II) -- Farseer: ARefined Scaling Law in LLMs
Training Large Language Models (LLMs) is prohibitively expensive, creating a critical scaling gap where insights from small-scale experiments often fail to transfer to resource-intensive production systems, thereby hindering efficient innovation. To bridge this, we introduce Farseer, a novel and refined scaling law offering enhanced predictive accuracy across scales. By systematically constructing a model loss surface L(N,D), Farseer achieves a significantly better fit to empirical data than prior laws (e.g., Chinchilla's law). Our methodology yields accurate, robust, and highly generalizable predictions, demonstrating excellent extrapolation capabilities, outperforming Chinchilla's law, whose extrapolation error is 433% higher. This allows for the reliable evaluation of competing training strategies across all (N,D) settings, enabling conclusions from small-scale ablation studies to be confidently extrapolated to predict large-scale performance. Furthermore, Farseer provides new insights into optimal compute allocation, better reflecting the nuanced demands of modern LLM training. To validate our approach, we trained an extensive suite of approximately 1,000 LLMs across diverse scales and configurations, consuming roughly 3 million NVIDIAH100 GPU hours. To foster further research, we are comprehensively open-sourcing all code, data, results 3, all training logs4, all models used in scaling law fitting 5.
Mixture-of-Experts Operator Transformer for Large-Scale PDEPre-Training
Pre-training has proven effective in addressing data scarcity and performance limitations in solving PDE problems with neural operators. However, challenges remain due to the heterogeneity of PDE datasets in equation types, which leads to high errors in mixed training. Additionally, dense pre-training models that scale parameters by increasing network width or depth incur significant inference costs. To tackle these challenges, we propose a novel Mixture-of-Experts Pre-training Operator Transformer (MoE-POT), a sparse-activated architecture that scales parameters efficiently while controlling inference costs. Specifically, our model adopts a layer-wise router-gating network to dynamically select 4 routed experts from 16 expert networks during inference, enabling the model to focus on equationspecific features. Meanwhile, we also integrate 2 shared experts, aiming to capture common properties of PDE and reduce redundancy among routed experts. The final output is computed as the weighted average of the results from all activated experts.
Inferring Asteroseismic Parameters from Short Observations Using Deep Learning: Application to TESS and K2 Red Giants
Ghanghas, Nipun, Dhanpal, Siddharth, Hanasoge, Shravan, Netrapalli, Praneeth, Shanmugam, Karthikeyan
Asteroseismology is the study of resonant oscillations of stars to infer their internal structure and dynamics. It is also a powerful tool for precisely determining stellar parameters such as mass, radius, surface gravity, and age. The ongoing TESS mission, with its nearly complete sky coverage, presents a unique opportunity to uniformly probe stellar populations across the Milky Way. TESS is estimated to have observed more than 300,000 oscillating red giants, most of which have one to two months of observations. Given the scale of this dataset, we need a fast, efficient, and robust way to analyse the data. In this work, our objective is to develop a machine learning (ML) based method to infer asteroseismic parameters from short-duration observations. Specifically, we focus on two global seismic parameters, the large frequency separation ($ฮฮฝ$) and the frequency at maximum power ($ฮฝ_{\mathrm{max}}$), from one-month-long TESS observations of red giants. Meanwhile, for K2 data, our focus extends to inferring the period spacings of dipolar gravity modes ($ฮฮ _{1}$), in addition to $ฮฮฝ$ and $ฮฝ_{\mathrm{max}}$. Our findings demonstrate that our machine learning algorithm can accurately infer $ฮฮฝ$ and $ฮฝ_{\mathrm{max}}$ for approximately 50% of samples created by taking one-month Kepler and K2 observations. For TESS one sector data however, we recover reliable $ฮฮฝ$ for only about 23% of the stars. Additionally, we get reliable $ฮฮ _{1}$ inferences for about 200 young red-giants from K2. For these $ฮฮ _{1}$ inferences, we see a good match with the well known $ฮฮฝ-ฮฮ _{1}$ degenerate sequence observed in Kepler red-giants.
Ratio-based Loss Functions
Helgerth, Lena, Christmann, Andreas
Algorithms in machine learning and AI do critically depend on at least three key components: (i) the risk function, which is the expectation of the loss function, (ii) the function space, which is often called the hypothesis space, and (iii) the set of probability measures, which are allowed for the specified algorithm. This paper gives a survey of a certain class of loss functions, which we call ratio-based. In supervised learning, margin-based loss functions for classification tasks depending on the product of the output values $y_i$ and the predictions $f(x_i)$ as well as distance-based loss functions depending on the difference of $y_i$ and $f(x_i)$ for regression are common. Distance-based loss functions are in particular useful, if an additive model assumption seems plausible, i.e. the common signal plus noise assumption. However, in the literature, several loss functions proposed for regression purposes have a multiplicative error structure in mind and pay attention to relative errors, i.e. to the ratio of $y_i$ and $f(x_i)$. In this survey article, we systematically investigate such ratio-based loss functions and propose a few new losses, which may be interesting for future research. We concentrate on investigating general properties of ratio-based loss functions like continuity, Lipschitz-continuity, convexity, and differentiability, because these properties play a central role in most machine learning algorithms. Therefore, we do not focus on some specific machine learning algorithm to derive universal consistency, learning rates, or stability results. Instead, we want to enable future research in this direction.
Amortized Variational Inference for Joint Posterior and Predictive Distributions in Bayesian Uncertainty Quantification
Bayesian predictive inference propagates parameter uncertainty to quantities of interest through the posterior-predictive distribution. In practice, this is typically performed using a two-stage procedure: first approximating the posterior distribution of model parameters, and then propagating posterior samples through the predictive model via Monte Carlo simulation. This sequential workflow can be computationally demanding, particularly for high-fidelity models such as those governed by partial differential equations. We propose a variational Bayesian framework that directly targets the posterior-predictive distribution and jointly learns variational approximations of both the posterior and the corresponding predictive distribution. The formulation introduces a variational upper bound on the Kullback--Leibler divergence together with moment-based regularization terms. The variational distributions are trained in an amortized manner, shifting computational effort to an offline stage and enabling efficient online inference. Numerical experiments ranging from analytical benchmarks to a finite-element solid mechanics problem demonstrate that the proposed method achieves more accurate predictive distributions than conventional two-stage variational inference, while substantially reducing the cost of online predictive inference.
ANear-Linear Time Algorithm for the Chamfer Distance
Further, the Chamfer distance is often used as a proxy for the more computationally demanding Earth-Mover (Optimal Transport) Distance. However, the quadratic dependence on n in the running time makes the naive approach intractable for large datasets. We overcome this bottleneck and present the first (1+")-approximate algorithm for estimating the Chamfer distance with a near-linear running time. Specifically, our algorithm runs in time O ndlog(n)/"2 and is implementable. Our experiments demonstrate that it is both accurate and fast on large high-dimensional datasets. We believe that our algorithm will open new avenues for analyzing large highdimensional point clouds. We also give evidence that if the goal is to report a (1+")-approximate mapping from A to B (as opposed to just its value), then any sub-quadratic time algorithm is unlikely to exist.
Accelerated Training of Physics-Informed Neural Networks (PINNs) using Meshless Discretizations
Physics-informed neural networks (PINNs) are neural networks trained by using physical laws in the form of partial differential equations (PDEs) as soft constraints. We present a new technique for the accelerated training of PINNs that combines modern scientific computing techniques with machine learning: discretely-trained PINNs (DT-PINNs). The repeated computation of the partial derivative terms in the PINN loss functions via automatic differentiation during training is known to be computationally expensive, especially for higher-order derivatives. DT-PINNs are trained by replacing these exact spatial derivatives with high-order accurate numerical discretizations computed using meshless radial basis function-finite differences (RBF-FD) and applied via sparse-matrix vector multiplication. While in principle any high-order discretization may be used, the use of RBF-FD allows for DT-PINNs to be trained even on point cloud samples placed on irregular domain geometries.