Statistical Learning
Inferring Asteroseismic Parameters from Short Observations Using Deep Learning: Application to TESS and K2 Red Giants
Ghanghas, Nipun, Dhanpal, Siddharth, Hanasoge, Shravan, Netrapalli, Praneeth, Shanmugam, Karthikeyan
Asteroseismology is the study of resonant oscillations of stars to infer their internal structure and dynamics. It is also a powerful tool for precisely determining stellar parameters such as mass, radius, surface gravity, and age. The ongoing TESS mission, with its nearly complete sky coverage, presents a unique opportunity to uniformly probe stellar populations across the Milky Way. TESS is estimated to have observed more than 300,000 oscillating red giants, most of which have one to two months of observations. Given the scale of this dataset, we need a fast, efficient, and robust way to analyse the data. In this work, our objective is to develop a machine learning (ML) based method to infer asteroseismic parameters from short-duration observations. Specifically, we focus on two global seismic parameters, the large frequency separation ($\Delta\nu$) and the frequency at maximum power ($\nu_{\mathrm{max}}$), from one-month-long TESS observations of red giants. For K2 data, we additionally infer the period spacing of dipolar gravity modes ($\Delta\Pi_{1}$), alongside $\Delta\nu$ and $\nu_{\mathrm{max}}$. Our findings demonstrate that our machine learning algorithm can accurately infer $\Delta\nu$ and $\nu_{\mathrm{max}}$ for approximately 50% of samples created by taking one-month Kepler and K2 observations. For one-sector TESS data, however, we recover reliable $\Delta\nu$ for only about 23% of the stars. Additionally, we get reliable $\Delta\Pi_{1}$ inferences for about 200 young red giants from K2. For these $\Delta\Pi_{1}$ inferences, we see a good match with the well-known $\Delta\nu$-$\Delta\Pi_{1}$ degenerate sequence observed in Kepler red giants.
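As a hedged illustration of why $\Delta\nu$ and $\nu_{\mathrm{max}}$ are useful for stellar parameters, the sketch below applies the standard asteroseismic scaling relations, which estimate mass and radius from the two seismic parameters and an effective temperature. These relations and the solar reference values are commonly used in the field but are not stated in the abstract, so treat them as assumptions; the function name is hypothetical.

```python
# Solar reference values (commonly adopted; assumptions, not from the abstract).
NU_MAX_SUN = 3090.0    # frequency at maximum power, microhertz
DELTA_NU_SUN = 135.1   # large frequency separation, microhertz
TEFF_SUN = 5777.0      # effective temperature, kelvin


def seismic_mass_radius(nu_max, delta_nu, teff):
    """Return (mass, radius) in solar units from the standard scaling relations.

    M ~ (nu_max)^3 (delta_nu)^-4 (teff)^1.5 and
    R ~ (nu_max)   (delta_nu)^-2 (teff)^0.5, all in solar-normalised units.
    """
    a = nu_max / NU_MAX_SUN
    b = delta_nu / DELTA_NU_SUN
    c = teff / TEFF_SUN
    mass = a**3 * b**-4 * c**1.5
    radius = a * b**-2 * c**0.5
    return mass, radius


# Illustrative red-giant-like inputs (made-up values, not from any catalogue).
mass, radius = seismic_mass_radius(nu_max=30.0, delta_nu=4.0, teff=4800.0)
print(f"mass ~ {mass:.2f} Msun, radius ~ {radius:.1f} Rsun")
```

Because mass depends on $\Delta\nu^{-4}$ and $\nu_{\mathrm{max}}^{3}$, small errors in the inferred seismic parameters are strongly amplified, which is one reason reliable inference from short observations matters.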
Empirical Bayes Rebiasing
Ling, Wanyi, Li, Sida, Guan, Junming, Ignatiadis, Nikolaos
We study methods for simultaneous analysis of many noisy and biased estimates, each paired with an even noisier estimate of its own bias. The analyst's goal is to construct short calibrated intervals for each parameter. The standard debiasing approach, which subtracts the bias estimate from each biased estimate, inflates variance and yields long intervals. In this paper, we propose an empirical Bayes rebiasing strategy that starts from the fully debiased estimates and learns from data how much bias to reintroduce by estimating the unknown bias distribution. We provide convergence rates for the coverage of our intervals when the bias distribution is estimated using nonparametric maximum likelihood. Furthermore, we demonstrate substantial precision gains in prediction-powered inference, including pairwise LLM win-rate evaluations, as well as for inference of direct genetic effects in family-based GWAS.
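To make the variance-inflation point concrete, here is a short worked calculation under simplifying assumptions that are ours, not necessarily the paper's: the biased estimate $Z_i$ has variance $\sigma_i^2$, the bias estimate $\hat{b}_i$ is unbiased for the bias with variance $\tau_i^2$, and the two are independent.

```latex
\operatorname{Var}\bigl(Z_i - \hat{b}_i\bigr) = \sigma_i^2 + \tau_i^2,
\qquad
\text{95\% half-width} \approx 1.96\,\sqrt{\sigma_i^2 + \tau_i^2}.
% If the bias estimate is noisier than the estimate itself (\tau_i > \sigma_i),
% then \sqrt{\sigma_i^2 + \tau_i^2} > \sqrt{2}\,\sigma_i.
```

Under these assumptions, full debiasing yields intervals more than $\sqrt{2}$ times as long as an interval built from $\sigma_i$ alone, which would be valid only if the bias were zero. Reintroducing a data-driven amount of bias is a way to trade between these two extremes.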
Horseshoe Forests for High-Dimensional Causal Survival Analysis
Jacobs, Tijn, van Wieringen, Wessel N., van der Pas, Stéphanie L.
We develop a Bayesian tree ensemble model to estimate heterogeneous treatment effects in censored survival data with high-dimensional covariates. Instead of imposing sparsity through the tree structure, we place a horseshoe prior directly on the step heights to achieve adaptive global-local shrinkage. This strategy allows flexible regularisation and reduces noise. We develop a reversible jump Gibbs sampler to accommodate the non-conjugate horseshoe prior within the tree ensemble framework. We show through extensive simulations that the method accurately estimates treatment effects in high-dimensional covariate spaces, at various sparsity levels, and under non-linear treatment effect functions. We further illustrate the practical utility of the proposed approach by a re-analysis of pancreatic ductal adenocarcinoma (PDAC) survival data from The Cancer Genome Atlas.
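The global-local shrinkage idea can be illustrated by sampling from a horseshoe prior: each coefficient gets its own half-Cauchy local scale, multiplied by a shared global scale. The sketch below is a generic horseshoe sampler, not the paper's tree-ensemble sampler; the function name and the choice of `tau` are hypothetical.

```python
import math
import random


def sample_horseshoe(n, tau, rng=None):
    """Draw n values h_j ~ N(0, (lambda_j * tau)^2) with lambda_j ~ half-Cauchy(0, 1).

    tau is the global scale (shared); lambda_j is the local scale (per value).
    """
    rng = rng or random.Random(0)
    draws = []
    for _ in range(n):
        # Inverse-CDF draw from a standard Cauchy, folded to the positive half-line.
        lam = abs(math.tan(math.pi * (rng.random() - 0.5)))
        draws.append(rng.gauss(0.0, lam * tau))
    return draws


draws = sample_horseshoe(n=1000, tau=0.1)
abs_sorted = sorted(abs(d) for d in draws)
print("median |h|:", abs_sorted[len(abs_sorted) // 2])
print("max |h|:", abs_sorted[-1])
```

Most draws sit close to zero while a few are very large, which is the behaviour that lets the prior shrink noise aggressively without flattening genuine signals.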
A Novel Computational Framework for Causal Inference: Tree-Based Discretization with ILP-Based Matching
Yang, Tianyu, Noor-E-Alam, Md.
Causal inference is essential for data-driven decision-making, as it aims to uncover causal relationships from observational data. However, identifying causality remains challenging due to the potential for confounding and the distinction between correlation and causation. While recent advances in causal machine learning and matching algorithms have improved estimation accuracy, these methods often face trade-offs between interpretability and computational efficiency. This paper proposes a novel approach that combines a tree-based discretization technique, tailored for causal inference, with an integer linear programming-based matching algorithm. The discretization ensures approximately linear relationships for control datasets within strata, enabling effective matching, while the optimization framework optimizes for global balance. The resulting algorithm yields computational efficiency and less biased ATT estimates compared to state-of-the-art algorithms. Empirical evaluations demonstrate the proposed method's practical advantages over existing techniques in causal inference scenarios.
PRCD-MAP: Learning How Much to Trust Imperfect Priors in Causal Discovery
External priors of unknown reliability create a brittle trade-off in causal discovery: blind trust amplifies errors, blind rejection wastes signal. Real priors are also heterogeneously reliable -- physical laws are trustworthy, LLM-suggested edges are speculative -- yet existing methods either ignore priors or impose them through globally uniform trust. We propose PRCD-MAP, a soft prior-consumption layer that assigns per-edge trust to an imperfect prior and uses it to modulate a prior-aware $\ell_1$ and prior-weighted $\ell_2$ regularizer in a MAP objective. Trust is calibrated by empirical Bayes on a Laplace-approximated marginal likelihood and propagated along the prior graph by an MLP, so data-confirmed neighborhoods boost trust and contradictions suppress it. PRCD-MAP enjoys a population-level safety guarantee: it is $\varepsilon$-safe in expectation over the prior-generation distribution, with $\varepsilon\leq C\cdot\mathrm{acc}(1{-}\mathrm{acc})\cdot d^2/T$ at the parametric $T^{-1}$ rate and vanishing at the prior-quality endpoints. When the prior is uninformative, learned trust provably collapses to its floor and the method recovers a no-prior baseline. Empirically, on real CausalTime data PRCD-MAP exploits informative LLM priors (LLM-prior gain $+0.067/+0.089$ AUROC on AQI/Medical over a no-prior PRCD-MAP backbone; combined backbone+prior lead $+0.123/+0.043$ over PCMCI+), auto-attenuates on the anonymous-variable Traffic stress test, and retains a lead at $d{=}300$; against BayesDAG, the closest soft-Bayesian baseline, PRCD-MAP wins on every CausalTime dataset under a matched $W_0$-only protocol. A four-way ablation isolates each component: EB calibration and MLP trust propagation jointly carry the plurality of the gain, with positive sign on every dataset. Extensions to nonlinear (NAM) and cross-sectional settings show the calibrated-trust principle is setting-agnostic.
Dynamic Vine Copulas: Detecting and Quantifying Time-Varying Higher-Order Interactions
Safaai, Houman, Vargas, Alessandro Marin
Time-varying dependence is often modeled with dynamic correlations or Gaussian graphical models, but multivariate systems can change through tail behavior, asymmetry, or conditional structure even when correlations are nearly stable. We introduce Dynamic Vine Copulas (DVC), a temporal vine-copula framework for estimating and diagnosing sequence-wide non-Gaussian dependence. DVC fixes a chosen vine factorization for comparability; the framework applies to C-, D-, and R-vines, and our experiments use fixed-root-order C-vines. Pair-copula states evolve through smooth parameter trajectories or temporally regularized family-switching paths. The main diagnostic is a held-out comparison between a full vine and its matched 1-truncated version, which separates flexible first-tree pairwise dependence from evidence contributed by higher-tree conditional terms. At the population level, under a correct fixed vine and the simplifying assumption, this contrast equals the higher-tree component of a vine total-correlation decomposition; in finite samples, it is a predictive diagnostic. In controlled benchmarks, DVC detects Student-t degrees-of-freedom changes, Clayton-to-Gumbel switches, and recurrent conditional-interaction episodes missed or conflated by Gaussian dynamic baselines. The higher-tree score remains near zero in pairwise-only regimes and rises during conditional-interaction regimes. On Allen Visual Behavior Neuropixels data, DVC identifies a reproducible time-indexed higher-tree signal that is positive across held-out splits and vanishes under a decorrelated null, indicating simultaneous cross-area dependence. DVC therefore provides a flexible temporal copula model and an interpretable test of whether temporal dependence changes are pairwise or conditional.
Forecasting Oncology Demand Trends with Boosting-Based Bayesian Conjugate Models
Neto, Ademir Batista dos Santos, Ferreira, Tiago Alessandro Espinola, Firmino, Paulo Renato Alves
Accurate trend forecasting in healthcare time series is essential for planning and resource allocation. This paper proposes a Bayesian framework for predicting oncology demand trends, modeling weekly appointments as a Poisson process with a Gamma prior on the demand rate. To enhance adaptability and capture persistent directional patterns, we incorporate a residual-based boosting mechanism grounded in a Gamma-Log-Normal conjugate structure. This boosting approach allows the model to track both short- and long-term trend shifts while maintaining the analytical tractability of conjugate Bayesian updating. The methodology was evaluated on real oncology service data from Cariri, Ceará, Brazil, and compared against established baselines, including linear regression, ARIMA, naive forecasting, LSTM neural networks, and XGBoost. Results show that the proposed model outperforms competing methods in trend detection accuracy, with gains in the percentage of correctly predicted directions reaching 38.25% over the second-best approach in some cases.
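The Poisson-Gamma building block has a well-known closed-form update, sketched below. This covers only the base conjugate step with a Gamma(shape, rate) prior; it does not implement the residual-based boosting mechanism, and the function name and numbers are hypothetical.

```python
def gamma_poisson_update(alpha, beta, counts):
    """Posterior Gamma(shape, rate) for a Poisson rate after observing counts.

    Prior: rate ~ Gamma(alpha, beta) with beta a rate parameter.
    Posterior: Gamma(alpha + sum(counts), beta + len(counts)).
    """
    return alpha + sum(counts), beta + len(counts)


# Made-up weekly appointment counts, for illustration only.
post_alpha, post_beta = gamma_poisson_update(alpha=2.0, beta=1.0, counts=[10, 12, 8])
posterior_mean = post_alpha / post_beta
print(post_alpha, post_beta, posterior_mean)
```

The posterior mean is a weighted compromise between the prior mean `alpha / beta` and the sample mean of the counts, and each new week of data only requires updating two numbers.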
Bayesian Rain Field Reconstruction using Commercial Microwave Links and Diffusion Model Priors
Moufad, Badr, Ilina, Albina, Habi, Hai Victor, Lahlou, Salem, Janati, Yazid, Messer, Hagit, Moulines, Eric
Commercial Microwave Links (CMLs) offer dense spatial coverage for rainfall sensing but produce path-integrated measurements that make accurate ground-level reconstruction challenging. Existing methods typically oversimplify CMLs as point sensors and neglect line integration relating rainfall to signal attenuation, resulting in degraded performance under heterogeneous precipitation. In this work, we view rain field reconstruction as a Bayesian inverse problem with Diffusion Models (DMs) as high-fidelity spatial priors. We show that diffusion models better preserve key rainfall statistics compared to censored Gaussian processes. Framing rainfall estimation as a Bayesian inverse problem with a DM prior enables training-free posterior sampling using a broad family of methods, including Plug-and-Play, Sequential Monte Carlo, and Replica Exchange methods. Experiments on synthetic and real-world datasets demonstrate consistent improvements over established CML-based reconstruction baselines.
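The difference between a point sensor and a path-integrated measurement can be sketched with a simple forward model: total attenuation is the integral along the link of a power law in the local rain rate, in the spirit of the widely used $k R^{\alpha}$ specific-attenuation relation. This is a minimal sketch and not the paper's observation model; the function name, the nearest-cell lookup, and the values of `k` and `alpha` are uncalibrated placeholders.

```python
import math


def path_attenuation(rain, cell_km, start, end, k=0.1, alpha=1.0, n_steps=1000):
    """Approximate total attenuation (dB) along a straight link.

    rain: 2D list rain[row][col] of rain rates (mm/h) on a square grid
    cell_km: grid cell size in km
    start, end: (x, y) link endpoints in km, with non-negative coordinates
    k, alpha: power-law coefficients (placeholder values, not calibrated)
    """
    (x0, y0), (x1, y1) = start, end
    length = math.hypot(x1 - x0, y1 - y0)
    ds = length / n_steps
    total = 0.0
    for i in range(n_steps):
        t = (i + 0.5) / n_steps  # midpoint rule along the link
        x = x0 + t * (x1 - x0)
        y = y0 + t * (y1 - y0)
        # Nearest-cell lookup, clamped to the grid boundary.
        row = min(int(y / cell_km), len(rain) - 1)
        col = min(int(x / cell_km), len(rain[0]) - 1)
        total += k * rain[row][col] ** alpha * ds
    return total


# Rain covers only the first half of a 4 km link.
half_wet = [[10.0, 10.0, 0.0, 0.0]]
print(path_attenuation(half_wet, cell_km=1.0, start=(0.0, 0.5), end=(4.0, 0.5)))
```

A point-sensor simplification would assign a single rain rate to the whole link, whereas the integral only constrains the accumulated effect along the path, so many different fields are consistent with the same measurement.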
In-Context Positive-Unlabeled Learning
Liu, Siyan, Chang, Yi, Cheng, Manli, Tian, Qinglong, Li, Pengfei
Positive-unlabeled (PU) learning addresses binary classification when only a set of labeled positives is available alongside a pool of unlabeled samples drawn from a mixture of positives and negatives. Existing PU methods typically require dataset-specific training or iterative optimization, which limits their applicability when many tasks must be solved quickly or with little tuning. We introduce PUICL, a pretrained transformer that solves PU classification entirely through in-context learning. PUICL is pretrained on synthetic PU datasets generated from randomly instantiated structural causal models, exposing it to a wide range of feature-label relationships and class-prior configurations. At inference time, PUICL receives the labeled positives and the unlabeled samples as a single input and returns class probabilities for the unlabeled rows in one forward pass, with no gradient updates or per-task fitting. On 20 semi-synthetic PU benchmarks derived from the UCI Machine Learning Repository, OpenML, and scikit-learn, PUICL outperforms four standard PU learning baselines in average AUC and accuracy, and is competitive on F1-score. These results show that the in-context learning paradigm extends naturally beyond fully supervised tabular prediction to the semi-supervised PU setting.
Variational Smoothing and Inference for SDEs from Sparse Data with Dynamic Neural Flows
Stochastic differential equations (SDEs) provide a flexible framework for modeling temporal dynamics in partially observed systems. A central task is to calibrate such models from data, which requires inferring latent trajectories and parameters from sparse, noisy observations. Classical smoothing methods for this problem are often limited by path degeneracy and poor scalability. In this work, we develop a novel method based on a characterization of the posterior SDE in terms of a conditional backward-in-time score, defined as the gradient of a function solving a Kolmogorov backward equation with multiplicative updates at observation times. We learn this conditional score using neural networks trained to satisfy both the governing PDE and the observation-induced jump conditions, thereby integrating continuous-time dynamics with discrete Bayesian updates. The resulting score induces a posterior SDE with the same diffusion coefficient but a modified drift, enabling efficient posterior trajectory sampling. We further derive a likelihood-based objective for learning the SDE parameters, yielding an evidence lower bound (ELBO) for joint state smoothing and parameter estimation. This leads to a variational EM-style procedure, where the neural conditional score is optimized to approximate the smoothing distribution, followed by a maximization step over the SDE parameters using samples from the induced posterior. Experiments on nonlinear systems demonstrate accurate and stable inference with very few observations and significantly improved scalability compared to classical MCMC methods.
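Sampling trajectories from an SDE with a given drift and a constant diffusion coefficient can be sketched with the Euler-Maruyama scheme. This is a generic simulator under our own assumptions (scalar state, constant `sigma`, hypothetical function names); it does not implement the learned conditional score. In the setting described above, posterior sampling would amount to calling the same routine with a modified, time-dependent drift.

```python
import math
import random


def euler_maruyama(drift, sigma, x0, t_end, n_steps, rng=None):
    """Simulate dX = drift(X, t) dt + sigma dW on [0, t_end]; return the path."""
    rng = rng or random.Random(0)
    dt = t_end / n_steps
    x = x0
    path = [x0]
    for i in range(n_steps):
        t = i * dt
        dw = rng.gauss(0.0, math.sqrt(dt))  # Brownian increment over one step
        x = x + drift(x, t) * dt + sigma * dw
        path.append(x)
    return path


def ou_drift(x, t):
    """Mean-reverting drift used as a toy example."""
    return -x


path = euler_maruyama(ou_drift, sigma=0.5, x0=1.0, t_end=1.0, n_steps=1000)
print(len(path), path[-1])
```

Keeping the diffusion term fixed and changing only the `drift` callable mirrors the structure noted in the abstract, where the posterior SDE shares the diffusion coefficient of the prior dynamics.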