Genre
NeuralSurv: Deep Survival Analysis with Bayesian Uncertainty Quantification
We introduce, the first deep survival model to incorporate Bayesian uncertainty quantification. Our non parametric, architecture agnostic framework flexibly captures time varying covariate-risk relationships in continuous time via a novel two stage data augmentation scheme, for which we establish theoretical guarantees. For efficient posterior inference, we introduce a mean field variational algorithm with coordinate ascent updates that scale linearly in model size. By locally linearizing the Bayesian neural network, we obtain full conjugacy and derive all coordinate updates in closed form. In experiments, delivers superior calibration compared to state-of-the-art deep survival models, while matching or exceeding their discriminative performance across both synthetic benchmarks and real-world datasets. Our results demonstrate the value of Bayesian principles in data scarce regimes by enhancing model calibration and providing robust, well calibrated uncertainty estimates for the survival function.
Large Language Diffusion Models
The capabilities of large language models (LLMs) are widely regarded as relying on autoregressive models (ARMs). We challenge this notion by introducing, a diffusion model trained from scratch under the pre-training and supervised fine-tuning (SFT) paradigm. LLaDA employs a forward data masking process and a reverse generation process, parameterized by a Transformer to predict masked tokens. It provides a principled generative approach for probabilistic inference by optimizing a likelihood lower bound. Across extensive benchmarks on general tasks, math, code, and so on, LLaDA demonstrates strong and performs comparably to our self-constructed ARM baselines. Remarkably, LLaDA 8B is competitive with strong LLMs like LLaMA3 8B in and, after SFT, exhibits impressive abilities in case studies such as multi-turn dialogue. Moreover, LLaDA addresses the reversal curse, surpassing GPT-4o in a reversal poem completion task. Our findings show the promise of diffusion models for language modeling at scale and challenge the common assumption that core LLM capabilities discussed above inherently depend on ARMs.
Structure-Aware Spectral Sparsification via Uniform Edge Sampling
Spectral clustering is a fundamental method for graph partitioning, but its reliance on eigenvector computation limits scalability to massive graphs. Classical sparsification methods preserve spectral properties by sampling edges proportionally to their effective resistances, but require expensive preprocessing to estimate these resistances. We study whether uniform edge sampling--a simple, structure-agnostic strategy--can suffice for spectral clustering. Our main result shows that for graphs admitting a well-separated $k$-clustering, characterized by a large structure ratio $\Upsilon(k) = \lambda_{k+1} / \rho_G(k)$, uniform sampling preserves the spectral subspace used for clustering. Specifically, we prove that uniformly sampling $O(\gamma^2 n \log n / \varepsilon^2)$ edges, where $\gamma$ is the Laplacian condition number, yields a sparsifier whose top $(n-k)$-dimensional eigenspace is approximately orthogonal to the cluster indicators.
Theory of learning of high-dimensional controlled non-linear dynamical systems (I): models and methods
Neural ordinary differential equations (neural ODEs) have rapidly gained prominence as a powerful and unifying framework for conceptualizing artificial neural networks, elegantly connecting the continuous-time modeling of dynamical systems with the discrete, data-driven paradigm of modern deep learning. Beyond their practical advantages they offer fresh theoretical insights into the training and generalization properties of neural networks. The distinctive feature of this framework is its dual dynamical nature: inference dynamics, which govern the ODE evolution during forward computation, and training dynamics, which control the optimization of model parameters. This makes neural ODEs a particularly well-suited theoretical framework for studying a large variety of settings such as multi-layer neural networks (ResNets for example), autoregressive models (with next-token generation dynamics), generative models, and recurrent neural networks in theoretical neuroscience. In this work, we introduce a theoretically grounded class of models for studying neural ODEs trained via online stochastic gradient descent. We solve the training dynamics of these models via dynamical mean field theory and derive learning curves in the high-dimensional limit.
On McDiarmid's Inequality under Dependence via Approximate Tensorization of Entropy
We argue that dependent versions of McDiarmid's inequality are a useful but underutilized tool in mathematical statistics, learning theory and theoretical computer science. To make this point, we first highlight that approximate tensorization of entropy (ATE) implies McDiarmid's via the Entropy Method. Second, we derive McDiarmid's inequality for non-isotropic Gaussian random vectors $X \sim \mathcal N(ฮผ, ฮฃ)$ through ATE with a constant of the order of the condition number of $ฮฃ$. We both independently obtain this ATE through a simple application of stochastic localization and also discuss how a more general ATE for the Gibbs sampler due to Ascolani et al., 2026 generalizes McDiarmid's-like concentration to strongly log-concave and log-smooth probability measures. We then apply the resulting concentration inequalities to resolve a question on the concentration of $\operatorname{sign}(X)$ posed by Simone Bombari, investigate Erdลs-Rรฉnyi graphs under dependence and prove a Dvoretzky-Kiefer-Wolfowitz-type inequality for observations from a joint measure fulfilling ATE and continuous marginal CDFs. For the class of strongly log-concave and log-smooth measures, this result improves upon a prior Dvoretzky-Kiefer-Wolfowitz-type inequality for non-i.i.d. observations due to Bobkov and Gรถtze, 2010, by establishing the expected $1/\sqrt{n}$-rate of convergence under weak dependence instead of $n^{-1/3}$.
Robust State-Conditional Feature-Weighted Jump Models for Temporal Clustering
Cortese, Federico P., Farcomeni, Alessio
A penalty is used to encourage smoothness of transitions over time, while robustness is achieved throughthe use of aTukey's biweight loss function. Anadditional parameter controls the variability of feature weights across states, allowing the model to assign state-specific relevance to each feature. We illustrate in simulation how the method accurately recovers the true cluster sequence and reliably identifies relevant features, outperforming competing approaches, particularly in the presence of outliers. We conclude with two empirical applications, one on the number of conflict-related homicides in Kosovo in the period 1998-2000, and another on macroeconomic performance of twelve European countries in the period 1949-2024.Keywords: Dissimilarity-based clustering, regime-switching models, time series analysis, unsupervised learning, variable importance.
Computationally tractable robust differentially private mean estimation
We develop a new, differentially private mean estimator called the balloon mean. The main features of the balloon mean are that it is computationally tractable and enjoys robustness to outlying observations. It is based on an iterative clipping procedure over expanding Mahalanobis balls, or ``balloons.'' The method satisfies zero-concentrated differential privacy and depends on a small number of interpretable tuning parameters. We provide theoretical guarantees under heavy-tailed and contaminated elliptical models, characterizing its statistical performance and robustness to outliers. Extensive simulations demonstrate that the balloon mean is robust to heavy-tailed and contaminated data, and outperforms existing differentially private mean estimators in contaminated settings.
Reliability of Probabilistic Emulation of Physical Systems
Greenbury, Sam F., Jersakova, Radka, Conti, Paolo, Famili, Marjan, Sprague, Christopher Iliffe, Brown, Edwin, McEwen, Jason D.
Two dominant approaches have emerged for generating probabilistic forecasts of physical systems: generative models, such as diffusion or flow matching; and ensembles of deterministic models with stochasticity injected, trained using the continuous ranked probability score (CRPS) loss. While both approaches have demonstrated strong predictive accuracy, the reliability of their uncertainties has not been systematically assessed. We address this gap by developing a framework to evaluate both approaches across diverse 2D spatiotemporal physical systems, under matched model size and computational budget. We assess the reliability of probabilistic emulation by inspecting the empirical coverage of predictive intervals, while also considering accuracy and computational efficiency metrics. CRPS-trained ensembles typically achieve more reliable uncertainties on both single-step prediction and autoregressive rollouts, demonstrating better coverage than the standard alternative of training generative models in a latent space. Moreover, the CRPS approach offers significantly faster inference. When generative models are trained in ambient rather than a compressed latent space, which is often infeasible for high-dimensional problems, they exhibit comparable coverage to CRPS-trained ensembles, though with substantially larger inference latency. In contrast, when CRPS-trained ensembles are trained in latent space they do not show a marked degradation in coverage with respect to ambient space. Both generative models and CRPS-trained ensembles demonstrate good predictive accuracy. To facilitate future research and application, we release AutoCast, a modular framework implementing both generative models and CRPS-trained ensembles, alongside AutoSim, a flexible dataset generation package for rapid prototyping.
A Two-Parameter Weibull Framework for Diagnosing Transformer Weight Distributions
We apply the Weibull distribution -- a two-parameter family from extreme-value theory -- as a diagnostic framework for element-wise weight magnitude distributions in transformers. At initialization, i.i.d. Gaussian weights give |w| ~ HalfNormal, yielding k ~ 1.20 via middle-80% probability-plot fit (the protocol used throughout this work). This anchor makes k a principled, architecture-independent measuring stick for training dynamics; fitting each weight matrix independently at every layer at every checkpoint enables per-component, per-layer, and per-step diagnostics that aggregate statistics cannot resolve. Applying this framework to 12 model entries spanning 7 architectural families (Pythia, OLMo-1/2, LLaMA-3, Mistral, Qwen2.5/3) reveals three findings. First, FFN modules and the attention output projection W_o -- the Transmission Class -- fall in a narrow k band: median terminal k in [1.186, 1.204] across 12 entries (cross-family CV = 0.51%), shared across SwiGLU/GeLU activations, Pre-LN/QK-Norm placements, and 70M-14B sizes. Second, the attention input projections W_q, W_k -- the Selection Class -- depart from the Weibull family, with severity shaped by storage: separately-stored Q/K (OLMo-1, OLMo-2) yields k in [0.76, 0.99] (deep); GQA models yield k in [1.10, 1.16] (mild); Pythia's merged W_qkv occupies a transitional zone tracking training budget T/tau monotonically. Third, lambda grows substantially during training and scales with sqrt(eta/lambda_wd) within the Pythia family (Pearson r = 0.94, three Transmission kinds), directionally consistent with Fan et al. (2025). The two parameters carry independent information: k labels the functional class, lambda labels training progress. We release npm-weibull-py v0.4 (Python library) and DATABASE_v9_1 at https://github.com/tiexinding/NPM-Weibull-public .
Diffusion-Network Alignment: An Efficient Algorithm and Explicit Probability Bounds
This paper studies a variation of the classic network alignment problem, named diffusion-network alignment. The goal is to align the vertices of a rooted diffusion tree to the vertices of a network, where the diffusion tree could be from a communication trace or contact tracing, and the network could be an online or offline social network. Different from the classic network alignment where both networks are fully observed, this model captures the information asymmetry of two networks. To solve this problem, this paper presents an efficient algorithm based on tree correlation tests to extract alignment information from local neighborhoods. We analyze the performance of the algorithm in the sparse graph regime and show that with high probability, all matched pairs are correct. Furthermore, for each vertex on the diffusion tree, this paper establishes an explicit lower bound on the probability that the vertex is correctly matched. These lower bounds are depth-dependent and increase as vertices get closer to the root.