
Neural Bayes inference for complex bivariate extremal dependence models

arXiv.org Machine Learning

Likelihood-free approaches are appealing for performing inference on complex dependence models, either because it is not possible to formulate a likelihood function or because its evaluation is very computationally costly. This is the case for several models available in the multivariate extremes literature, particularly for the most flexible tail models, including those that interpolate between the two key dependence classes of 'asymptotic dependence' and 'asymptotic independence'. We focus on approaches that leverage neural networks to approximate Bayes estimators. In particular, we explore the properties of neural Bayes estimators for parameter inference in several models that are flexible but computationally expensive to fit, with a view to aiding their routine implementation. Owing to the absence of likelihood evaluation in the inference procedure, classical information criteria such as the Bayesian information criterion cannot be used to select the most appropriate model. Instead, we propose using neural networks as neural Bayes classifiers for model selection. Our goal is to provide a toolbox for simple, fast fitting and comparison of complex extreme-value dependence models, where the best model is selected for a given data set and its parameters are subsequently estimated using neural Bayes estimation. We apply our classifiers and estimators to analyse the pairwise extremal behaviour of changes in horizontal geomagnetic field fluctuations at three different locations.
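
As a rough illustration of the neural Bayes estimation recipe, the sketch below trains a permutation-invariant (DeepSets-style) network to map simulated replicates to a parameter estimate under an L1 loss, whose Bayes estimator is the posterior median. The toy Gaussian simulator, the uniform prior, and the network sizes are placeholders, not the extremal dependence models studied in the paper.

```python
# Minimal sketch of a neural Bayes estimator; simulator, prior, and
# network sizes are illustrative stand-ins, not the paper's models.
import torch
import torch.nn as nn

def sample_prior(n):
    # Hypothetical uniform prior on a single dependence parameter.
    return torch.rand(n, 1) * 0.9 + 0.05

def simulate(theta, m=100):
    # Placeholder simulator: m bivariate replicates whose correlation
    # is controlled by theta (stands in for an extremal dependence model).
    z = torch.randn(theta.shape[0], m, 2)
    z[..., 1] = theta * z[..., 0] + (1 - theta**2).sqrt() * z[..., 1]
    return z

# DeepSets-style estimator: permutation-invariant over replicates.
phi = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 64))
rho = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(list(phi.parameters()) + list(rho.parameters()), 1e-3)

for step in range(2000):
    theta = sample_prior(256)                # draw parameters from the prior
    x = simulate(theta)                      # simulate data given parameters
    est = rho(phi(x).mean(dim=1))            # pool over replicates, estimate
    loss = (est - theta).abs().mean()        # L1 loss -> posterior median
    opt.zero_grad(); loss.backward(); opt.step()
```

Once trained, the estimator amortises inference: applying it to a new data set costs a single forward pass, which is what makes routine fitting of otherwise expensive models feasible.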


Data-driven Seasonal Climate Predictions via Variational Inference and Transformers

arXiv.org Machine Learning

Most operational climate services providers base their seasonal predictions on initialised general circulation models (GCMs) or on statistical techniques fit to past observations. GCMs require substantial computational resources, which limits their capacity; statistical methods, in contrast, often lack robustness due to short historical records. Recent works propose machine learning methods trained on climate model output, leveraging larger sample sizes and simulated scenarios. Yet many of these studies focus on prediction tasks that may be restricted in spatial extent or temporal coverage, leaving a gap relative to existing operational predictions. The present study therefore evaluates the effectiveness of a methodology that combines variational inference with transformer models to predict fields of seasonal anomalies. The predictions cover all four seasons and are initialised one month before the start of each season. The model was trained on climate model output from CMIP6 and tested using ERA5 reanalysis data. We analyse the method's performance in predicting interannual anomalies beyond the climate change-induced trend. We also test the proposed methodology in a regional context with a use case focused on Europe. While climate change trends dominate the skill of temperature predictions, the method shows additional skill over the climatological forecast in regions influenced by known teleconnections. We reach similar conclusions based on the validation of precipitation predictions. Despite underperforming SEAS5 across most of the tropics, our model offers added value in numerous extratropical inland regions. This work demonstrates the effectiveness of training generative models on climate model output for seasonal predictions, providing skilful predictions beyond the induced climate change trend at time scales and lead times relevant for user applications.
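
A minimal sketch of the two general ingredients, variational inference wrapped around a transformer encoder, is given below. The patch shapes, the Gaussian latent, and the simple decoder are assumptions made for illustration; they do not reproduce the paper's architecture or its predictor fields.

```python
# Illustrative VAE-with-transformer sketch for seasonal anomaly fields;
# all shapes and layers are assumptions, not the paper's architecture.
import torch
import torch.nn as nn

class SeasonalVAE(nn.Module):
    def __init__(self, n_tokens=64, d=128, z_dim=32):
        super().__init__()
        enc = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=4)
        self.to_mu = nn.Linear(d, z_dim)
        self.to_logvar = nn.Linear(d, z_dim)
        self.decoder = nn.Sequential(nn.Linear(z_dim, d), nn.ReLU(),
                                     nn.Linear(d, n_tokens * d))
        self.n_tokens, self.d = n_tokens, d

    def forward(self, x):                      # x: (batch, tokens, d) patches
        h = self.encoder(x).mean(dim=1)        # pooled predictor context
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparam. trick
        recon = self.decoder(z).view(-1, self.n_tokens, self.d)
        kl = -0.5 * (1 + logvar - mu**2 - logvar.exp()).sum(-1).mean()
        return recon, kl

model = SeasonalVAE()
x = torch.randn(8, 64, 128)                    # stand-in anomaly-field patches
recon, kl = model(x)
loss = ((recon - x)**2).mean() + 1e-3 * kl     # ELBO-style training objective
```

Sampling the latent repeatedly at prediction time yields an ensemble of anomaly fields, which is how a generative model of this kind can serve as a probabilistic seasonal forecast.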


Nearest Neighbour Equilibrium Clustering

arXiv.org Machine Learning

A novel and intuitive nearest-neighbours-based clustering algorithm is introduced, in which a cluster is defined by an equilibrium condition that balances its size and cohesiveness. The formulation of the equilibrium condition allows the strength of each point's alignment to a cluster to be quantified, and these cluster alignment strengths lead naturally to a model selection criterion which renders the proposed approach fully automatable. The algorithm is simple to implement and computationally efficient, and produces clustering solutions of extremely high quality in comparison with relevant benchmarks from the literature. R code to implement the approach is available from https://github.com/DavidHofmeyr/

I. Introduction

Clustering, or cluster analysis, is the task of partitioning a set of data into groups, or clusters, which are more homogeneous than the data as a whole. Clustering is one of the fundamental data analytic tasks and forms an integral component of exploratory data analysis. It is also of growing relevance, as data are increasingly collected or generated by automated processes where very little prior knowledge is available, making exploratory methods a necessity. In the classical clustering problem there is no explicit information about how the data should be grouped, and various interpretations of how clusters of points may be defined have led to the development of a very large number of methods for identifying them. Almost universally, however, clusters are determined from the geometric properties of the data: pairs of points which are near one another are typically seen as likely to be in the same cluster, and pairs which are distant as more likely to be in different clusters.
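
The paper defines the equilibrium condition precisely; the toy Python sketch below only illustrates the general flavour of balancing cluster size against cohesiveness using nearest-neighbour information. The growth rule, the threshold alpha, and the synthetic two-blob data are all invented for illustration and do not reproduce the published algorithm, whose reference implementation is in R at the URL above.

```python
# Toy sketch: grow a cluster from a seed and stop when the nearest outside
# point is far relative to the cluster's internal nearest-neighbour scale.
# This is NOT the paper's equilibrium condition, only an analogue of it.
import numpy as np

def grow_cluster(X, seed, alpha=2.0):
    D = np.linalg.norm(X[:, None] - X[None], axis=-1)   # pairwise distances
    members = [seed]
    while len(members) < len(X):
        outside = [i for i in range(len(X)) if i not in members]
        d_out = D[np.ix_(outside, members)].min(axis=1)  # gaps to the cluster
        j = int(d_out.argmin())
        cand, d_cand = outside[j], d_out[j]
        if len(members) > 1:
            # Cohesiveness: mean nearest-neighbour distance inside the cluster.
            sub = D[np.ix_(members, members)] + 1e9 * np.eye(len(members))
            scale = sub.min(axis=1).mean()
            if d_cand > alpha * scale:   # size/cohesion balance reached: stop
                break
        members.append(cand)
    return members

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(3, 0.3, (30, 2))])
print(sorted(grow_cluster(X, seed=0)))   # recovers (roughly) the first blob
```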


Nested Stochastic Gradient Descent for (Generalized) Sinkhorn Distance-Regularized Distributionally Robust Optimization

arXiv.org Machine Learning

Distributionally robust optimization (DRO) is a powerful technique for training models that are robust to data distribution shift. This paper aims to solve regularized nonconvex DRO problems in which the uncertainty set is modeled by a so-called generalized Sinkhorn distance and the loss function is nonconvex and possibly unbounded. Such a distance makes it possible to model uncertainty over distributions with different probability supports and divergence functions. For this class of regularized DRO problems, we derive a novel dual formulation taking the form of a nested stochastic program, in which the dual variable depends on the data sample. To solve the dual problem, we provide theoretical evidence supporting the design of a nested stochastic gradient descent (SGD) algorithm, which leverages stochastic approximation to estimate the nested stochastic gradients. We study the convergence rate of nested SGD and establish polynomial iteration and sample complexities that are independent of the data size and parameter dimension, indicating its potential for solving large-scale DRO problems. We conduct numerical experiments to demonstrate the efficiency and robustness of the proposed algorithm.
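
To make the nesting concrete, the sketch below performs nested SGD steps for an entropic-regularized dual of the assumed form min_x lam * E_xi[log E_{zeta|xi} exp(loss(x, zeta)/lam)], where the inner conditional expectation is itself estimated by Monte Carlo inside each outer step. The toy loss, the Gaussian perturbation model for zeta given xi, and all constants are assumptions; the paper's exact dual and sampling scheme differ.

```python
# Sketch of one nested SGD step for a Sinkhorn-type DRO dual (assumed form).
import torch

def loss(x, z):
    # Toy nonconvex per-sample loss (illustrative placeholder).
    u = z @ x
    return torch.sin(u) ** 2 + 0.1 * u ** 2

def nested_sgd_step(x, xi_batch, lam=1.0, sigma=0.1, inner=16, lr=1e-2):
    # Inner samples zeta | xi: Gaussian perturbations of each data point.
    zeta = xi_batch.unsqueeze(1) + sigma * torch.randn(
        xi_batch.shape[0], inner, xi_batch.shape[1])
    inner_vals = torch.exp(loss(x, zeta) / lam)           # (batch, inner)
    obj = lam * torch.log(inner_vals.mean(dim=1)).mean()  # outer average
    grad, = torch.autograd.grad(obj, x)
    return (x - lr * grad).detach().requires_grad_(True)

x = torch.randn(5, requires_grad=True)    # model parameters
xi = torch.randn(64, 5)                   # data mini-batch
for _ in range(100):
    x = nested_sgd_step(x, xi)
```

The point of the nesting is visible in the shapes: every outer sample xi carries its own inner mini-batch of zeta draws, matching a dual variable that depends on the data sample.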


Unveiling the Power of Uncertainty: A Journey into Bayesian Neural Networks for Stellar Dating

arXiv.org Machine Learning

Context: Astronomy and astrophysics demand rigorous handling of uncertainties to ensure the credibility of outcomes. The growing integration of artificial intelligence offers a novel avenue to address this necessity. This convergence presents an opportunity to create advanced models capable of quantifying diverse sources of uncertainty and automating complex data relationship exploration. Aims: We introduce a hierarchical Bayesian architecture whose probabilistic relationships are modeled by neural networks, designed to forecast stellar attributes such as mass, radius, and age (our main target). This architecture handles both observational uncertainties stemming from measurements and epistemic uncertainties inherent in the predictive model itself. As a result, our system generates distributions that encapsulate the potential range of values for our predictions, providing a comprehensive understanding of their variability and robustness. Methods: Our focus is on dating main sequence stars using a technique known as Chemical Clocks, which serves as both our primary astronomical challenge and a model prototype. In this work, we use hierarchical architectures to account for correlations between stellar parameters and optimize information extraction from our dataset. We also employ Bayesian neural networks for their versatility and flexibility in capturing complex data relationships. Results: By integrating our machine learning algorithm into a Bayesian framework, we propagate errors consistently and treat uncertainty effectively, resulting in predictions characterized by broader uncertainty margins. This approach facilitates more conservative estimates in stellar dating. Our architecture achieves age predictions with a mean absolute error of less than 1 Ga for the stars in the test dataset.
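
The paper's hierarchical Bayesian architecture is not reproduced here, but the following sketch shows one standard way to combine the two uncertainty sources it distinguishes: MC dropout for epistemic uncertainty plus a predicted-variance head for observational noise. The input features (stand-ins for quantities like Teff, [Fe/H], or chemical-clock abundance ratios) and all sizes are placeholders.

```python
# Minimal uncertainty-aware stellar-age regressor: MC dropout (epistemic)
# plus a heteroscedastic variance head (observational). Illustrative only.
import torch
import torch.nn as nn

class AgeNet(nn.Module):
    def __init__(self, n_in=6):                 # placeholder feature count
        super().__init__()
        self.body = nn.Sequential(nn.Linear(n_in, 64), nn.ReLU(),
                                  nn.Dropout(0.1),
                                  nn.Linear(64, 64), nn.ReLU(),
                                  nn.Dropout(0.1))
        self.mu = nn.Linear(64, 1)              # predicted age (Ga)
        self.log_var = nn.Linear(64, 1)         # predicted observational var.

    def forward(self, x):
        h = self.body(x)
        return self.mu(h), self.log_var(h)

def predictive(model, x, samples=100):
    model.train()                               # keep dropout active
    mus, vs = zip(*(model(x) for _ in range(samples)))
    mu = torch.stack(mus)                       # (samples, n, 1)
    epistemic = mu.var(0)                       # spread across dropout masks
    aleatoric = torch.stack(vs).exp().mean(0)   # mean predicted noise var.
    return mu.mean(0), (epistemic + aleatoric).sqrt()

mean, std = predictive(AgeNet(), torch.randn(32, 6))
```

Training such a network with the Gaussian negative log-likelihood (rather than plain MSE) is what lets the variance head absorb measurement noise, leaving the dropout spread to reflect model uncertainty; this mirrors the broader, more conservative uncertainty margins the paper reports.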


A Comprehensive Benchmark for RNA 3D Structure-Function Modeling

arXiv.org Machine Learning

The RNA structure-function relationship has recently garnered significant attention within the deep learning community, promising to grow in importance as nucleic acid structure models advance. However, the absence of standardized and accessible benchmarks for deep learning on RNA 3D structures has impeded the development of models for RNA functional characteristics. In this work, we introduce a set of seven benchmarking datasets for RNA structure-function prediction, designed to address this gap. Our library builds on the established Python library rnaglib and offers easy data distribution and encoding, dataset splitters, and evaluation methods, providing a convenient all-in-one framework for comparing models. Datasets are implemented in a fully modular and reproducible manner, facilitating community contributions and customization. Finally, we provide initial baseline results for all tasks using a graph neural network. Source code: https://github.com/cgoliver/rnaglib Documentation: https://rnaglib.org
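
rnaglib's own loaders and task API are documented at the links above and are not reproduced here. As a generic illustration of the kind of graph neural network baseline reported, the following PyTorch Geometric sketch classifies a toy nucleotide graph; the node features, backbone edges, and label dimension are stand-ins.

```python
# Generic GNN baseline sketch (not rnaglib's API): a two-layer GCN with
# mean pooling for a graph-level RNA property.
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, global_mean_pool

class RNAGCN(torch.nn.Module):
    def __init__(self, in_dim=16, hidden=64, n_classes=2):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden)
        self.conv2 = GCNConv(hidden, hidden)
        self.head = torch.nn.Linear(hidden, n_classes)

    def forward(self, x, edge_index, batch):
        h = F.relu(self.conv1(x, edge_index))
        h = F.relu(self.conv2(h, edge_index))
        return self.head(global_mean_pool(h, batch))   # graph-level logits

# One toy RNA graph: 10 nucleotides, random features, backbone edges.
x = torch.randn(10, 16)
edge_index = torch.tensor([[i for i in range(9)],
                           [i + 1 for i in range(9)]])
batch = torch.zeros(10, dtype=torch.long)
logits = RNAGCN()(x, edge_index, batch)
```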


tempdisagg: A Python Framework for Temporal Disaggregation of Time Series Data

arXiv.org Machine Learning

tempdisagg is a modern, extensible, and production-ready Python framework for temporal disaggregation of time series data. It transforms low-frequency aggregates into consistent high-frequency estimates using a wide array of econometric techniques, including Chow-Lin, Denton, Litterman, Fernandez, and uniform interpolation, as well as enhanced variants with automated estimation of key parameters such as the autocorrelation coefficient rho. The package introduces features beyond the classical methods, including robust ensemble modeling via non-negative least squares optimization, post-estimation correction of negative values under multiple aggregation rules, and optional regression-based imputation of missing values through a dedicated Retropolarizer module. Architecturally, it follows a modular design inspired by scikit-learn, offering a clean API for validation, modeling, visualization, and result interpretation.
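
As a sketch of the underlying mathematics rather than the package's API, the NumPy snippet below solves the additive first-difference Denton problem: choose high-frequency values that track an indicator series as smoothly as possible while summing exactly to the low-frequency totals. The annual totals and quarterly indicator are made-up numbers.

```python
# Additive first-difference Denton disaggregation via its KKT system.
# Illustrates the technique tempdisagg implements; not the package's API.
import numpy as np

def denton(y, s, freq=4):
    """Disaggregate low-frequency totals y using high-frequency indicator s."""
    n = len(s)
    C = np.kron(np.eye(len(y)), np.ones(freq))      # aggregation: C x = y
    D = (np.eye(n) - np.eye(n, k=-1))[1:]           # first-difference matrix
    Q = 2 * D.T @ D
    # KKT system for: min ||D(x - s)||^2  subject to  C x = y
    K = np.block([[Q, C.T], [C, np.zeros((len(y), len(y)))]])
    rhs = np.concatenate([Q @ s, y])
    return np.linalg.solve(K, rhs)[:n]

y = np.array([100.0, 120.0])                        # two annual totals
s = np.array([20, 24, 26, 28, 27, 29, 31, 35.0])    # quarterly indicator
x = denton(y, s)
print(x, x[:4].sum(), x[4:].sum())                  # quarterly sums match y
```

Chow-Lin, Litterman, and Fernandez differ mainly in replacing the smoothness penalty with a regression model whose residual covariance (governed by parameters such as rho) is estimated from the data, which is the estimation the package automates.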


Feature-Enhanced Machine Learning for All-Cause Mortality Prediction in Healthcare Data

arXiv.org Machine Learning

Accurate patient mortality prediction enables effective risk stratification, leading to personalized treatment plans and improved patient outcomes. However, predicting mortality in healthcare remains a significant challenge, with existing studies often focusing on specific diseases or limited predictor sets. This study evaluates machine learning models for all-cause in-hospital mortality prediction using the MIMIC-III database, employing a comprehensive feature engineering approach. Guided by clinical expertise and literature, we extracted key features such as vital signs (e.g., heart rate, blood pressure), laboratory results (e.g., creatinine, glucose), and demographic information. The Random Forest model achieved the highest performance with an AUC of 0.94, significantly outperforming other machine learning and deep learning approaches. This demonstrates Random Forest's robustness in handling high-dimensional, noisy clinical data and its potential for developing effective clinical decision support tools. Our findings highlight the importance of careful feature engineering for accurate mortality prediction. We conclude by discussing implications for clinical adoption and propose future directions, including enhancing model robustness and tailoring prediction models for specific diseases.
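
A minimal sketch of the modeling step with scikit-learn is shown below. Since MIMIC-III requires credentialed access, the feature matrix and mortality label here are synthetic, and the column names are merely illustrative of the vital-sign and laboratory features described.

```python
# Random Forest mortality-prediction sketch on synthetic stand-in data.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "heart_rate": rng.normal(85, 15, 2000),
    "systolic_bp": rng.normal(120, 20, 2000),
    "creatinine": rng.lognormal(0, 0.4, 2000),
    "glucose": rng.normal(130, 40, 2000),
    "age": rng.integers(18, 90, 2000),
})
# Synthetic in-hospital mortality label loosely tied to the features.
risk = 0.02 * X["age"] + 0.5 * X["creatinine"] - 0.01 * X["systolic_bp"]
y = (risk + rng.normal(0, 1, 2000) > risk.mean()).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```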


Explainable Boosting Machine for Predicting Claim Severity and Frequency in Car Insurance

arXiv.org Machine Learning

In a context of constantly increasing competition and heightened regulatory pressure, accuracy and actuarial precision, as well as transparency and understanding of the tariff, are key issues in non-life insurance. The generalized linear models (GLMs) traditionally used result in a multiplicative tariff that favors interpretability. With the rapid development of machine learning and deep learning techniques, actuaries and the rest of the insurance industry have adopted these techniques widely; however, they need to be paired with interpretability techniques. In this paper, we introduce an Explainable Boosting Machine (EBM) model that combines intrinsically interpretable characteristics with high prediction performance. This approach is described as a glass-box model and relies on a Generalized Additive Model (GAM) and a cyclic gradient boosting algorithm. It accounts for univariate and pairwise interaction effects between features and naturally provides explanations for them. We implement this approach on car insurance frequency and severity data and extensively compare its performance with classical competitors: a GLM, a GAM, a CART model, and an Extreme Gradient Boosting (XGB) algorithm. Finally, we examine the interpretability of these models to capture the main determinants of claim costs.
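
EBMs are available in the open-source interpret package; the snippet below fits one to simulated severity data with pairwise interactions enabled. The portfolio variables and the severity-generating process are invented for illustration, and a real tariff study would model frequency and severity separately with appropriate distributions.

```python
# EBM sketch with the interpret package on simulated claim-severity data.
import numpy as np
import pandas as pd
from interpret.glassbox import ExplainableBoostingRegressor

rng = np.random.default_rng(1)
n = 5000
X = pd.DataFrame({
    "driver_age": rng.integers(18, 85, n),
    "vehicle_power": rng.integers(4, 15, n),
    "bonus_malus": rng.uniform(50, 150, n),
})
# Synthetic severity with a nonlinear (U-shaped) age effect.
sev = (800 + 30 * (X["driver_age"] - 45) ** 2 / 45
       + 50 * X["vehicle_power"] + rng.gamma(2, 200, n))

ebm = ExplainableBoostingRegressor(interactions=5)   # allow pairwise terms
ebm.fit(X, sev)
pred = ebm.predict(X.head())
# ebm.explain_global() exposes each term's shape function, which is what
# makes the fitted tariff directly inspectable, term by term.
```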


Simulation-informed deep learning for enhanced SWOT observations of fine-scale ocean dynamics

arXiv.org Machine Learning

Oceanic processes at fine scales are crucial yet difficult to observe accurately due to limitations in satellite and in-situ measurements. The Surface Water and Ocean Topography (SWOT) mission provides high-resolution Sea Surface Height (SSH) data, though noise patterns often obscure fine-scale structures. Current methods struggle with noisy data or require extensive supervised training, limiting their effectiveness on real-world observations. We introduce SIMPGEN (Simulation-Informed Metric and Prior for Generative Ensemble Networks), an unsupervised adversarial learning framework combining real SWOT observations with simulated reference data. SIMPGEN leverages wavelet-informed neural metrics to distinguish noisy from clean fields, guiding realistic SSH reconstructions. Applied to SWOT data, SIMPGEN effectively removes noise, preserving fine-scale features better than existing neural methods. This robust, unsupervised approach not only improves the interpretation of SWOT SSH data but also shows strong potential for broader oceanographic applications, including data assimilation and super-resolution.
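
The sketch below is a heavily simplified rendering of that adversarial setup: a convolutional denoiser trained against a critic that has only seen simulated clean fields, with an observation-fidelity penalty keeping the output anchored to the input. The wavelet-informed metrics and SIMPGEN's actual architecture are omitted, and all fields are synthetic stand-ins.

```python
# Simplified simulation-informed adversarial denoising sketch (not SIMPGEN).
import torch
import torch.nn as nn

G = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
                  nn.Conv2d(32, 1, 3, padding=1))           # denoiser
C = nn.Sequential(nn.Conv2d(1, 32, 3, stride=2), nn.ReLU(),
                  nn.Flatten(), nn.LazyLinear(1))           # critic
opt_g = torch.optim.Adam(G.parameters(), 1e-4)
opt_c = torch.optim.Adam(C.parameters(), 1e-4)

for step in range(200):
    clean = torch.randn(16, 1, 32, 32)             # simulated reference SSH
    noisy = clean + 0.5 * torch.randn_like(clean)  # SWOT-like noise stand-in

    # Critic step: score simulated clean fields high, denoised outputs low.
    c_loss = -(C(clean).mean() - C(G(noisy).detach()).mean())
    opt_c.zero_grad(); c_loss.backward(); opt_c.step()

    # Generator step: fool the critic while staying close to the observation.
    den = G(noisy)
    g_loss = -C(den).mean() + 0.1 * (den - noisy).pow(2).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```

Because the critic only ever needs simulated clean fields and real noisy ones, no paired clean/noisy observations are required, which is what makes this style of training unsupervised.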