Performance Analysis
Tahakom LLM Guidelines and Recipes: From Pre-training Data to an Arabic LLM
AlOtaibi, Areej, Alyahya, Lina, Alshabanah, Raghad, Alfawzan, Shahad, Alarefei, Shuruq, Alsabti, Reem, Alsubaie, Nouf, Alhuzaymi, Abdulaziz, Alkhelb, Lujain, Alsayari, Majd, Alahmed, Waad, Talabay, Omar, Alowibdi, Jalal, Alelyani, Salem, Bibi, Adel
Large Language Models (LLMs) have significantly advanced the field of natural language processing, enhancing capabilities in both language understanding and generation across diverse domains. However, developing LLMs for Arabic presents unique challenges. This paper explores these challenges by focusing on critical aspects such as data curation, tokenizer design, and evaluation. We detail our approach to the collection and filtration of Arabic pre-training datasets, assess the impact of various tokenizer designs on model performance, and examine the limitations of existing Arabic evaluation frameworks, for which we propose a systematic corrective methodology. To promote transparency and facilitate collaborative development, we share our data and methodologies, contributing to the advancement of language modeling, particularly for the Arabic language.
Evidence Without Injustice: A New Counterfactual Test for Fair Algorithms
Loi, Michele, Di Bello, Marcello, Cangiotti, Nicolò
The growing philosophical literature on algorithmic fairness has examined statistical criteria such as equalized odds and calibration, causal and counterfactual approaches, and the role of structural and compounding injustices. Yet an important dimension has been overlooked: whether the evidential value of an algorithmic output itself depends on structural injustice. We contrast a predictive policing algorithm, which relies on historical crime data, with a camera-based system that records ongoing offenses, where both are designed to guide police deployment. In evaluating the moral acceptability of acting on a piece of evidence, we must ask not only whether the evidence is probative in the actual world, but also whether it would remain probative in nearby worlds without the relevant injustices. The predictive policing algorithm fails this test, but the camera-based system passes it. When evidence fails the test, it is morally problematic to use it punitively, more so than evidence that passes the test.
A Novel Multi-branch ConvNeXt Architecture for Identifying Subtle Pathological Features in CT Scans
Perera, Irash, Thayasivam, Uthayasanker
Intelligent analysis of medical imaging plays a crucial role in assisting clinical diagnosis, especially for identifying subtle pathological features. This paper introduces a novel multi-branch ConvNeXt architecture designed specifically for the nuanced challenges of medical image analysis. While applied here to the specific problem of COVID-19 diagnosis, the methodology offers a generalizable framework for classifying a wide range of pathologies from CT scans. The proposed model incorporates a rigorous end-to-end pipeline, from meticulous data preprocessing and augmentation to a disciplined two-phase training strategy that leverages transfer learning effectively. The architecture uniquely integrates features extracted from three parallel branches: Global Average Pooling, Global Max Pooling, and a new Attention-weighted Pooling mechanism. The model was trained and validated on a combined dataset of 2,609 CT slices derived from two distinct datasets. Experimental results demonstrate superior performance on the validation set, with a final ROC-AUC of 0.9937, a validation accuracy of 0.9757, and an F1-score of 0.9825 for COVID-19 cases, outperforming all previously reported models on this dataset. These findings indicate that a modern, multi-branch architecture, coupled with careful data handling, can achieve performance comparable to or exceeding contemporary state-of-the-art models, thereby proving the efficacy of advanced deep learning techniques for robust medical diagnostics.
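The three pooling branches described above can be illustrated with a minimal NumPy sketch. The abstract does not specify how the attention weights are computed, so the softmax over a learned scoring vector used here (`attn_weights`) is an assumption, as are the function name and the concatenation of the three branch outputs.

```python
import numpy as np

def multi_branch_pool(features, attn_weights):
    """Pool an (H, W, C) feature map with three parallel branches.

    attn_weights is a hypothetical (C,) scoring vector; the paper's actual
    attention mechanism is not described in the abstract.
    """
    h, w, c = features.shape
    flat = features.reshape(h * w, c)          # one row per spatial position

    gap = flat.mean(axis=0)                    # Global Average Pooling
    gmp = flat.max(axis=0)                     # Global Max Pooling

    # Attention-weighted pooling (assumed form): softmax over positions.
    scores = flat @ attn_weights
    scores = scores - scores.max()             # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()
    attn = alpha @ flat

    # Assumed fusion: concatenate the three branch outputs.
    return np.concatenate([gap, gmp, attn])

feature_map = np.arange(24, dtype=float).reshape(2, 3, 4)
pooled = multi_branch_pool(feature_map, np.zeros(4))
print(pooled.shape)
```

With an all-zero scoring vector the attention weights are uniform, so the attention branch reduces to average pooling; a trained scoring vector would instead emphasize the positions it scores highly.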
Transfer Learning on Edge Connecting Probability Estimation under Graphon Model
Wang, Yuyao, Cheng, Yu-Hung, Mukherjee, Debarghya, Cheng, Huimin
Graphon models provide a flexible nonparametric framework for estimating latent connectivity probabilities in networks, enabling a range of downstream applications such as link prediction and data augmentation. However, accurate graphon estimation typically requires a large graph, whereas in practice, one often only observes a small-sized network. One approach to addressing this issue is to adopt a transfer learning framework, which aims to improve estimation in a small target graph by leveraging structural information from a larger, related source graph. In this paper, we propose a novel method, namely GTRANS, a transfer learning framework that integrates neighborhood smoothing and Gromov-Wasserstein optimal transport to align and transfer structural patterns between graphs. To prevent negative transfer, GTRANS includes an adaptive debiasing mechanism that identifies and corrects for target-specific deviations via residual smoothing. We provide theoretical guarantees on the stability of the estimated alignment matrix and demonstrate the effectiveness of GTRANS in improving the accuracy of target graph estimation through extensive synthetic and real data experiments. These improvements translate directly to enhanced performance in downstream applications, such as the graph classification task and the link prediction task.
EuroSpeech: A Multilingual Speech Corpus
Pfisterer, Samuel, Grötschla, Florian, Lanzendörfer, Luca A., Yan, Florian, Wattenhofer, Roger
Recent progress in speech processing has highlighted that high-quality performance across languages requires substantial training data for each individual language. While existing multilingual datasets cover many languages, they often contain insufficient data for most languages. Thus, trained models perform poorly on the majority of the supported languages. Our work addresses this challenge by introducing a scalable pipeline for constructing speech datasets from parliamentary recordings. The proposed pipeline includes robust components for media retrieval and a two-stage alignment algorithm designed to handle non-verbatim transcripts and long-form audio. Applying this pipeline to recordings from 22 European parliaments, we extract over 61k hours of aligned speech segments, achieving substantial per-language coverage with 19 languages exceeding 1k hours and 22 languages exceeding 500 hours of high-quality speech data. We obtain an average 41.8% reduction in word error rates over baselines when finetuning an existing ASR model on our dataset, demonstrating the usefulness of our approach.
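For readers unfamiliar with the metric, the following sketch computes word error rate as word-level edit distance divided by reference length, and a relative reduction between two WER values. Reading the reported figure as a relative reduction is an assumption, and the function names are illustrative rather than taken from the paper.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the processed ref prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)

def relative_reduction(baseline_wer, finetuned_wer):
    """Fraction by which the fine-tuned WER improves on the baseline WER."""
    return (baseline_wer - finetuned_wer) / baseline_wer

print(word_error_rate("the cat sat on mat", "the cat on the mat"))
```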
Explainable artificial intelligence model predicting the risk of all-cause mortality in patients with type 2 diabetes mellitus
Vershinina, Olga, Sabbatinelli, Jacopo, Bonfigli, Anna Rita, Colombaretti, Dalila, Giuliani, Angelica, Krivonosov, Mikhail, Trukhanov, Arseniy, Franceschi, Claudio, Ivanchenko, Mikhail, Olivieri, Fabiola
Objective. Type 2 diabetes mellitus (T2DM) is a highly prevalent non-communicable chronic disease that substantially reduces life expectancy. Accurate estimation of all-cause mortality risk in T2DM patients is crucial for personalizing and optimizing treatment strategies. Research Design and Methods. This study analyzed a cohort of 554 patients (aged 40-87 years) with diagnosed T2DM over a maximum follow-up period of 16.8 years, during which 202 patients (36%) died. Key survival-associated features were identified, and multiple machine learning (ML) models were trained and validated to predict all-cause mortality risk. To improve model interpretability, Shapley additive explanations (SHAP) was applied to the best-performing model. Results. The extra survival trees (EST) model, incorporating ten key features, demonstrated the best predictive performance. The model achieved a C-statistic of 0.776, with the area under the receiver operating characteristic curve (AUC) values of 0.86, 0.80, 0.841, and 0.826 for 5-, 10-, 15-, and 16.8-year all-cause mortality predictions, respectively. The SHAP approach was employed to interpret the model's individual decision-making processes. Conclusions. The developed model exhibited strong predictive performance for mortality risk assessment. Its clinically interpretable outputs enable potential bedside application, improving the identification of high-risk patients and supporting timely treatment optimization.
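The C-statistic reported above is a concordance measure for censored survival data. A minimal sketch of Harrell's concordance index follows, assuming that a higher risk score means earlier expected death and ignoring pairs with tied event times. The abstract does not state which estimator or software the authors used, so this illustrates the metric only.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: share of comparable pairs ordered correctly by risk.

    A pair (i, j) is comparable when subject i had an observed event
    strictly before subject j's follow-up time. Tied risk scores count
    as half-concordant. Tied times are skipped (a simplification).
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Toy data: follow-up years, event indicator (1 = died), predicted risk.
print(concordance_index([2, 4, 6, 8], [1, 1, 0, 1], [0.9, 0.7, 0.5, 0.1]))
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering.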
Complexity Dependent Error Rates for Physics-informed Statistical Learning via the Small-ball Method
Physics-informed statistical learning (PISL) integrates empirical data with physical knowledge to enhance the statistical performance of estimators. While PISL methods are widely used in practice, a comprehensive theoretical understanding of how informed regularization affects statistical properties is still missing. Specifically, two fundamental questions have yet to be fully addressed: (1) what is the trade-off between considering soft penalties versus hard constraints, and (2) what is the statistical gain of incorporating physical knowledge compared to purely data-driven empirical error minimisation. In this paper, we address these questions for PISL in convex classes of functions under physical knowledge expressed as linear equations by developing appropriate complexity dependent error rates based on the small-ball method. We show that, under suitable assumptions, (1) the error rates of physics-informed estimators are comparable to those of hard constrained empirical error minimisers, differing only by constant terms, and that (2) informed penalization can effectively reduce model complexity, akin to dimensionality reduction, thereby improving learning performance. This work establishes a theoretical framework for evaluating the statistical properties of physics-informed estimators in convex classes of functions, contributing to closing the gap between statistical theory and practical PISL, with potential applications to cases not yet explored in the literature.
MAGIC-Flow: Multiscale Adaptive Conditional Flows for Generation and Interpretable Classification
Caldera, Luca, Bottacini, Giacomo, Cavinato, Lara
Generative modeling has emerged as a powerful paradigm for representation learning, but its direct applicability to challenging fields like medical imaging remains limited: mere generation, without task alignment, fails to provide a robust foundation for clinical use. We propose MAGIC-Flow, a conditional multiscale normalizing flow architecture that performs generation and classification within a single modular framework. The model is built as a hierarchy of invertible and differentiable bijections, where the Jacobian determinant factorizes across sub-transformations. We show how this ensures exact likelihood computation and stable optimization, while invertibility enables explicit visualization of sample likelihoods, providing an interpretable lens into the model's reasoning. By conditioning on class labels, MAGIC-Flow supports controllable sample synthesis and principled class-probability estimation, effectively aiding both generative and discriminative objectives. We evaluate MAGIC-Flow against top baselines using metrics for similarity, fidelity, and diversity. Across multiple datasets, it addresses generation and classification under scanner noise, and modality-specific synthesis and identification. Results show MAGIC-Flow creates realistic, diverse samples and improves classification. MAGIC-Flow is an effective strategy for generation and classification in data-limited domains, with direct benefits for privacy-preserving augmentation, robust generalization, and trustworthy medical AI.
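The factorization of the Jacobian determinant across sub-transformations can be seen in a toy example. The sketch below stacks element-wise affine bijections and sums their log-determinants to obtain an exact log-likelihood under a standard normal base density. It is neither conditional nor multiscale, and the layer form and all names are assumptions for illustration, not the MAGIC-Flow architecture.

```python
import numpy as np

def affine_forward(x, log_scale, shift):
    """Element-wise bijection z = x * exp(log_scale) + shift."""
    z = x * np.exp(log_scale) + shift
    log_det = np.sum(log_scale)   # log|det J| of a diagonal Jacobian
    return z, log_det

def affine_inverse(z, log_scale, shift):
    """Exact inverse of affine_forward."""
    return (z - shift) * np.exp(-log_scale)

def flow_log_likelihood(x, layers):
    """log p(x) = log N(z; 0, I) + sum of per-layer log-determinants."""
    z = x
    total_log_det = 0.0
    for log_scale, shift in layers:
        z, log_det = affine_forward(z, log_scale, shift)
        total_log_det += log_det  # determinants multiply, so logs add
    log_base = -0.5 * np.sum(z ** 2) - 0.5 * z.size * np.log(2 * np.pi)
    return log_base + total_log_det, z

layers = [
    (np.array([0.1, -0.2]), np.array([1.0, 0.0])),
    (np.array([0.3, 0.0]), np.array([-1.0, 2.0])),
]
x = np.array([0.5, -1.5])
log_lik, z = flow_log_likelihood(x, layers)

# Invertibility: running the layers backwards recovers the input.
x_rec = z
for log_scale, shift in reversed(layers):
    x_rec = affine_inverse(x_rec, log_scale, shift)
print(log_lik, np.allclose(x_rec, x))
```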
Frequentist Validity of Epistemic Uncertainty Estimators
Decomposing prediction uncertainty into its aleatoric (irreducible) and epistemic (reducible) components is critical for the development and deployment of machine learning systems. A popular, principled measure for epistemic uncertainty is the mutual information between the response variable and model parameters. However, evaluating this measure requires access to the posterior distribution of the model parameters, which is challenging to compute. In view of this, we introduce a frequentist measure of epistemic uncertainty based on the bootstrap. Our main theoretical contribution is a novel asymptotic expansion that reveals that our proposed (frequentist) measure and the (Bayesian) mutual information are asymptotically equivalent. This provides frequentist interpretations to mutual information and new computational strategies for approximating it. Moreover, we link our proposed approach to the widely-used heuristic approach of deep ensembles, giving added perspective on their practical success.
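As a rough illustration of the quantity involved, the sketch below computes the ensemble-style estimate of mutual information for one input: the entropy of the averaged prediction minus the average entropy of the individual predictions. The members could come from bootstrap resamples or a deep ensemble. The abstract does not give the exact form of the proposed frequentist measure, so this is a commonly used stand-in, not the authors' estimator, and all names are illustrative.

```python
import numpy as np

def entropy(probs, axis=-1):
    """Shannon entropy in nats, clipped to avoid log(0)."""
    p = np.clip(probs, 1e-12, 1.0)
    return -np.sum(p * np.log(p), axis=axis)

def uncertainty_decomposition(member_probs):
    """Split predictive uncertainty for a single input.

    member_probs: (M, K) class probabilities from M ensemble or
    bootstrap members (an assumed input format).
    Returns (total, aleatoric, epistemic), where epistemic is the
    ensemble estimate of mutual information.
    """
    member_probs = np.asarray(member_probs, dtype=float)
    total = entropy(member_probs.mean(axis=0))   # entropy of the mean
    aleatoric = entropy(member_probs).mean()     # mean of the entropies
    return total, aleatoric, total - aleatoric

# Members that disagree confidently: uncertainty is mostly epistemic.
print(uncertainty_decomposition([[0.95, 0.05], [0.05, 0.95]]))
# Members that agree on an uncertain answer: mostly aleatoric.
print(uncertainty_decomposition([[0.5, 0.5], [0.5, 0.5]]))
```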
Fast Non-Log-Concave Sampling under Nonconvex Equality and Inequality Constraints with Landing
Jeon, Kijung, Muehlebach, Michael, Tao, Molei
Sampling from constrained statistical distributions is a fundamental task in various fields including Bayesian statistics, computational chemistry, and statistical physics. This article considers the cases where the constrained distribution is described by an unconstrained density, as well as additional equality and/or inequality constraints, which often make the constraint set nonconvex. Existing methods for a nonconvex constraint set $\Sigma \subset \mathbb{R}^d$ defined by equality or inequality constraints commonly rely on costly projection steps. Moreover, they cannot handle equality and inequality constraints simultaneously, as each method specializes in only one case. In addition, rigorous and quantitative convergence guarantees are often lacking. In this paper, we introduce Overdamped Langevin with LAnding (OLLA), a new framework for designing overdamped Langevin dynamics that accommodate both equality and inequality constraints. The proposed dynamics also deterministically correct trajectories along the normal direction of the constraint surface, thus obviating the need for explicit projections. We show that, under suitable regularity conditions on the target density and $\Sigma$, OLLA converges exponentially fast in $W_2$ distance to the constrained target density $\rho_\Sigma(x) \propto \exp(-f(x))\,d\sigma_\Sigma$. Lastly, through experiments, we demonstrate the efficiency of OLLA compared to projection-based constrained Langevin algorithms and their slack variable variants, highlighting its favorable computational cost and reasonable empirical mixing.