Performance Analysis
Kernel-based Equalized Odds: A Quantification of Accuracy-Fairness Trade-off in Fair Representation Learning
This paper introduces a novel kernel-based formulation of the Equalized Odds (EO) criterion, denoted as $EO_k$, for fair representation learning (FRL) in supervised settings. The central goal of FRL is to mitigate discrimination regarding a sensitive attribute $S$ while preserving prediction accuracy for the target variable $Y$. Our proposed criterion enables a rigorous and interpretable quantification of three core fairness objectives: independence (the prediction $\hat{Y}$ is independent of $S$), separation (also known as equalized odds; $\hat{Y}$ is independent of $S$ conditioned on the target $Y$), and calibration ($Y$ is independent of $S$ conditioned on the prediction $\hat{Y}$). We consider both unbiased ($Y$ is independent of $S$) and biased ($Y$ depends on $S$) conditions. In the former, we show that $EO_k$ satisfies both independence and separation; in the latter, it uniquely preserves predictive accuracy while lower bounding independence and calibration, thereby offering a unified analytical characterization of the tradeoffs among these fairness criteria. We further define the empirical counterpart, $\hat{EO}_k$, a kernel-based statistic that can be computed in quadratic time, with linear-time approximations also available. A concentration inequality for $\hat{EO}_k$ is derived, providing performance guarantees and error bounds, which serve as practical certificates of fairness compliance. While our focus is on theoretical development, the results lay essential groundwork for principled and provably fair algorithmic design in future empirical studies.
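To make the three criteria named in this abstract concrete, the following is a minimal sketch for binary $Y$, $\hat{Y}$, and $S$ that measures each criterion as a gap between empirical conditional rates. It is not the kernel statistic $\hat{EO}_k$, whose definition the abstract does not give; the function names and the gap-based measure are illustrative assumptions.

```python
import numpy as np

def _rate(event, cond):
    """Empirical P(event | cond); NaN when the conditioning set is empty."""
    return float(event[cond].mean()) if cond.any() else np.nan

def fairness_gaps(y, y_hat, s):
    """Gap-based checks of independence, separation, and calibration.

    Hypothetical helper for binary labels only; a gap of 0 means the
    criterion holds exactly on this sample.
    """
    y, y_hat, s = (np.asarray(a).astype(bool) for a in (y, y_hat, s))
    # Independence: P(y_hat=1 | s=1) vs P(y_hat=1 | s=0)
    independence = abs(_rate(y_hat, s) - _rate(y_hat, ~s))
    # Separation: same comparison within each stratum of the true label y
    separation = np.nanmax([
        abs(_rate(y_hat, s & (y == v)) - _rate(y_hat, ~s & (y == v)))
        for v in (False, True)
    ])
    # Calibration: P(y=1 | y_hat=v, s=1) vs P(y=1 | y_hat=v, s=0)
    calibration = np.nanmax([
        abs(_rate(y, s & (y_hat == v)) - _rate(y, ~s & (y_hat == v)))
        for v in (False, True)
    ])
    return {
        "independence": float(independence),
        "separation": float(separation),
        "calibration": float(calibration),
    }

# Biased toy data: Y depends on S, and the predictor is perfect (y_hat == y).
print(fairness_gaps(y=[0, 1, 1, 1], y_hat=[0, 1, 1, 1], s=[0, 0, 1, 1]))
```

On this toy data a perfect predictor has zero separation gap but a non-zero independence gap, which is the kind of tension between criteria the abstract describes for the biased case.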
Twin-Boot: Uncertainty-Aware Optimization via Online Two-Sample Bootstrapping
Standard gradient descent methods yield point estimates with no measure of confidence. This limitation is acute in overparameterized and low-data regimes, where models have many parameters relative to available data and can easily overfit. Bootstrapping is a classical statistical framework for uncertainty estimation based on resampling, but naively applying it to deep learning is impractical: it requires training many replicas, produces post-hoc estimates that cannot guide learning, and implicitly assumes comparable optima across runs - an assumption that fails in non-convex landscapes. We introduce Twin-Bootstrap Gradient Descent (Twin-Boot), a resampling-based training procedure that integrates uncertainty estimation into optimization. Two identical models are trained in parallel on independent bootstrap samples, and a periodic mean-reset keeps both trajectories in the same basin so that their divergence reflects local (within-basin) uncertainty. During training, we use this estimate to sample weights in an adaptive, data-driven way, providing regularization that favors flatter solutions. In deep neural networks and complex high-dimensional inverse problems, the approach improves calibration and generalization and yields interpretable uncertainty maps.
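The core loop described above (two twins on independent bootstrap samples, with a periodic mean-reset) can be sketched on a toy linear regression. This is only an illustration under stated assumptions, not the authors' implementation: the learning rate, reset period, the use of the twins' absolute difference as the spread estimate, and the function name `twin_boot_linear` are all hypothetical, and the uncertainty-driven weight sampling step is omitted.

```python
import numpy as np

def twin_boot_linear(X, y, steps=500, lr=0.05, reset_every=50, seed=0):
    """Toy twin-bootstrap gradient descent for least-squares regression.

    Returns the mean weights and a per-parameter spread measured at the
    last mean-reset (an assumed, simplistic uncertainty proxy).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # One independent bootstrap resample (with replacement) per twin.
    idx_a = rng.integers(0, n, size=n)
    idx_b = rng.integers(0, n, size=n)
    data = ((X[idx_a], y[idx_a]), (X[idx_b], y[idx_b]))
    w_a, w_b = np.zeros(d), np.zeros(d)  # identical initialisation
    spread = np.zeros(d)
    for t in range(1, steps + 1):
        for w, (Xb, yb) in zip((w_a, w_b), data):
            grad = Xb.T @ (Xb @ w - yb) / n
            w -= lr * grad  # in-place update of this twin's weights
        if t % reset_every == 0:
            # Divergence since the previous reset reflects local uncertainty.
            spread = np.abs(w_a - w_b)
            mean = 0.5 * (w_a + w_b)
            w_a[:] = mean  # mean-reset keeps both twins in the same basin
            w_b[:] = mean
    return 0.5 * (w_a + w_b), spread

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=200)
w_hat, spread = twin_boot_linear(X, y)
print(w_hat, spread)
```

In a convex problem like this one the reset is not needed to stay in one basin; it is kept here only to show where it sits in the loop.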
Computational Resolution of Hadamard Product Factorization for $4 \times 4$ Matrices
We computationally resolve an open problem concerning the expressibility of $4 \times 4$ full-rank matrices as Hadamard products of two rank-2 matrices. Through exhaustive search over $\mathbb{F}_2$, we identify 5,304 counterexamples among the 20,160 full-rank binary matrices (26.3\%). We verify that these counterexamples remain valid over $\mathbb{Z}$ through sign enumeration and provide strong numerical evidence for their validity over $\mathbb{R}$. Remarkably, our analysis reveals that matrix density (number of ones) is highly predictive of expressibility, achieving 95.7\% classification accuracy. Using modern machine learning techniques, we discover that expressible matrices lie on an approximately 10-dimensional variety within the 16-dimensional ambient space, despite the naive parameter count of 24 (12 parameters each for two $4 \times 4$ rank-2 matrices). This emergent low-dimensional structure suggests deep algebraic constraints governing Hadamard factorizability.
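The figure of 20,160 full-rank binary matrices matches the order of $GL_4(\mathbb{F}_2)$, namely $(2^4-1)(2^4-2)(2^4-4)(2^4-8)$, and can be reproduced by brute force over all $2^{16}$ candidates. The sketch below only performs this count; it does not test Hadamard expressibility, and the function names are illustrative rather than taken from the paper.

```python
from itertools import product

def rank_gf2(rows):
    """Rank over GF(2) of a matrix whose rows are integer bitmasks."""
    basis = []  # reduced vectors, each with a distinct leading bit
    for r in rows:
        for b in basis:
            r = min(r, r ^ b)  # XOR with b only if that clears b's leading bit
        if r:
            basis.append(r)
    return len(basis)

def count_full_rank_4x4():
    """Count 4x4 binary matrices that are invertible over GF(2)."""
    # Each row is a 4-bit integer, so there are 16**4 = 65,536 matrices.
    return sum(1 for rows in product(range(16), repeat=4) if rank_gf2(rows) == 4)

print(count_full_rank_4x4())  # 20160
```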
Classification errors distort findings in automated speech processing: examples and solutions from child-development research
Gautheron, Lucas, Kidd, Evan, Malko, Anton, Lavechin, Marvin, Cristia, Alejandrina
With the advent of wearable recorders, scientists are increasingly turning to automated methods of analysis of audio and video data in order to measure children's experience, behavior, and outcomes, with a sizable literature employing long-form audio-recordings to study language acquisition. While numerous articles report on the accuracy and reliability of the most popular automated classifiers, less has been written on the downstream effects of classification errors on measurements and statistical inferences (e.g., the estimate of correlations and effect sizes in regressions). This paper proposes a Bayesian approach to study the effects of algorithmic errors on key scientific questions, including the effect of siblings on children's language experience and the association between children's production and their input. In both the most commonly used system, LENA, and an open-source alternative (the Voice Type Classifier from the ACLEW system), we find that classification errors can significantly distort estimates. For instance, automated annotations underestimated the negative effect of siblings on adult input by 20--80\%, potentially placing it below statistical significance thresholds. We further show that a Bayesian calibration approach for recovering unbiased estimates of effect sizes can be effective and insightful, but does not provide a fool-proof solution. Both the issue reported and our solution may apply to any classifier involving event detection and classification with non-zero error rates.
Are Virtual DES Images a Valid Alternative to the Real Ones?
Perre, Ana C., Alexandre, Luís A., Freire, Luís C.
Contrast-enhanced spectral mammography (CESM) is an imaging modality that provides two types of images, commonly known as low-energy (LE) and dual-energy subtracted (DES) images. In many domains, particularly in medicine, the emergence of image-to-image translation techniques has enabled the artificial generation of images using other images as input. Within CESM, applying such techniques to generate DES images from LE images could be highly beneficial, potentially reducing patient exposure to radiation associated with high-energy image acquisition. In this study, we investigated three models for the artificial generation of DES images (virtual DES): a pre-trained U-Net model, a U-Net model trained end-to-end, and a CycleGAN model. We also performed a series of experiments to assess the impact of using virtual DES images on the classification of CESM examinations into malignant and non-malignant categories. To our knowledge, this is the first study to evaluate the impact of virtual DES images on CESM lesion classification. The results demonstrate that the best performance was achieved with the pre-trained U-Net model, yielding an F1 score of 85.59% when using the virtual DES images, compared to 90.35% with the real DES images. This discrepancy likely results from the additional diagnostic information in real DES images, which contributes to a higher classification accuracy. Nevertheless, the potential for virtual DES image generation is considerable, and future advancements may narrow this performance gap to a level where exclusive reliance on virtual DES images becomes clinically viable.
High-dimensional Asymptotics of Generalization Performance in Continual Ridge Regression
Zhao, Yihan, Su, Wenqing, Yang, Ying
Continual learning is motivated by the need to adapt to real-world dynamics in tasks and data distribution while mitigating catastrophic forgetting. Despite significant advances in continual learning techniques, the theoretical understanding of their generalization performance lags behind. This paper examines the theoretical properties of continual ridge regression in high-dimensional linear models, where the dimension is proportional to the sample size in each task. Using random matrix theory, we derive exact expressions of the asymptotic prediction risk, thereby enabling the characterization of three evaluation metrics of generalization performance in continual learning: average risk, backward transfer, and forward transfer. Furthermore, we present the theoretical risk curves to illustrate the trends in these evaluation metrics throughout the continual learning process. Our analysis reveals several intriguing phenomena in the risk curves, demonstrating how model specifications influence the generalization performance. Simulation studies are conducted to validate our theoretical findings.
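As a rough illustration of the setting, the sketch below fits a sequence of tasks with a ridge penalty that pulls each new estimate toward the previous one. This is one common way to formulate continual ridge regression and is an assumption here: the abstract does not state the paper's exact estimator, and the function name `continual_ridge` and the penalty form $\lambda\|w - w_{t-1}\|^2$ are illustrative.

```python
import numpy as np

def continual_ridge(tasks, lam=1.0):
    """Sequentially solve min_w (1/n)||y - Xw||^2 + lam * ||w - w_prev||^2.

    `tasks` is a list of (X, y) pairs; returns the estimate after each task.
    Assumed formulation, not necessarily the estimator analysed in the paper.
    """
    d = tasks[0][0].shape[1]
    w = np.zeros(d)  # estimate before any task is seen
    history = []
    for X, y in tasks:
        n = X.shape[0]
        # Normal equations: (X'X/n + lam I) w = X'y/n + lam w_prev
        A = X.T @ X / n + lam * np.eye(d)
        b = X.T @ y / n + lam * w
        w = np.linalg.solve(A, b)
        history.append(w.copy())
    return history

rng = np.random.default_rng(0)
w1, w2 = np.array([1.0, -1.0]), np.array([0.5, 0.5])
X1, X2 = rng.normal(size=(50, 2)), rng.normal(size=(50, 2))
estimates = continual_ridge([(X1, X1 @ w1), (X2, X2 @ w2)], lam=0.5)
print(estimates)
```

A small `lam` lets each task overwrite the previous estimate (more forgetting), while a large `lam` keeps the estimate close to earlier tasks.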
Vision Transformers for Kidney Stone Image Classification: A Comparative Study with CNNs
Reyes-Amezcua, Ivan, Lopez-Tiro, Francisco, Larose, Clement, Mendez-Vazquez, Andres, Ochoa-Ruiz, Gilberto, Daul, Christian
Kidney stone classification from endoscopic images is critical for personalized treatment and recurrence prevention. While convolutional neural networks (CNNs) have shown promise in this task, their limited ability to capture long-range dependencies can hinder performance under variable imaging conditions. This study presents a comparative analysis between Vision Transformers (ViTs) and CNN-based models, evaluating their performance on two ex vivo datasets comprising CCD camera and flexible ureteroscope images. The ViT-base model pretrained on ImageNet-21k consistently outperformed a ResNet50 baseline across multiple imaging conditions. For instance, in the most visually complex subset (Section patches from endoscopic images), the ViT model achieved 95.2% accuracy and 95.1% F1-score, compared to 64.5% and 59.3% with ResNet50. In the mixed-view subset from CCD-camera images, ViT reached 87.1% accuracy versus 78.4% with CNN. These improvements extend across precision and recall as well. The results demonstrate that ViT-based architectures provide superior classification performance and offer a scalable alternative to conventional CNNs for kidney stone image analysis.
A Novel Vascular Risk Scoring Framework for Quantifying Sex-Specific Cerebral Perfusion from 3D pCASL MRI
Noble, Sneha, Sinha, Neelam, Sundareshan, Vaanathi, Issac, Thomas Gregor
The influence of sex and age on cerebral perfusion is recognized, but the specific impacts on regional cerebral blood flow (CBF) and vascular risk remain to be fully characterized. In this study, 3D pseudo-continuous arterial spin labeling (pCASL) MRI was used to identify sex- and age-related CBF patterns, and a vascular risk score (VRS) was developed based on normative perfusion profiles. Perfusion data from 186 cognitively healthy participants (89 males, 97 females; aged 8 to 92 years), obtained from a publicly available dataset, were analyzed. An extension of the 3D Simple Linear Iterative Clustering (SLIC) supervoxel algorithm was applied to CBF maps to group neighboring voxels with similar intensities into anatomically meaningful regions. Regional CBF features were extracted and used to train a convolutional neural network (CNN) for sex classification and perfusion pattern analysis. Global, age-related CBF changes were also assessed. Participant-specific VRS was computed by comparing individual CBF profiles to age- and sex-specific normative data to quantify perfusion deficits. An accuracy of 95% in sex classification was achieved using the proposed supervoxel-based method, and distinct perfusion signatures were identified. Higher CBF was observed in females in medial Brodmann areas 6 and 10, area V5, occipital polar cortex, and insular regions. A global decline in CBF with age was observed in both sexes. Individual perfusion deficits were quantified using VRS, providing a personalized biomarker for early hypoperfusion. Sex- and age-specific CBF patterns were identified, and a personalized vascular risk biomarker was proposed, contributing to advancements in precision neurology.
Keywords: 3D pCASL MRI, CBF, age- and sex-specific perfusion patterns, vascular risk score, cognitively healthy
1. INTRODUCTION
Arterial Spin Labeling (ASL) is a non-invasive Magnetic Resonance Imaging (MRI) technique designed to quantitatively assess cerebral blood flow (CBF) by magnetically labeling endogenous arterial blood water protons without the need for exogenous contrast agents or ionizing radiation [1]. The ASL technique involves three key steps: (i) magnetic labeling of arterial blood proximal to the imaging region, (ii) delivery of magnetically tagged blood to brain tissue, altering the local MR signal, and (iii) acquisition of paired labeled and control images whose subtraction yields perfusion-weighted maps [1].
Revisiting Pre-processing Group Fairness: A Modular Benchmarking Framework
Oldfield, Brodie, Xu, Ziqi, Kandanaarachchi, Sevvandi
As machine learning systems become increasingly integrated into high-stakes decision-making processes, ensuring fairness in algorithmic outcomes has become a critical concern. Methods to mitigate bias typically fall into three categories: pre-processing, in-processing, and post-processing. While significant attention has been devoted to the latter two, pre-processing methods, which operate at the data level and offer advantages such as model-agnosticism and improved privacy compliance, have received comparatively less focus and lack standardised evaluation tools. In this work, we introduce FairPrep, an extensible and modular benchmarking framework designed to evaluate fairness-aware pre-processing techniques on tabular datasets. Built on the AIF360 platform, FairPrep allows seamless integration of datasets, fairness interventions, and predictive models. It features a batch-processing interface that enables efficient experimentation and automatic reporting of fairness and utility metrics. By offering standardised pipelines and supporting reproducible evaluations, FairPrep fills a critical gap in the fairness benchmarking landscape and provides a practical foundation for advancing data-level fairness research.
Enhanced Predictive Modeling for Hazardous Near-Earth Object Detection: A Comparative Analysis of Advanced Resampling Strategies and Machine Learning Algorithms in Planetary Risk Assessment
This study evaluates the performance of several machine learning models for predicting hazardous near-Earth objects (NEOs) through a binary classification framework that includes data scaling, power transformation, and cross-validation. Six classifiers were compared, namely Random Forest Classifier (RFC), Gradient Boosting Classifier (GBC), Support Vector Classifier (SVC), Linear Discriminant Analysis (LDA), Logistic Regression (LR), and K-Nearest Neighbors (KNN). RFC and GBC performed the best, with F2-scores of 0.987 and 0.986, respectively, and very small variability. SVC followed, with a lower but reasonable score of 0.896. LDA and LR had moderate performance, with scores of around 0.749 and 0.748, respectively, while KNN performed poorly, with a score of 0.691, due to difficulty in handling complex data patterns. RFC and GBC also produced confusion matrices with a negligible number of false positives and false negatives, which resulted in accuracy rates of 99.7% and 99.6%, respectively. These findings highlight the power of ensemble methods for high precision and recall and further point out the importance of tailored model selection with regard to dataset characteristics and chosen evaluation metrics. Future research could focus on hyperparameter optimization and advanced feature engineering to further improve the accuracy and robustness of the model on NEO hazard predictions.
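For reference, the F2-score reported above is the $\beta = 2$ case of the general $F_\beta$ measure, $F_\beta = (1+\beta^2)\,PR / (\beta^2 P + R)$, which weights recall $R$ more heavily than precision $P$. The sketch below computes it from confusion-matrix counts; the counts in the usage line are hypothetical and are not taken from the study.

```python
def fbeta_score(tp, fp, fn, beta=2.0):
    """F-beta from confusion-matrix counts; beta > 1 favours recall."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0  # convention when there are no true positives
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical counts: precision = 0.5, recall = 0.8
print(fbeta_score(tp=8, fp=8, fn=2, beta=2.0))
print(fbeta_score(tp=8, fp=8, fn=2, beta=1.0))
```

Because missing a hazardous object (a false negative) is typically costlier than a false alarm, a recall-weighted metric such as F2 is a natural choice for this task.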