Performance Analysis
Quantum Cognition Machine Learning for Forecasting Chromosomal Instability
Di Caro, Giuseppe, Kirakosyan, Vahagn, Abanov, Alexander G., Busemeyer, Jerome R., Candelori, Luca, Hartmann, Nadine, Lam, Ernest T., Musaelian, Kharen, Samson, Ryan, Steinacker, Harold, Villani, Dario, Wells, Martin T., Wenstrup, Richard J., Xu, Mengjia
Unlike traditional tissue tests[1, 2], cell-based liquid biopsy assays enable selection of individual CTCs for the analysis of chromosomal instability using next-generation sequencing by quantification of large-scale state transitions (LST) [3-9]. Chromosomal instability is a genomic characteristic of cancer cells that drives tumor evolution and metastatic potential [10-19]. However, whole genome sequencing assays are laborious, requiring a complex workflow that invariably results in a considerable turnaround time that sometimes is not compatible with clinical practice [20]. A previous study has shown that we can partially predict chromosomal instability in individual cells by developing algorithms that analyze a range of features, including cell shape, size, morphology, and protein levels, from images of CTCs using an automated digital pathology pipeline [3]. Predicting chromosomal instability through morphology offers significant advantages; it can significantly reduce turnaround times compared to whole-genome assays, providing crucial information about the genomic characteristics of CTCs in a patient in a shorter timeframe [3]. Timely information on the presence of CTCs with the highest metastatic potential may be critical for making optimal clinical decisions. A key challenge in predicting chromosomal instability through morphology is the utilization of a machine-learning method that accurately classifies morphology patterns from all CTC features and provides a generalization and reproducibility, compatible with potential validation for clinical use [21-24]. Key limitations of commonly used machine learning techniques in biology applications, such as support vector machines (SVMs) with Gaussian kernels, include the following [21-24]: 1) The increase in dimensionality that arises from combinations of multiple features exponentially complicates the prediction task, as often seen with cell morphologies.
Evaluating Intermediate Reasoning of Code-Assisted Large Language Models for Mathematics
Al-Khalili, Zena, Howell, Nick, Klakow, Dietrich
Assisting LLMs with code generation improved their performance on mathematical reasoning tasks. However, the evaluation of code-assisted LLMs is generally restricted to execution correctness, lacking a rigorous evaluation of their generated programs. In this work, we bridge this gap by conducting an in-depth analysis of code-assisted LLMs generated programs in response to math reasoning tasks, with a focus on evaluating the soundness of the underlying reasoning processes. For this purpose, we assess the generations of five LLMs, on several math datasets, both manually and automatically, and propose a taxonomy of generated programs based on their logical soundness. Our findings show that the capabilities of models significantly impact the logic implemented to solve the problem. Closed-source LLMs ground their programs in mathematical concepts, whereas open-source models often resort to unsound reasoning, relying on memorized information and exhaustive searches. Furthermore, increasing the difficulty of problems decreases sound generations for all models, revealing a critical shortcoming of LLMs on complex mathematics, contrary to what accuracy metrics suggest. Our work highlights the need for more holistic evaluations of code-assisted LLMs beyond execution accuracy metrics, toward a better understanding of LLMs' limits in the math domain.
Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty
Damani, Mehul, Puri, Isha, Slocum, Stewart, Shenfeld, Idan, Choshen, Leshem, Kim, Yoon, Andreas, Jacob
When language models (LMs) are trained via reinforcement learning (RL) to generate natural language "reasoning chains", their performance improves on a variety of difficult question answering tasks. Today, almost all successful applications of RL for reasoning use binary reward functions that evaluate the correctness of LM outputs. Because such reward functions do not penalize guessing or low-confidence outputs, they often have the unintended side-effect of degrading calibration and increasing the rate at which LMs generate incorrect responses (or "hallucinate") in other problem domains. This paper describes RLCR (Reinforcement Learning with Calibration Rewards), an approach to training reasoning models that jointly improves accuracy and calibrated confidence estimation. During RLCR, LMs generate both predictions and numerical confidence estimates after reasoning. They are trained to optimize a reward function that augments a binary correctness score with a Brier score -- a scoring rule for confidence estimates that incentivizes calibrated prediction. We first prove that this reward function (or any analogous reward function that uses a bounded, proper scoring rule) yields models whose predictions are both accurate and well-calibrated. We next show that across diverse datasets, RLCR substantially improves calibration with no loss in accuracy, on both in-domain and out-of-domain evaluations -- outperforming both ordinary RL training and classifiers trained to assign post-hoc confidence scores. While ordinary RL hurts calibration, RLCR improves it. Finally, we demonstrate that verbalized confidence can be leveraged at test time to improve accuracy and calibration via confidence-weighted scaling methods. Our results show that explicitly optimizing for calibration can produce more generally reliable reasoning models.
ICR Probe: Tracking Hidden State Dynamics for Reliable Hallucination Detection in LLMs
Zhang, Zhenliang, Hu, Xinyu, Zhang, Huixuan, Zhang, Junzhe, Wan, Xiaojun
Large language models (LLMs) excel at various natural language processing tasks, but their tendency to generate hallucinations undermines their reliability. Existing hallucination detection methods leveraging hidden states predominantly focus on static and isolated representations, overlooking their dynamic evolution across layers, which limits efficacy. To address this limitation, we shift the focus to the hidden state update process and introduce a novel metric, the ICR Score (Information Contribution to Residual Stream), which quantifies the contribution of modules to the hidden states' update. We empirically validate that the ICR Score is effective and reliable in distinguishing hallucinations. Building on these insights, we propose a hallucination detection method, the ICR Probe, which captures the cross-layer evolution of hidden states. Experimental results show that the ICR Probe achieves superior performance with significantly fewer parameters. Furthermore, ablation studies and case analyses offer deeper insights into the underlying mechanism of this method, improving its interpretability.
Improving Predictions on Highly Unbalanced Data Using Open Source Synthetic Data Upsampling
Krchova, Ivona, Platzer, Michael, Tiwald, Paul
Unbalanced tabular data sets present significant challenges for predictive modeling and data analysis across a wide range of applications. In many real-world scenarios, such as fraud detection, medical diagnosis, and rare event prediction, minority classes are vastly underrepresented, making it difficult for traditional machine learning algorithms to achieve high accuracy. These algorithms tend to favor the majority class, leading to biased models that struggle to accurately represent minority classes. Synthetic data holds promise for addressing the under-representation of minority classes by providing new, diverse, and highly realistic samples. This paper presents a benchmark study on the use of AI-generated synthetic data for upsampling highly unbalanced tabular data sets. We evaluate the effectiveness of an open-source solution, the Synthetic Data SDK by MOSTLY AI, which provides a flexible and user-friendly approach to synthetic upsampling for mixed-type data. We compare predictive models trained on data sets upsampled with synthetic records to those using standard methods, such as naive oversampling and SMOTE-NC. Our results demonstrate that synthetic data can improve predictive accuracy for minority groups by generating diverse data points that fill gaps in sparse regions of the feature space. We show that upsampled synthetic training data consistently results in top-performing predictive models, particularly for mixed-type data sets containing very few minority samples.
Feature Construction Using Network Control Theory and Rank Encoding for Graph Machine Learning
Said, Anwar, Wei, Yifan, Ahmad, Obaid Ullah, Shabbir, Mudassir, Abbas, Waseem, Koutsoukos, Xenofon
In this article, we utilize the concept of average controllability in graphs, along with a novel rank encoding method, to enhance the performance of Graph Neural Networks (GNNs) in social network classification tasks. GNNs have proven highly effective in various network-based learning applications and require some form of node features to function. However, their performance is heavily influenced by the expressiveness of these features. In social networks, node features are often unavailable due to privacy constraints or the absence of inherent attributes, making it challenging for GNNs to achieve optimal performance. To address this limitation, we propose two strategies for constructing expressive node features. First, we introduce average controllability along with other centrality metrics (denoted as NCT-EFA) as node-level metrics that capture critical aspects of network topology. Building on this, we develop a rank encoding method that transforms average controllability or any other graph-theoretic metric into a fixed-dimensional feature space, thereby improving feature representation. We conduct extensive numerical evaluations using six benchmark GNN models across four social network datasets to compare different node feature construction methods. Our results demonstrate that incorporating average controllability into the feature space significantly improves GNN performance. Moreover, the proposed rank encoding method outperforms traditional one-hot degree encoding, improving the ROC AUC from 68.7% to 73.9% using GraphSAGE on the GitHub Stargazers dataset, underscoring its effectiveness in generating expressive and efficient node representations.
Predictive Representativity: Uncovering Racial Bias in AI-based Skin Cancer Detection
Morales-Forero, Andrés, Rueda, Lili J., Herrera, Ronald, Bassetto, Samuel, Coatanea, Eric
Artificial intelligence (AI) systems increasingly inform medical decision-making, yet concerns about algorithmic bias and inequitable outcomes persist, particularly for historically marginalized populations. This paper introduces the concept of Predictive Representativity (PR), a framework of fairness auditing that shifts the focus from the composition of the data set to outcomes-level equity. Through a case study in dermatology, we evaluated AI-based skin cancer classifiers trained on the widely used HAM10000 dataset and on an independent clinical dataset (BOSQUE Test set) from Colombia. Our analysis reveals substantial performance disparities by skin phototype, with classifiers consistently underperforming for individuals with darker skin, despite proportional sampling in the source data. We argue that representativity must be understood not as a static feature of datasets but as a dynamic, context-sensitive property of model predictions. PR operationalizes this shift by quantifying how reliably models generalize fairness across subpopulations and deployment contexts. We further propose an External Transportability Criterion that formalizes the thresholds for fairness generalization. Our findings highlight the ethical imperative for post-hoc fairness auditing, transparency in dataset documentation, and inclusive model validation pipelines. This work offers a scalable tool for diagnosing structural inequities in AI systems, contributing to discussions on equity, interpretability, and data justice and fostering a critical re-evaluation of fairness in data-driven healthcare.
Complex Dynamics in Psychological Data: Mapping Individual Symptom Trajectories to Group-Level Patterns
Vitanza, Eleonora, DeLellis, Pietro, Mocenni, Chiara, Marin, Manuel Ruiz
This study integrates causal inference, graph analysis, temporal complexity measures, and machine learning to examine whether individual symptom trajectories can reveal meaningful diagnostic patterns. Testing on a longitudinal dataset of N=45 individuals affected by General Anxiety Disorder (GAD) and/or Major Depressive Disorder (MDD) derived from Fisher et al. 2017, we propose a novel pipeline for the analysis of the temporal dynamics of psychopathological symptoms. First, we employ the PCMCI+ algorithm with nonparametric independence test to determine the causal network of nonlinear dependencies between symptoms in individuals with different mental disorders. We found that the PCMCI+ effectively highlights the individual peculiarities of each symptom network, which could be leveraged towards personalized therapies. At the same time, aggregating the networks by diagnosis sheds light to disorder-specific causal mechanisms, in agreement with previous psychopathological literature. Then, we enrich the dataset by computing complexity-based measures (e.g. entropy, fractal dimension, recurrence) from the symptom time series, and feed it to a suitably selected machine learning algorithm to aid the diagnosis of each individual. The new dataset yields 91% accuracy in the classification of the symptom dynamics, proving to be an effective diagnostic support tool. Overall, these findings highlight how integrating causal modeling and temporal complexity can enhance diagnostic differentiation, offering a principled, data-driven foundation for both personalized assessment in clinical psychology and structural advances in psychological research.
Exact Reformulation and Optimization for Direct Metric Optimization in Binary Imbalanced Classification
Peng, Le, Travadi, Yash, He, Chuan, Cui, Ying, Sun, Ju
For classification with imbalanced class frequencies, i.e., imbalanced classification (IC), standard accuracy is known to be misleading as a performance measure. While most existing methods for IC resort to optimizing balanced accuracy (i.e., the average of class-wise recalls), they fall short in scenarios where the significance of classes varies or certain metrics should reach prescribed levels. In this paper, we study two key classification metrics, precision and recall, under three practical binary IC settings: fix precision optimize recall (FPOR), fix recall optimize precision (FROP), and optimize $F_β$-score (OFBS). Unlike existing methods that rely on smooth approximations to deal with the indicator function involved, \textit{we introduce, for the first time, exact constrained reformulations for these direct metric optimization (DMO) problems}, which can be effectively solved by exact penalty methods. Experiment results on multiple benchmark datasets demonstrate the practical superiority of our approach over the state-of-the-art methods for the three DMO problems. We also expect our exact reformulation and optimization (ERO) framework to be applicable to a wide range of DMO problems for binary IC and beyond. Our code is available at https://github.com/sun-umn/DMO.
Beyond Sin-Squared Error: Linear-Time Entrywise Uncertainty Quantification for Streaming PCA
Kumar, Syamantak, Pandey, Shourya, Sarkar, Purnamrita
We propose a novel statistical inference framework for streaming principal component analysis (PCA) using Oja's algorithm, enabling the construction of confidence intervals for individual entries of the estimated eigenvector. Most existing works on streaming PCA focus on providing sharp sin-squared error guarantees. Recently, there has been some interest in uncertainty quantification for the sin-squared error. However, uncertainty quantification or sharp error guarantees for entries of the estimated eigenvector in the streaming setting remains largely unexplored. We derive a sharp Bernstein-type concentration bound for elements of the estimated vector matching the optimal error rate up to logarithmic factors. We also establish a Central Limit Theorem for a suitably centered and scaled subset of the entries. To efficiently estimate the coordinate-wise variance, we introduce a provably consistent subsampling algorithm that leverages the median-of-means approach, empirically achieving similar accuracy to multiplier bootstrap methods while being significantly more computationally efficient. Numerical experiments demonstrate its effectiveness in providing reliable uncertainty estimates with a fraction of the computational cost of existing methods.