Performance Analysis
Hierarchical Scoring for Machine Learning Classifier Error Impact Evaluation
Lanus, Erin, Wolodkin, Daniel, Freeman, Laura J.
A common use of machine learning (ML) models is predicting the class of a sample. Object detection is an extension of classification that includes localization of the object via a bounding box within the sample. Classification, and by extension object detection, is typically evaluated by counting a prediction as incorrect if the predicted label does not match the ground truth label. This pass/fail scoring treats all misclassifications as equivalent. In many cases, class labels can be organized into a class taxonomy with a hierarchical structure to either reflect relationships among the data or operator valuation of misclassifications. When such a hierarchical structure exists, hierarchical scoring metrics can return the model performance of a given prediction related to the distance between the prediction and the ground truth label. Such metrics can be viewed as giving partial credit to predictions instead of pass/fail, enabling a finer-grained understanding of the impact of misclassifications. This work develops hierarchical scoring metrics varying in complexity that utilize scoring trees to encode relationships between class labels and produce metrics that reflect distance in the scoring tree. The scoring metrics are demonstrated on an abstract use case with scoring trees that represent three weighting strategies and evaluated by the kind of errors discouraged. Results demonstrate that these metrics capture errors with finer granularity and the scoring trees enable tuning. This work demonstrates an approach to evaluating ML performance that ranks models not only by how many errors are made but by the kind or impact of errors. Python implementations of the scoring metrics will be available in an open-source repository at time of publication.
Detection of Autonomic Dysreflexia in Individuals With Spinal Cord Injury Using Multimodal Wearable Sensors
Fuchs, Bertram, Ejtehadi, Mehdi, Cisnal, Ana, Pannek, Jรผrgen, Scheel-Sailer, Anke, Riener, Robert, Eriks-Hoogland, Inge, Paez-Granados, Diego
Autonomic Dysreflexia (AD) is a potentially life-threatening condition characterized by sudden, severe blood pressure (BP) spikes in individuals with spinal cord injury (SCI). Early, accurate detection is essential to prevent cardiovascular complications, yet current monitoring methods are either invasive or rely on subjective symptom reporting, limiting applicability in daily file. This study presents a non-invasive, explainable machine learning framework for detecting AD using multimodal wearable sensors. Data were collected from 27 individuals with chronic SCI during urodynamic studies, including electrocardiography (ECG), photoplethysmography (PPG), bioimpedance (BioZ), temperature, respiratory rate (RR), and heart rate (HR), across three commercial devices. Objective AD labels were derived from synchronized cuff-based BP measurements. Following signal preprocessing and feature extraction, BorutaSHAP was used for robust feature selection, and SHAP values for explainability. We trained modality- and device-specific weak learners and aggregated them using a stacked ensemble meta-model. Cross-validation was stratified by participants to ensure generalizability. HR- and ECG-derived features were identified as the most informative, particularly those capturing rhythm morphology and variability. The Nearest Centroid ensemble yielded the highest performance (Macro F1 = 0.77+/-0.03), significantly outperforming baseline models. Among modalities, HR achieved the highest area under the curve (AUC = 0.93), followed by ECG (0.88) and PPG (0.86). RR and temperature features contributed less to overall accuracy, consistent with missing data and low specificity. The model proved robust to sensor dropout and aligned well with clinical AD events. These results represent an important step toward personalized, real-time monitoring for individuals with SCI.
Heterogeneity-Oblivious Robust Federated Learning
Zhang, Weiyao, Li, Jinyang, Song, Qi, Wang, Miao, Lin, Chungang, Luo, Haitong, Meng, Xuying, Zhang, Yujun
--Federated Learning (FL) remains highly vulnerable to poisoning attacks, especially under real-world hyper-heterogeneity, where clients differ significantly in data distributions, communication capabilities, and model architectures. Such heterogeneity not only undermines the effectiveness of aggregation strategies but also makes attacks more difficult to detect. Furthermore, high-dimensional models expand the attack surface. T o address these challenges, we propose Horus, a heterogeneity-oblivious robust FL framework centered on low-rank adaptations (LoRAs). Rather than aggregating full model parameters, Horus inserts LoRAs into empirically stable layers and aggregates only LoRAs to reduce the attack surface. We uncover a key empirical observation that the input projection (LoRA-A) is markedly more stable than the output projection (LoRA-B) under heterogeneity and poisoning. Leveraging this, we design a Heterogeneity-Oblivious Poisoning Score using the features from LoRA-A to filter poisoned clients. For the remaining benign clients, we propose projection-aware aggregation mechanism to preserve collaborative signals while suppressing drifts, which reweights client updates by consistency with the global directions. Extensive experiments across diverse datasets, model architectures, and attacks demonstrate that Horus consistently outperforms state-of-the-art baselines in both robustness and accuracy. Federated Learning (FL) has gained significant traction as a privacy-preserving paradigm for distributed training, enabling clients to collaboratively learn a global model without sharing their raw data [12], [20]. However, the decentralized nature of FL inherently introduces serious security vulnerabilities, making it susceptible to poisoning attacks, in which attackers inject malicious data or local updates. Such attacks pose a particularly insidious threat, as they can stealthily degrade or manipulate the global model over time [29]. For example, perturbing a federated model deployed in vehicular systems could autonomously start the vehicle or execute an emergency brake, thereby endangering human lives and compromising property safety [24].
Enhancing Multi-view Open-set Learning via Ambiguity Uncertainty Calibration and View-wise Debiasing
Fang, Zihan, Xu, Zhiyong, Du, Lan, Du, Shide, Cai, Zhiling, Wang, Shiping
Existing multi-view learning models struggle in open-set scenarios due to their implicit assumption of class completeness. Moreover, static view-induced biases, which arise from spurious view-label associations formed during training, further degrade their ability to recognize unknown categories. In this paper, we propose a multi-view open-set learning framework via ambiguity uncertainty calibration and view-wise debiasing. To simulate ambiguous samples, we design O-Mix, a novel synthesis strategy to generate virtual samples with calibrated open-set ambiguity uncertainty. These samples are further processed by an auxiliary ambiguity perception network that captures atypical patterns for improved open-set adaptation. Furthermore, we incorporate an HSIC-based contrastive debiasing module that enforces independence between view-specific ambiguous and view-consistent representations, encouraging the model to learn generalizable features. Extensive experiments on diverse multi-view benchmarks demonstrate that the proposed framework consistently enhances unknown-class recognition while preserving strong closed-set performance.
funOCLUST: Clustering Functional Data with Outliers
Clark, Katharine M., McNicholas, Paul D.
An extension of the OCLUST algorithm to the functional setting is proposed to address these issue s. The approach leverages the OCLUST framework, creating a robust method to cluster cu rves and trim outliers. The methodology is evaluated on both simulated and real-wor ld functional datasets, demonstrating strong performance in clustering and outlie r identification.
Memorization in Fine-Tuned Large Language Models
This study investigates the mechanisms and factors influencing memorization in fine-tuned large language models (LLMs), with a focus on the medical domain due to its privacy-sensitive nature. We examine how different aspects of the fine-tuning process affect a model's propensity to memorize training data, using the PHEE dataset of pharmacovigilance events. Our research employs two main approaches: a membership inference attack to detect memorized data, and a generation task with prompted prefixes to assess verbatim reproduction. We analyze the impact of adapting different weight matrices in the transformer architecture, the relationship between perplexity and memorization, and the effect of increasing the rank in low-rank adaptation (LoRA) fine-tuning. Key findings include: (1) Value and Output matrices contribute more significantly to memorization compared to Query and Key matrices; (2) Lower perplexity in the fine-tuned model correlates with increased memorization; (3) Higher LoRA ranks lead to increased memorization, but with diminishing returns at higher ranks. These results provide insights into the trade-offs between model performance and privacy risks in fine-tuned LLMs. Our findings have implications for developing more effective and responsible strategies for adapting large language models while managing data privacy concerns.
Leveraging Vision-Language Models for Visual Grounding and Analysis of Automotive UI
Ernhofer, Benjamin Raphael, Prokhorov, Daniil, Langner, Jannica, Bollmann, Dominik
Modern automotive infotainment systems necessitate intelligent and adaptive solutions to manage frequent User Interface (UI) updates and diverse design variations. This work introduces a vision-language framework to facilitate the understanding of and interaction with automotive UIs, enabling seamless adaptation across different UI designs. To support research in this field, AutomotiveUI-Bench-4K, an open-source dataset comprising 998 images with 4,208 annotations, is also released. Additionally, a data pipeline for generating training data is presented. A Molmo-7B-based model is fine-tuned using Low-Rank Adaptation (LoRa), incorporating generated reasoning along with visual grounding and evaluation capabilities. The fine-tuned Evaluative Large Action Model (ELAM) achieves strong performance on AutomotiveUI-Bench-4K (model and dataset are available on Hugging Face). The approach demonstrates strong cross-domain generalization, including a +5.6% improvement on ScreenSpot over the baseline model. An average accuracy of 80.8% is achieved on ScreenSpot, closely matching or surpassing specialized models for desktop, mobile, and web, despite being trained primarily on the automotive domain. This research investigates how data collection and subsequent fine-tuning can lead to AI-driven advancements in automotive UI understanding and interaction. The applied method is cost-efficient, and fine-tuned models can be deployed on consumer-grade GPUs.
Error Detection and Correction for Interpretable Mathematics in Large Language Models
Yang, Yijin, Cornelio, Cristina, Leiva, Mario, Shakarian, Paulo
Recent large language models (LLMs) have demonstrated the ability to perform explicit multi-step reasoning such as chain-of-thought prompting. However, their intermediate steps often contain errors that can propagate leading to inaccurate final predictions. Additionally, LLMs still struggle with hallucinations and often fail to adhere to prescribed output formats, which is particularly problematic for tasks like generating mathematical expressions or source code. This work introduces EDCIM (Error Detection and Correction for Interpretable Mathematics), a method for detecting and correcting these errors in interpretable mathematics tasks, where the model must generate the exact functional form that explicitly solve the problem (expressed in natural language) rather than a black-box solution. EDCIM uses LLMs to generate a system of equations for a given problem, followed by a symbolic error-detection framework that identifies errors and provides targeted feedback for LLM-based correction. To optimize efficiency, EDCIM integrates lightweight, open-source LLMs with more powerful proprietary models, balancing cost and accuracy. This balance is controlled by a single hyperparameter, allowing users to control the trade-off based on their cost and accuracy requirements. Experimental results across different datasets show that EDCIM significantly reduces both computational and financial costs, while maintaining, and even improving, prediction accuracy when the balance is properly configured.
Software Fairness Dilemma: Is Bias Mitigation a Zero-Sum Game?
Chen, Zhenpeng, Li, Xinyue, Zhang, Jie M., Sun, Weisong, Xiao, Ying, Li, Tianlin, Lou, Yiling, Liu, Yang
Fairness is a critical requirement for Machine Learning (ML) software, driving the development of numerous bias mitigation methods. Previous research has identified a leveling-down effect in bias mitigation for computer vision and natural language processing tasks, where fairness is achieved by lowering performance for all groups without benefiting the unprivileged group. However, it remains unclear whether this effect applies to bias mitigation for tabular data tasks, a key area in fairness research with significant real-world applications. This study evaluates eight bias mitigation methods for tabular data, including both widely used and cutting-edge approaches, across 44 tasks using five real-world datasets and four common ML models. Contrary to earlier findings, our results show that these methods operate in a zero-sum fashion, where improvements for unprivileged groups are related to reduced benefits for traditionally privileged groups. However, previous research indicates that the perception of a zero-sum trade-off might complicate the broader adoption of fairness policies. To explore alternatives, we investigate an approach that applies the state-of-the-art bias mitigation method solely to unprivileged groups, showing potential to enhance benefits of unprivileged groups without negatively affecting privileged groups or overall ML performance. Our study highlights potential pathways for achieving fairness improvements without zero-sum trade-offs, which could help advance the adoption of bias mitigation methods.
Strategic Hypothesis Testing
Hossain, Safwan, Chen, Yatong, Chen, Yiling
We examine hypothesis testing within a principal-agent framework, where a strategic agent, holding private beliefs about the effectiveness of a product, submits data to a principal who decides on approval. The principal employs a hypothesis testing rule, aiming to pick a p-value threshold that balances false positives and false negatives while anticipating the agent's incentive to maximize expected profitability. Building on prior work, we develop a game-theoretic model that captures how the agent's participation and reporting behavior respond to the principal's statistical decision rule. Despite the complexity of the interaction, we show that the principal's errors exhibit clear monotonic behavior when segmented by an efficiently computable critical p-value threshold, leading to an interpretable characterization of their optimal p-value threshold. We empirically validate our model and these insights using publicly available data on drug approvals. Overall, our work offers a comprehensive perspective on strategic interactions within the hypothesis testing framework, providing technical and regulatory insights.