Performance Analysis
Happiness as a Measure of Fairness
Pichler, Georg, Romanelli, Marco, Piantanida, Pablo
In this paper, we propose a novel fairness framework grounded in the concept of happiness, a measure of the utility each group gains from decision outcomes. By capturing fairness through this intuitive lens, we not only offer a more human-centered approach, but also one that is mathematically rigorous: In order to compute the optimal, fair post-processing strategy, only a linear program needs to be solved. This makes our method both efficient and scalable with existing optimization tools. Furthermore, it unifies and extends several well-known fairness definitions, and our empirical results highlight its practical strengths across diverse scenarios.
SHAP values through General Fourier Representations: Theory and Applications
This article establishes a rigorous spectral framework for the mathematical analysis of SHAP values. We show that any predictive model defined on a discrete or multi-valued input space admits a generalized Fourier expansion with respect to an orthonormalisation tensor-product basis constructed under a product probability measure. Within this setting, each SHAP attribution can be represented as a linear functional of the model's Fourier coefficients. Two complementary regimes are studied. In the deterministic regime, we derive quantitative stability estimates for SHAP values under Fourier truncation, showing that the attribution map is Lipschitz continuous with respect to the distance between predictors. In the probabilistic regime, we consider neural networks in their infinite-width limit and prove convergence of SHAP values toward those induced by the corresponding Gaussian process prior, with explicit error bounds in expectation and with high probability based on concentration inequalities. We also provide a numerical experiment on a clinical unbalanced dataset to validate the theoretical findings.
Aeolus: A Multi-structural Flight Delay Dataset
Xu, Lin, Yuan, Xinyun, Liang, Yuxuan, Yin, Suwan, Wu, Yuankai
We introduce Aeolus, a large-scale Multi-modal Flight Delay Dataset designed to advance research on flight delay prediction and support the development of foundation models for tabular data. Existing datasets in this domain are typically limited to flat tabular structures and fail to capture the spatiotemporal dynamics inherent in delay propagation. Aeolus addresses this limitation by providing three aligned modalities: (i) a tabular dataset with rich operational, meteorological, and airportlevel features for over 50 million flights; (ii) a flight chain module that models delay propagation along sequential flight legs, capturing upstream and downstream dependencies; and (iii) a flight network graph that encodes shared aircraft, crew, and airport resource connections, enabling cross-flight relational reasoning. The dataset is carefully constructed with temporal splits, comprehensive features, and strict leakage prevention to support realistic and reproducible machine learning evaluation. Aeolus supports a broad range of tasks, including regression, classification, temporal structure modeling, and graph learning, serving as a unified benchmark across tabular, sequential, and graph modalities. We release baseline experiments and preprocessing tools to facilitate adoption. Aeolus fills a key gap for both domain-specific modeling and general-purpose structured data research.Our source code and data can be accessed at https://github.com/Flnny/Delay-data
A Machine Learning-Based Framework to Shorten the Questionnaire for Assessing Autism Intervention
Dong, Audrey, Xu, Claire, Guo, Samuel R., Yang, Kevin, Kong, Xue-Jun
Caregivers of individuals with autism spectrum disorder (ASD) often find the 77-item Autism Treatment Evaluation Checklist (ATEC) burdensome, limiting its use for routine monitoring. This study introduces a generalizable machine learning framework that seeks to shorten assessments while maintaining evaluative accuracy. Using longitudinal ATEC data from 60 autistic children receiving therapy, we applied feature selection and cross-validation techniques to identify the most predictive items across two assessment goals: longitudinal therapy tracking and point-in-time severity estimation. For progress monitoring, the framework identified 16 items (21% of the original questionnaire) that retained strong correlation with total score change and full subdomain coverage. We also generated smaller subsets (1-7 items) for efficient approximations. For point-in-time severity assessment, our model achieved over 80% classification accuracy using just 13 items (17% of the original set). While demonstrated on ATEC, the methodology-based on subset optimization, model interpretability, and statistical rigor-is broadly applicable to other high-dimensional psychometric tools. The resulting framework could potentially enable more accessible, frequent, and scalable assessments and offer a data-driven approach for AI-supported interventions across neurodevelopmental and psychiatric contexts.
Rethinking Metrics and Benchmarks of Video Anomaly Detection
Liu, Zihao, Wu, Xiaoyu, Li, Wenna, Yang, Linlin, Wang, Shengjin
Video Anomaly Detection (VAD), which aims to detect anomalies that deviate from expectation, has attracted increasing attention in recent years. Existing advancements in VAD primarily focus on model architectures and training strategies, while devoting insufficient attention to evaluation metrics and benchmarks. In this paper, we rethink VAD evaluation methods through comprehensive analyses, revealing three critical limitations in current practices: 1) existing metrics are significantly influenced by single annotation bias; 2) current metrics fail to reward early detection of anomalies; 3) available benchmarks lack the capability to evaluate scene overfitting of fully/weakly-supervised algorithms. To address these limitations, we propose three novel evaluation methods: first, we establish probabilistic AUC/AP (Prob-AUC/AP) metrics utlizing multi-round annotations to mitigate single annotation bias; second, we develop a Latency-aware Average Precision (LaAP) metric that rewards early and accurate anomaly detection; and finally, we introduce two hard normal benchmarks (UCF-HN, MSAD-HN) with videos specifically designed to evaluate scene overfitting. We report performance comparisons of ten state-of-the-art VAD approaches using our proposed evaluation methods, providing novel perspectives for future VAD model development. We release our data and code in https://github.com/Kamino666/RethinkingVAD.
Fair Play for Individuals, Foul Play for Groups? Auditing Anonymization's Impact on ML Fairness
Arcolezi, Hรฉber H., Alishahi, Mina, Bendoukha, Adda-Akram, Kaaniche, Nesrine
Machine learning (ML) algorithms are heavily based on the availability of training data, which, depending on the domain, often includes sensitive information about data providers. This raises critical privacy concerns. Anonymization techniques have emerged as a practical solution to address these issues by generalizing features or suppressing data to make it more difficult to accurately identify individuals. Although recent studies have shown that privacy-enhancing technologies can influence ML predictions across different subgroups, thus affecting fair decision-making, the specific effects of anonymization techniques, such as $k$-anonymity, $\ell$-diversity, and $t$-closeness, on ML fairness remain largely unexplored. In this work, we systematically audit the impact of anonymization techniques on ML fairness, evaluating both individual and group fairness. Our quantitative study reveals that anonymization can degrade group fairness metrics by up to fourfold. Conversely, similarity-based individual fairness metrics tend to improve under stronger anonymization, largely as a result of increased input homogeneity. By analyzing varying levels of anonymization across diverse privacy settings and data distributions, this study provides critical insights into the trade-offs between privacy, fairness, and utility, offering actionable guidelines for responsible AI development. Our code is publicly available at: https://github.com/hharcolezi/anonymity-impact-fairness.
Dark-Field X-Ray Imaging Significantly Improves Deep-Learning based Detection of Synthetic Early-Stage Lung Tumors in Preclinical Models
Dey, Joyoni, Meyer, Hunter C., Taqi, Murtuza S.
Low-dose computed tomography (LDCT) is the current standard for lung cancer screening, yet its adoption and accessibility remain limited. Many regions lack LDCT infrastructure, and even among those screened, early-stage cancer detection often yield false positives, as shown in the National Lung Screening Trial (NLST) with a sensitivity of 93.8 percent and a false-positive rate of 26.6 percent. We aim to investigate whether X-ray dark-field imaging (DFI) radiograph, a technique sensitive to small-angle scatter from alveolar microstructure and less susceptible to organ shadowing, can significantly improve early-stage lung tumor detection when coupled with deep-learning segmentation. Using paired attenuation (ATTN) and DFI radiograph images of euthanized mouse lungs, we generated realistic synthetic tumors with irregular boundaries and intensity profiles consistent with physical lung contrast. A U-Net segmentation network was trained on small patches using either ATTN, DFI, or a combination of ATTN and DFI channels. Results show that the DFI-only model achieved a true-positive detection rate of 83.7 percent, compared with 51 percent for ATTN-only, while maintaining comparable specificity (90.5 versus 92.9 percent). The combined ATTN and DFI input achieved 79.6 percent sensitivity and 97.6 percent specificity. In conclusion, DFI substantially improves early-tumor detectability in comparison to standard attenuation radiography and shows potential as an accessible, low-cost, low-dose alternative for pre-clinical or limited-resource screening where LDCT is unavailable.
Reconstructing Unseen Sentences from Speech-related Biosignals for Open-vocabulary Neural Communication
Kim, Deok-Seon, Lee, Seo-Hyun, Yin, Kang, Lee, Seong-Whan
Brain-to-speech (BTS) systems represent a groundbreaking approach to human communication by enabling the direct transformation of neural activity into linguistic expressions. While recent non-invasive BTS studies have largely focused on decoding predefined words or sentences, achieving open-vocabulary neural communication comparable to natural human interaction requires decoding unconstrained speech. Additionally, effectively integrating diverse signals derived from speech is crucial for developing personalized and adaptive neural communication and rehabilitation solutions for patients. This study investigates the potential of speech synthesis for previously unseen sentences across various speech modes by leveraging phoneme-level information extracted from high-density electroencephalography (EEG) signals, both independently and in conjunction with electromyography (EMG) signals. Furthermore, we examine the properties affecting phoneme decoding accuracy during sentence reconstruction and offer neurophysiological insights to further enhance EEG decoding for more effective neural communication solutions. Our findings underscore the feasibility of biosignal-based sentence-level speech synthesis for reconstructing unseen sentences, highlighting a significant step toward developing open-vocabulary neural communication systems adapted to diverse patient needs and conditions. Additionally, this study provides meaningful insights into the development of communication and rehabilitation solutions utilizing EEG-based decoding technologies.
Mind the Gaps: Auditing and Reducing Group Inequity in Large-Scale Mobility Prediction
Kumar, Ashwin, Zhang, Hanyu, Schweidel, David A., Yeoh, William
Next location prediction underpins a growing number of mobility, retail, and public-health applications, yet its societal impacts remain largely unexplored. In this paper, we audit state-of-the-art mobility prediction models trained on a large-scale dataset, highlighting hidden disparities based on user demographics. Drawing from aggregate census data, we compute the difference in predictive performance on racial and ethnic user groups and show a systematic disparity resulting from the underlying dataset, resulting in large differences in accuracy based on location and user groups. To address this, we propose Fairness-Guided Incremental Sampling (FGIS), a group-aware sampling strategy designed for incremental data collection settings. Because individual-level demographic labels are unavailable, we introduce Size-Aware K-Means (SAKM), a clustering method that partitions users in latent mobility space while enforcing census-derived group proportions. This yields proxy racial labels for the four largest groups in the state: Asian, Black, Hispanic, and White. Built on these labels, our sampling algorithm prioritizes users based on expected performance gains and current group representation. This method incrementally constructs training datasets that reduce demographic performance gaps while preserving overall accuracy. Our method reduces total disparity between groups by up to 40\% with minimal accuracy trade-offs, as evaluated on a state-of-art MetaPath2Vec model and a transformer-encoder model. Improvements are most significant in early sampling stages, highlighting the potential for fairness-aware strategies to deliver meaningful gains even in low-resource settings. Our findings expose structural inequities in mobility prediction pipelines and demonstrate how lightweight, data-centric interventions can improve fairness with little added complexity, especially for low-data applications.
Category-Aware Semantic Caching for Heterogeneous LLM Workloads
Wang, Chen, Liu, Xunzhuo, Zhu, Yue, Youssef, Alaa, Nagpurkar, Priya, Chen, Huamin
LLM serving systems process heterogeneous query workloads where different categories exhibit different characteristics. Code queries cluster densely in embedding space while conversational queries distribute sparsely. Content staleness varies from minutes (stock data) to months (code patterns). Query repetition patterns range from power-law (code) to uniform (conversation), producing long tail cache hit rate distributions: high-repetition categories achieve 40-60% hit rates while low-repetition or volatile categories achieve 5-15% hit rates. Vector databases must exclude the long tail because remote search costs (30ms) require 15--20% hit rates to break even, leaving 20-30% of production traffic uncached. Uniform cache policies compound this problem: fixed thresholds cause false positives in dense spaces and miss valid paraphrases in sparse spaces; fixed TTLs waste memory or serve stale data. This paper presents category-aware semantic caching where similarity thresholds, TTLs, and quotas vary by query category. We present a hybrid architecture separating in-memory HNSW search from external document storage, reducing miss cost from 30ms to 2ms. This reduction makes low-hit-rate categories economically viable (break-even at 3-5% versus 15-20%), enabling cache coverage across the entire workload distribution. Adaptive load-based policies extend this framework to respond to downstream model load, dynamically adjusting thresholds and TTLs to reduce traffic to overloaded models by 9-17% in theoretical projections.