Performance Analysis
CCE: Confidence-Consistency Evaluation for Time Series Anomaly Detection
Zhong, Zhijie, Yu, Zhiwen, Cheung, Yiu-ming, Yang, Kaixiang
Time Series Anomaly Detection metrics serve as crucial tools for model evaluation. However, existing metrics suffer from several limitations: insufficient discriminative power, strong hyperparameter dependency, sensitivity to perturbations, and high computational overhead. This paper introduces Confidence-Consistency Evaluation (CCE), a novel evaluation metric that simultaneously measures prediction confidence and uncertainty consistency. By employing Bayesian estimation to quantify the uncertainty of anomaly scores, we construct both global and event-level confidence and consistency scores for model predictions, resulting in a concise CCE metric. Theoretically and experimentally, we demonstrate that CCE possesses strict boundedness, Lipschitz robustness against score perturbations, and linear time complexity $\mathcal{O}(n)$. Furthermore, we establish RankEval, a benchmark for comparing the ranking capabilities of various metrics. RankEval represents the first standardized and reproducible evaluation pipeline that enables objective comparison of evaluation metrics. Both CCE and RankEval implementations are fully open-source.
Diagnosing Psychiatric Patients: Can Large Language and Machine Learning Models Perform Effectively in Emergency Cases?
Ahammed, Abu Shad, Mukherjee, Sayeri, Obermaisser, Roman
Mental disorders are clinically significant patterns of behavior that are associated with stress and/or impairment in social, occupational, or family activities. People suffering from such disorders are often misjudged and poorly diagnosed due to a lack of visible symptoms compared to other health complications. During emergency situations, identifying psychiatric issues is that's why challenging but highly required to save patients. In this paper, we have conducted research on how traditional machine learning and large language models (LLM) can assess these psychiatric patients based on their behavioral patterns to provide a diagnostic assessment. Data from emergency psychiatric patients were collected from a rescue station in Germany. Various machine learning models, including Llama 3.1, were used with rescue patient data to assess if the predictive capabilities of the models can serve as an efficient tool for identifying patients with unhealthy mental disorders, especially in rescue cases.
Multitask Battery Management with Flexible Pretraining
Lu, Hong, Chen, Jiali, Zhang, Jingzhao, He, Guannan, Han, Xuebing, Ouyang, Minggao
Industrial-scale battery management involves various types of tasks, such as estimation, prediction, and system-level diagnostics. Each task employs distinct data across temporal scales, sensor resolutions, and data channels. Building task-specific methods requires a great deal of data and engineering effort, which limits the scalability of intelligent battery management. Here we present the Flexible Masked Autoencoder (FMAE), a flexible pretraining framework that can learn with missing battery data channels and capture inter-correlations across data snippets. FMAE learns unified battery representations from heterogeneous data and can be adopted by different tasks with minimal data and engineering efforts. Experimentally, FMAE consistently outperforms all task-specific methods across five battery management tasks with eleven battery datasets. On remaining life prediction tasks, FMAE uses 50 times less inference data while maintaining state-of-the-art results. Moreover, when real-world data lack certain information, such as system voltage, FMAE can still be applied with marginal performance impact, achieving comparable results with the best hand-crafted features. FMAE demonstrates a practical route to a flexible, data-efficient model that simplifies real-world multi-task management of dynamical systems.
Predicting NCAP Safety Ratings: An Analysis of Vehicle Characteristics and ADAS Features Using Machine Learning
Kunwar, Raunak, LeBoulluec, Aera Kim
Vehicle safety assessment is crucial for consumer information and regulatory oversight. The New Car Assessment Program (NCAP) assigns standardized safety ratings, which traditionally emphasize passive safety measures but now include active safety technologies such as Advanced Driver-Assistance Systems (ADAS). It is crucial to understand how these various systems interact empirically. This study explores whether particular ADAS features like Forward Collision Warning, Lane Departure Warning, Crash Imminent Braking, and Blind Spot Detection, together with established vehicle attributes (e.g., Curb Weight, Model Year, Vehicle Type, Drive Train), can reliably predict a vehicle's likelihood of earning the highest (5-star) overall NCAP rating. Using a publicly available dataset derived from NCAP reports that contain approximately 5,128 vehicle variants spanning model years 2011-2025, we compared four different machine learning models: logistic regression, random forest, gradient boosting, and support vector classifier (SVC) using a 5-fold stratified cross-validation approach. The two best-performing algorithms (random forest and gradient boost) were hyperparameter optimized using RandomizedSearchCV. Analysis of feature importance showed that basic vehicle characteristics, specifically curb weight and model year, dominated predictive capability, contributing more than 55% of the feature relevance of the Random Forest model. However, the inclusion of ADAS features also provided meaningful predictive contributions. The optimized Random Forest model achieved robust results on a held-out test set, with an accuracy of 89.18% and a ROC AUC of 0.9586. This research reveals the use of machine learning to analyze large-scale NCAP data and highlights the combined predictive importance of both established vehicle parameters and modern ADAS features to achieve top safety ratings.
Predicting Multi-Type Talented Students in Secondary School Using Semi-Supervised Machine Learning
Zheng, Xinzhe, Yang, Zhen-Qun, Cao, Jiannong, Cheng, Jiabei
--T alent identification plays a critical role in promoting student development. However, traditional approaches often rely on manual processes or focus narrowly on academic achievement, and typically delaying intervention until the higher education stage. This oversight overlooks diverse non-academic talents and misses opportunities for early intervention. T o address this gap, this study introduces T alentPredictor, a novel semi-supervised multi-modal neural network that combines Transformer, LSTM, and ANN architectures. This model is designed to predict seven different talent types--academic, sport, art, leadership, service, technology, and others--in secondary school students within an offline educational setting. Drawing on existing offline educational data from 1,041 local secondary students, T alentPredictor overcomes the limitations of traditional talent identification methods. By clustering various award records into talent categories and extracting features from students' diverse learning behaviors, it achieves high prediction accuracy (0.908 classification accuracy, 0.908 ROCAUC). This demonstrates the potential of machine learning to identify diverse talents early in student development. ALENT is a critical component in human society. It is indispensable to the development of societies and the competitiveness of countries. Last but not least, talent is always in high demand. Thus, nurturing talent is the top priority for every part of the earth, and in it, talent identification is the foundation, as you must have a target individual to nurture talent. Traditional talent identification aims to give students tests that exceed their current level. For example, give grade eight students college admissions tests and use the result of the tough test as a talent score.
ST-Raptor: LLM-Powered Semi-Structured Table Question Answering
Tang, Zirui, Niu, Boyu, Zhou, Xuanhe, Li, Boxiu, Zhou, Wei, Wang, Jiannan, Li, Guoliang, Zhang, Xinyi, Wu, Fan
Semi-structured tables, widely used in real-world applications (e.g., financial reports, medical records, transactional orders), often involve flexible and complex layouts (e.g., hierarchical headers and merged cells). These tables generally rely on human analysts to interpret table layouts and answer relevant natural language questions, which is costly and inefficient. To automate the procedure, existing methods face significant challenges. First, methods like NL2SQL require converting semi-structured tables into structured ones, which often causes substantial information loss. Second, methods like NL2Code and multi-modal LLM QA struggle to understand the complex layouts of semi-structured tables and cannot accurately answer corresponding questions. To this end, we propose ST-Raptor, a tree-based framework for semi-structured table question answering using large language models. First, we introduce the Hierarchical Orthogonal Tree (HO-Tree), a structural model that captures complex semi-structured table layouts, along with an effective algorithm for constructing the tree. Second, we define a set of basic tree operations to guide LLMs in executing common QA tasks. Given a user question, ST-Raptor decomposes it into simpler sub-questions, generates corresponding tree operation pipelines, and conducts operation-table alignment for accurate pipeline execution. Third, we incorporate a two-stage verification mechanism: forward validation checks the correctness of execution steps, while backward validation evaluates answer reliability by reconstructing queries from predicted answers. To benchmark the performance, we present SSTQA, a dataset of 764 questions over 102 real-world semi-structured tables. Experiments show that ST-Raptor outperforms nine baselines by up to 20% in answer accuracy. The code is available at https://github.com/weAIDB/ST-Raptor.
Label Unification for Cross-Dataset Generalization in Cybersecurity NER
Jalocha, Maciej, Schmidt, Johan Hausted, Michelseen, William
The field of cybersecurity NER lacks standardized labels, making it challenging to combine datasets. We investigate label unification across four cybersecurity datasets to increase data resource usability. We perform a coarse-grained label unification and conduct pairwise cross-dataset evaluations using BiLSTM models. Qualitative analysis of predictions reveals errors, limitations, and dataset differences. To address unification limitations, we propose alternative architectures including a multihead model and a graph-based transfer model. Results show that models trained on unified datasets generalize poorly across datasets. The multihead model with weight sharing provides only marginal improvements over unified training, while our graph-based transfer model built on BERT-base-NER shows no significant performance gains compared BERT-base-NER.
CytoDiff: AI-Driven Cytomorphology Image Synthesis for Medical Diagnostics
Boada, Jan Carreras, Umer, Rao Muhammad, Marr, Carsten
Biomedical datasets are often constrained by stringent privacy requirements and frequently suffer from severe class imbalance. These two aspects hinder the development of accurate machine learning models. While generative AI offers a promising solution, producing synthetic images of sufficient quality for training robust classifiers remains challenging. This work addresses the classification of individual white blood cells, a critical task in diagnosing hematological malignancies such as acute myeloid leukemia (AML). We introduce CytoDiff, a stable diffusion model fine-tuned with LoRA weights and guided by few-shot samples that generates high-fidelity synthetic white blood cell images. Our approach demonstrates substantial improvements in classifier performance when training data is limited. Using a small, highly imbalanced real dataset, the addition of 5,000 synthetic images per class improved ResNet classifier accuracy from 27\% to 78\% (+51\%). Similarly, CLIP-based classification accuracy increased from 62\% to 77\% (+15\%). These results establish synthetic image generation as a valuable tool for biomedical machine learning, enhancing data coverage and facilitating secure data sharing while preserving patient privacy. Paper code is publicly available at https://github.com/JanCarreras24/CytoDiff.
LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation
Zhang, Ming, Shen, Yujiong, Li, Zelin, Sha, Huayu, Hu, Binze, Wang, Yuhui, Huang, Chenhao, Liu, Shichun, Tong, Jingqi, Jiang, Changhao, Chai, Mingxu, Xi, Zhiheng, Dou, Shihan, Gui, Tao, Zhang, Qi, Huang, Xuanjing
Evaluating large language models (LLMs) in medicine is crucial because medical applications require high accuracy with little room for error. Current medical benchmarks have three main types: medical exam-based, comprehensive medical, and specialized assessments. However, these benchmarks have limitations in question design (mostly multiple-choice), data sources (often not derived from real clinical scenarios), and evaluation methods (poor assessment of complex reasoning). To address these issues, we present LLMEval-Med, a new benchmark covering five core medical areas, including 2,996 questions created from real-world electronic health records and expert-designed clinical scenarios. We also design an automated evaluation pipeline, incorporating expert-developed checklists into our LLM-as-Judge framework. Furthermore, our methodology validates machine scoring through human-machine agreement analysis, dynamically refining checklists and prompts based on expert feedback to ensure reliability. We evaluate 13 LLMs across three categories (specialized medical models, open-source models, and closed-source models) on LLMEval-Med, providing valuable insights for the safe and effective deployment of LLMs in medical domains. The dataset is released in https://github.com/llmeval/LLMEval-Med.
A versatile foundation model for cine cardiac magnetic resonance image analysis tasks
Fu, Yunguan, Bai, Wenjia, Yi, Weixi, Manisty, Charlotte, Bhuva, Anish N, Treibel, Thomas A, Moon, James C, Clarkson, Matthew J, Davies, Rhodri Huw, Hu, Yipeng
Here we present a versatile foundation model that can perform a range of clinically-relevant image analysis tasks, including segmentation, landmark localisation, diagnosis, and prognostication. A multi-view convolution-transformer masked autoencoder, named as CineMA, was trained on 15 million cine images from 74,916 subjects. The model was validated on multiple image analysis tasks and compared to existing models on >4,500 images from eight independent datasets with diverse population characteristics, representing the largest benchmark study for cine CMR so far. CineMA consistently outperformed conventional convolutional neural networks (CNNs) in delineating ventricular boundaries and estimating ejection fraction, a key measure of cardiac function. The improved performance was preserved, even when the model only used half of fine-tuning data. CineMA also surpassed CNNs in disease detection and matched their performance in long-axis function measurement. Interestingly, we found that CineMA can also detect cardiac changes in systemic diseases, such as diabetes, hypertension and cancer, and can also predict mortality. Finally, we assessed model fairness and demonstrated consistent model performance across demographic subgroups. These findings highlight CineMA's accuracy, learning efficiency, adaptability, and fairness, underscoring its potential as a foundation model for automated cardiac image analysis to support clinical workflow and cardiovascular research. All training and inference code and models are made publicly available at https://github.com/mathpluscode/CineMA.