Performance Analysis
LAMDA: A Longitudinal Android Malware Benchmark for Concept Drift Analysis
Haque, Md Ahsanul, Hossain, Ismail, Kamol, Md Mahmuduzzaman, Alam, Md Jahangir, Amalapuram, Suresh Kumar, Talukder, Sajedul, Rahman, Mohammad Saidur
Machine learning (ML)-based malware detection systems often fail to account for the dynamic nature of real-world training and test data distributions. In practice, these distributions evolve due to frequent changes in the Android ecosystem, adversarial development of new malware families, and the continuous emergence of both benign and malicious applications. Prior studies have shown that such concept drift -- distributional shifts in benign and malicious samples -- leads to significant degradation in detection performance over time. Despite the practical importance of this issue, existing datasets are often outdated and limited in temporal scope, diversity of malware families, and sample scale, making them insufficient for the systematic evaluation of concept drift in malware detection. To address this gap, we present LAMDA, the largest and most temporally diverse Android malware benchmark to date, designed specifically for concept drift analysis. LAMDA spans 12 years (2013-2025, excluding 2015), includes over 1 million samples (approximately 37% labeled as malware), and covers 1,380 malware families and 150,000 singleton samples, reflecting the natural distribution and evolution of real-world Android applications. We empirically demonstrate LAMDA's utility by quantifying the performance degradation of standard ML models over time and analyzing feature stability across years. As the most comprehensive Android malware dataset to date, LAMDA enables in-depth research into temporal drift, generalization, explainability, and evolving detection challenges. The dataset and code are available at: https://iqsec-lab.github.io/LAMDA/.
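As a rough illustration of what "quantifying performance degradation over time" can look like, the sketch below computes a per-year F1 score from labeled predictions. It is a minimal example under stated assumptions, not LAMDA's evaluation code: the function name `f1_by_year` and the `(year, y_true, y_pred)` record layout are hypothetical, and a detector trained on early years is assumed to have already produced the predictions.

```python
from collections import defaultdict


def f1_by_year(records):
    """Compute the malware-class F1 score separately for each year.

    records: iterable of (year, y_true, y_pred), where 1 = malware and
    0 = benign. This record layout is an assumption for illustration.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for year, y_true, y_pred in records:
        c = counts[year]  # touching the key registers the year even if all-negative
        if y_pred == 1 and y_true == 1:
            c["tp"] += 1
        elif y_pred == 1 and y_true == 0:
            c["fp"] += 1
        elif y_pred == 0 and y_true == 1:
            c["fn"] += 1

    scores = {}
    for year in sorted(counts):
        c = counts[year]
        denom = 2 * c["tp"] + c["fp"] + c["fn"]
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when there is nothing to score
        scores[year] = 2 * c["tp"] / denom if denom else 0.0
    return scores


# Toy predictions from a hypothetical detector trained on early data.
toy_records = [
    (2013, 1, 1), (2013, 0, 0),
    (2014, 1, 0), (2014, 1, 1), (2014, 0, 1),
]
yearly_f1 = f1_by_year(toy_records)
```

A downward trend in the returned scores across later years is one simple symptom of concept drift.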
Preserving AUC Fairness in Learning with Noisy Protected Groups
Wu, Mingyang, Lin, Li, Zhang, Wenbin, Wang, Xin, Yang, Zhenhuan, Hu, Shu
The Area Under the ROC Curve (AUC) is a key metric for classification, especially under class imbalance, with growing research focus on optimizing AUC over accuracy in applications like medical image analysis and deepfake detection. Fairness in AUC optimization has therefore become crucial, since biases can harm protected groups. While various fairness mitigation techniques exist, fairness considerations in AUC optimization remain in their early stages, with most research focusing on improving AUC fairness under the assumption of clean protected groups. However, these studies often overlook the impact of noisy protected groups, leading to fairness violations in practice. To address this, we propose the first robust AUC fairness approach under noisy protected groups with theoretical fairness guarantees using distributionally robust optimization. Extensive experiments on tabular and image datasets show that our method outperforms state-of-the-art approaches in preserving AUC fairness. The code is available at https://github.com/Purdue-M2/AUC_Fairness_with_Noisy_Groups.
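To make the notion of AUC disparity across protected groups concrete, here is a minimal sketch that computes AUC per group and reports the largest gap. This is only one simple way to measure disparity and is not claimed to be the fairness criterion or the robust optimization method used in the paper; the function names `auc` and `group_auc_gap` are hypothetical.

```python
def auc(scores, labels):
    """Pairwise AUC: probability that a random positive outranks a random negative.

    Ties count as half a win. Quadratic in the number of samples, which is
    fine for a small illustration.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def group_auc_gap(scores, labels, groups):
    """Return per-group AUCs and the gap between the best and worst group."""
    per_group = {}
    for g in set(groups):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        per_group[g] = auc([scores[i] for i in idx], [labels[i] for i in idx])
    return per_group, max(per_group.values()) - min(per_group.values())


# Toy data: the model ranks group "a" perfectly but group "b" poorly.
toy_scores = [0.9, 0.2, 0.8, 0.3, 0.6, 0.7, 0.4, 0.5]
toy_labels = [1, 0, 1, 0, 1, 0, 1, 0]
toy_groups = ["a"] * 4 + ["b"] * 4
per_group_auc, gap = group_auc_gap(toy_scores, toy_labels, toy_groups)
```

If the entries of `toy_groups` were themselves mislabeled, the measured gap could differ from the true one, which is the noisy-group setting the abstract targets.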
CLaDMoP: Learning Transferrable Models from Successful Clinical Trials via LLMs
Zhang, Yiqing, Liu, Xiaozhong, Murai, Fabricio
Many existing models for clinical trial outcome prediction are optimized using task-specific loss functions on trial phase-specific data. While this scheme may boost prediction for common diseases and drugs, it can hinder learning of generalizable representations, leading to more false positives/negatives. To address this limitation, we introduce CLaDMoP, a new pre-training approach for clinical trial outcome prediction, alongside the Successful Clinical Trials dataset (SCT), specifically designed for this task. CLaDMoP leverages a Large Language Model -- to encode trials' eligibility criteria -- linked to a lightweight Drug-Molecule branch through a novel multi-level fusion technique. To efficiently fuse long embeddings across levels, we incorporate a grouping block, drastically reducing computational overhead. CLaDMoP avoids reliance on task-specific objectives by pre-training on a "pair matching" proxy task. Compared to established zero-shot and few-shot baselines, our method significantly improves both PR-AUC and ROC-AUC, especially for phase I and phase II trials. We further evaluate and perform ablation on CLaDMoP after Parameter-Efficient Fine-Tuning, comparing it to state-of-the-art supervised baselines, including MEXA-CTP, on the Trial Outcome Prediction (TOP) benchmark. CLaDMoP achieves up to 10.5% improvement in PR-AUC and 3.6% in ROC-AUC, while attaining comparable F1 score to MEXA-CTP, highlighting its potential for clinical trial outcome prediction. Code and SCT dataset can be downloaded from https://github.com/murai-lab/CLaDMoP.
A Personalized Conversational Benchmark: Towards Simulating Personalized Conversations
Li, Li, Cai, Peilin, Rossi, Ryan A., Dernoncourt, Franck, Kveton, Branislav, Wu, Junda, Yu, Tong, Song, Linxin, Yang, Tiankai, Qin, Yuehan, Ahmed, Nesreen K., Basu, Samyadeep, Mukherjee, Subhojyoti, Zhang, Ruiyi, Hu, Zhengmian, Ni, Bo, Zhou, Yuxiao, Wang, Zichao, Huang, Yue, Wang, Yu, Zhang, Xiangliang, Yu, Philip S., Hu, Xiyang, Zhao, Yue
We present PersonaConvBench, a large-scale benchmark for evaluating personalized reasoning and generation in multi-turn conversations with large language models (LLMs). Unlike existing work that focuses on either personalization or conversational structure in isolation, PersonaConvBench integrates both, offering three core tasks: sentence classification, impact regression, and user-centric text generation across ten diverse Reddit-based domains. This design enables systematic analysis of how personalized conversational context shapes LLM outputs in realistic multi-user scenarios. We benchmark several commercial and open-source LLMs under a unified prompting setup and observe that incorporating personalized history yields substantial performance improvements, including a 198 percent relative gain over the best non-conversational baseline in sentiment classification. By releasing PersonaConvBench with evaluations and code, we aim to support research on LLMs that adapt to individual styles, track long-term context, and produce contextually rich, engaging responses.
Toward Malicious Clients Detection in Federated Learning
Dou, Zhihao, Wang, Jiaqi, Sun, Wei, Liu, Zhuqing, Fang, Minghong
Federated learning (FL) enables multiple clients to collaboratively train a global machine learning model without sharing their raw data. However, the decentralized nature of FL introduces vulnerabilities, particularly to poisoning attacks, where malicious clients manipulate their local models to disrupt the training process. While Byzantine-robust aggregation rules have been developed to mitigate such attacks, they remain inadequate against more advanced threats. In response, recent advancements have focused on FL detection techniques to identify potentially malicious participants. Unfortunately, these methods often misclassify numerous benign clients as threats or rely on unrealistic assumptions about the server's capabilities. In this paper, we propose a novel algorithm, SafeFL, specifically designed to accurately identify malicious clients in FL. The SafeFL approach involves the server collecting a series of global models to generate a synthetic dataset, which is then used to distinguish between malicious and benign models based on their behavior. Extensive testing demonstrates that SafeFL outperforms existing methods, offering superior efficiency and accuracy in detecting malicious clients.
Heterogeneous networks in drug-target interaction prediction
Molaee, Mohammad, Charkari, Nasrollah Moghadam, Ghaderi, Foad
Drug discovery requires a tremendous amount of time and cost. Computational drug-target interaction prediction, an important part of this process, can reduce these requirements by narrowing the search space for wet lab experiments. In this survey, we provide comprehensive details of graph machine learning-based methods in predicting drug-target interaction, as they have shown promising results in this field. These details include the overall framework, main contribution, datasets, and their source codes. The selected papers were mainly published from 2020 to 2024. Prior to discussing papers, we briefly introduce the datasets commonly used with these methods and measurements to assess their performance. Finally, future challenges and some crucial areas that need to be explored are discussed.
A Critical Evaluation of Defenses against Prompt Injection Attacks
Jia, Yuqi, Shao, Zedian, Liu, Yupei, Jia, Jinyuan, Song, Dawn, Gong, Neil Zhenqiang
Large Language Models (LLMs) are vulnerable to prompt injection attacks, and several defenses have recently been proposed, often claiming to mitigate these attacks successfully. However, we argue that existing studies lack a principled approach to evaluating these defenses. In this paper, we make the case for assessing defenses across two critical dimensions: (1) effectiveness, measured against both existing and adaptive prompt injection attacks involving diverse target and injected prompts, and (2) general-purpose utility, ensuring that the defense does not compromise the foundational capabilities of the LLM. Our critical evaluation reveals that prior studies have not followed such a comprehensive evaluation methodology. When assessed using this principled approach, we show that existing defenses are not as successful as previously reported. This work provides a foundation for evaluating future defenses and guiding their development. Our code and data are available at: https://github.com/PIEval123/PIEval.
SzCORE as a benchmark: report from the seizure detection challenge at the 2025 AI in Epilepsy and Neurological Disorders Conference
Dan, Jonathan, Shahbazinia, Amirhossein, Kechris, Christodoulos, Atienza, David
Reliable automatic seizure detection from long-term EEG remains a challenge, as current machine learning models often fail to generalize across patients or clinical settings. Manual EEG review remains the clinical standard, underscoring the need for robust models and standardized evaluation. To rigorously assess algorithm performance, we organized a challenge using a private dataset of continuous EEG recordings from 65 subjects (4,360 hours). Expert neurophysiologists annotated the data, providing ground truth for seizure events. Participants were required to detect seizure onset and duration, with evaluation based on event-based metrics, including sensitivity, precision, F1-score, and false positives per day. The SzCORE framework ensured standardized evaluation. The primary ranking criterion was the event-based F1-score, reflecting clinical relevance by balancing sensitivity and false positives. The challenge received 30 submissions from 19 teams, with 28 algorithms evaluated. Results revealed wide variability in performance, with a top F1-score of 43% (sensitivity 37%, precision 45%), highlighting the ongoing difficulty of seizure detection. The challenge also revealed a gap between reported performance and real-world evaluation, emphasizing the importance of rigorous benchmarking. Compared to previous challenges and commercial systems, the best-performing algorithm in this contest showed improved performance. Importantly, the challenge platform now supports continuous benchmarking, enabling reproducible research, integration of new datasets, and clinical evaluation of seizure detection algorithms using a standardized framework.
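The event-based metrics named above can be sketched as follows. This is a minimal illustration, not the SzCORE implementation: it assumes the counts of true-positive, false-positive, and missed events have already been obtained by some event-matching rule, and the function name `event_metrics` is hypothetical.

```python
def event_metrics(tp, fp, fn, recording_hours):
    """Event-based sensitivity, precision, F1, and false positives per day.

    tp, fp, fn: counts of detected, spurious, and missed seizure events.
    recording_hours: total duration of the scored recordings, in hours.
    How detections are matched to reference events is outside this sketch.
    """
    if recording_hours <= 0:
        raise ValueError("recording_hours must be positive")
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0  # harmonic mean of the two rates
    fp_per_day = fp / (recording_hours / 24.0)
    return {
        "sensitivity": sensitivity,
        "precision": precision,
        "f1": f1,
        "fp_per_day": fp_per_day,
    }


# Toy example: 3 detected seizures, 1 false alarm, 2 missed, over 48 hours.
toy_metrics = event_metrics(tp=3, fp=1, fn=2, recording_hours=48)
```

Ranking by F1 rather than sensitivity alone penalizes detectors that catch more seizures only by raising many false alarms.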
Generating Realistic Multi-Beat ECG Signals
Pöhl, Paul, Schlegel, Viktor, Li, Hao, Bharath, Anil
Generating synthetic ECG data has numerous applications in healthcare, from educational purposes to simulating scenarios and forecasting trends. While recent diffusion models excel at generating short ECG segments, they struggle with longer sequences needed for many clinical applications. This paper proposes a novel three-layer synthesis framework for generating realistic long-form ECG signals. We first generate high-fidelity single beats using a diffusion model, then synthesize inter-beat features preserving critical temporal dependencies, and finally assemble beats into coherent long sequences using feature-guided matching. Our comprehensive evaluation demonstrates that the resulting synthetic ECGs maintain both beat-level morphological fidelity and clinically relevant inter-beat relationships. In arrhythmia classification tasks, our long-form synthetic ECGs significantly outperform end-to-end long-form ECG generation using the diffusion model, highlighting their potential for increasing utility for downstream applications. The approach enables generation of unprecedented multi-minute ECG sequences while preserving essential diagnostic characteristics.
Interpretable Multi-Task PINN for Emotion Recognition and EDA Prediction
Understanding and predicting human emotional and physiological states using wearable sensors has critical applications in stress monitoring, mental health assessment, and affective computing. In this study, we present a novel Multi-Task Physics-Informed Neural Network (PINN) that simultaneously performs Electrodermal Activity (EDA) prediction and emotion classification using the publicly available WESAD dataset. Our model integrates psychological self-reports (PANAS and SAM) with a physics-inspired differential formulation of EDA dynamics, enforcing biophysically grounded constraints through a custom loss that balances data-driven learning and physiological interpretability. The architecture supports dual outputs -- regression for EDA and classification for emotional states -- trained under a unified multi-task framework. Evaluated via 5-fold cross-validation, the proposed method achieves an average EDA RMSE of 0.0362, Pearson correlation (r) of 0.9919, and F1-score of 94.08%, outperforming both classical baselines (e.g., SVR, XGBoost) and ablated variants such as emotion-only and EDA-only models. Comparative ablation and multi-task experiments show that including both physics constraints and emotion prediction enhances generalization, reduces overfitting, and leads to physiologically consistent outputs. Moreover, the learned physical parameters -- decay rate (α), emotion influence weights (β), and temporal scaling (γ) -- remain interpretable and stable across folds, confirming the alignment between the model's latent representation and known stress-response theory. This is the first work to introduce a multi-task PINN architecture for wearable affective computing, bridging black-box deep learning and domain knowledge. Our framework lays the groundwork for interpretable, multimodal, and deployable systems in healthcare and human-computer interaction.