AITopics

Machine learning (ML)-based malware detection systems often fail to account for the dynamic nature of real-world training and test data distributions. In practice, these distributions evolve due to frequent changes in the Android ecosystem, adversarial development of new malware families, and the continuous emergence of both benign and malicious applications. Prior studies have shown that such concept drift -- distributional shifts in benign and malicious samples, leads to significant degradation in detection performance over time. Despite the practical importance of this issue, existing datasets are often outdated and limited in temporal scope, diversity of malware families, and sample scale, making them insufficient for the systematic evaluation of concept drift in malware detection. To address this gap, we present LAMDA, the largest and most temporally diverse Android malware benchmark to date, designed specifically for concept drift analysis. LAMDA spans 12 years (2013-2025, excluding 2015), includes over 1 million samples (approximately 37% labeled as malware), and covers 1,380 malware families and 150,000 singleton samples, reflecting the natural distribution and evolution of real-world Android applications. We empirically demonstrate LAMDA's utility by quantifying the performance degradation of standard ML models over time and analyzing feature stability across years. As the most comprehensive Android malware dataset to date, LAMDA enables in-depth research into temporal drift, generalization, explainability, and evolving detection challenges. The dataset and code are available at: https://iqsec-lab.github.io/LAMDA/.

artificial intelligence, dataset, machine learning, (16 more...)

2505.18551

Genre: Research Report > New Finding (0.67)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Communications > Mobile (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
(2 more...)

Preserving AUC Fairness in Learning with Noisy Protected Groups

Wu, Mingyang, Lin, Li, Zhang, Wenbin, Wang, Xin, Yang, Zhenhuan, Hu, Shu

The Area Under the ROC Curve (AUC) is a key metric for classification, especially under class imbalance, with growing research focus on optimizing AUC over accuracy in applications like medical image analysis and deepfake detection. This leads to fairness in AUC optimization becoming crucial as biases can impact protected groups. While various fairness mitigation techniques exist, fairness considerations in AUC optimization remain in their early stages, with most research focusing on improving AUC fairness under the assumption of clean protected groups. However, these studies often overlook the impact of noisy protected groups, leading to fairness violations in practice. To address this, we propose the first robust AUC fairness approach under noisy protected groups with fairness theoretical guarantees using distributionally robust optimization. Extensive experiments on tabular and image datasets show that our method outperforms state-of-the-art approaches in preserving AUC fairness. The code is in https://github.com/Purdue-M2/AUC_Fairness_with_Noisy_Groups.

artificial intelligence, fairness, machine learning, (11 more...)

2505.18532

Country: North America (0.28)

Genre: Research Report > New Finding (1.00)

Industry:

Information Technology > Security & Privacy (0.68)
Health & Medicine > Diagnostic Medicine > Imaging (0.48)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
(3 more...)

Zhang, Yiqing, Liu, Xiaozhong, Murai, Fabricio

CLaDMoP: Learning Transferrable Models from Successful Clinical Trials via LLMs

Many existing models for clinical trial outcome prediction are optimized using task-specific loss functions on trial phase-specific data. While this scheme may boost prediction for common diseases and drugs, it can hinder learning of generalizable representations, leading to more false positives/negatives. To address this limitation, we introduce CLaDMoP, a new pre-training approach for clinical trial outcome prediction, alongside the Successful Clinical Trials dataset(SCT), specifically designed for this task. CLaDMoP leverages a Large Language Model-to encode trials' eligibility criteria-linked to a lightweight Drug-Molecule branch through a novel multi-level fusion technique. To efficiently fuse long embeddings across levels, we incorporate a grouping block, drastically reducing computational overhead. CLaDMoP avoids reliance on task-specific objectives by pre-training on a "pair matching" proxy task. Compared to established zero-shot and few-shot baselines, our method significantly improves both PR-AUC and ROC-AUC, especially for phase I and phase II trials. We further evaluate and perform ablation on CLaDMoP after Parameter-Efficient Fine-Tuning, comparing it to state-of-the-art supervised baselines, including MEXA-CTP, on the Trial Outcome Prediction(TOP) benchmark. CLaDMoP achieves up to 10.5% improvement in PR-AUC and 3.6% in ROC-AUC, while attaining comparable F1 score to MEXA-CTP, highlighting its potential for clinical trial outcome prediction. Code and SCT dataset can be downloaded from https://github.com/murai-lab/CLaDMoP.

cladmop, large language model, machine learning, (19 more...)

2505.18527

Country: North America > United States (0.94)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.66)

A Personalized Conversational Benchmark: Towards Simulating Personalized Conversations

Li, Li, Cai, Peilin, Rossi, Ryan A., Dernoncourt, Franck, Kveton, Branislav, Wu, Junda, Yu, Tong, Song, Linxin, Yang, Tiankai, Qin, Yuehan, Ahmed, Nesreen K., Basu, Samyadeep, Mukherjee, Subhojyoti, Zhang, Ruiyi, Hu, Zhengmian, Ni, Bo, Zhou, Yuxiao, Wang, Zichao, Huang, Yue, Wang, Yu, Zhang, Xiangliang, Yu, Philip S., Hu, Xiyang, Zhao, Yue

We present PersonaConvBench, a large-scale benchmark for evaluating personalized reasoning and generation in multi-turn conversations with large language models (LLMs). Unlike existing work that focuses on either personalization or conversational structure in isolation, PersonaConvBench integrates both, offering three core tasks: sentence classification, impact regression, and user-centric text generation across ten diverse Reddit-based domains. This design enables systematic analysis of how personalized conversational context shapes LLM outputs in realistic multi-user scenarios. We benchmark several commercial and open-source LLMs under a unified prompting setup and observe that incorporating personalized history yields substantial performance improvements, including a 198 percent relative gain over the best non-conversational baseline in sentiment classification. By releasing PersonaConvBench with evaluations and code, we aim to support research on LLMs that adapt to individual styles, track long-term context, and produce contextually rich, engaging responses.

classification, large language model, machine learning, (20 more...)

2505.14106

Country: North America > United States > California (0.28)

Genre: Research Report > New Finding (0.46)

Industry:

Health & Medicine (0.46)
Education (0.46)
Media > News (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.72)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)

Toward Malicious Clients Detection in Federated Learning

Dou, Zhihao, Wang, Jiaqi, Sun, Wei, Liu, Zhuqing, Fang, Minghong

Federated learning (FL) enables multiple clients to collaboratively train a global machine learning model without sharing their raw data. However, the decentralized nature of FL introduces vulnerabilities, particularly to poisoning attacks, where malicious clients manipulate their local models to disrupt the training process. While Byzantine-robust aggregation rules have been developed to mitigate such attacks, they remain inadequate against more advanced threats. In response, recent advancements have focused on FL detection techniques to identify potentially malicious participants. Unfortunately, these methods often misclassify numerous benign clients as threats or rely on unrealistic assumptions about the server's capabilities. In this paper, we propose a novel algorithm, SafeFL, specifically designed to accurately identify malicious clients in FL. The SafeFL approach involves the server collecting a series of global models to generate a synthetic dataset, which is then used to distinguish between malicious and benign models based on their behavior. Extensive testing demonstrates that SafeFL outperforms existing methods, offering superior efficiency and accuracy in detecting malicious clients.

artificial intelligence, inductive learning, machine learning, (17 more...)

2505.0911

Country:

Asia (0.70)
North America > United States > Texas (0.27)

Genre: Research Report (1.00)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.92)
Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.67)

Molaee, Mohammad, Charkari, Nasrollah Moghadam, Ghaderi, Foad

Heterogeneous networks in drug-target interaction prediction

D rug discovery requires a tremendous amount of time and cost. Computational drug - target interaction prediction, a n important part of this process, can reduce these requirements by narrowing the search space for wet lab experiments. In this survey, we provid e comprehensive details of graph machine learning - based methods in predicting drug - target interaction, as they have shown promising results in this field. These details include the overall framework, main contribution, dataset s, and their source code s . The selected papers were mainly published from 2020 to 2024 . Prior to discussing papers, we briefly introduce the datasets commonly used with these methods and measurements to assess their performance. Finally, future challenges and some crucial areas that need to be explored are discussed.

bioinformatics, data mining, machine learning, (22 more...)

2504.16152

Country: Asia (0.28)

Genre:

Research Report (1.00)
Overview (0.68)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(5 more...)

A Critical Evaluation of Defenses against Prompt Injection Attacks

Jia, Yuqi, Shao, Zedian, Liu, Yupei, Jia, Jinyuan, Song, Dawn, Gong, Neil Zhenqiang

Large Language Models (LLMs) are vulnerable to prompt injection attacks, and several defenses have recently been proposed, often claiming to mitigate these attacks successfully. However, we argue that existing studies lack a principled approach to evaluating these defenses. In this paper, we argue the need to assess defenses across two critical dimensions: (1) effectiveness, measured against both existing and adaptive prompt injection attacks involving diverse target and injected prompts, and (2) general-purpose utility, ensuring that the defense does not compromise the foundational capabilities of the LLM. Our critical evaluation reveals that prior studies have not followed such a comprehensive evaluation methodology. When assessed using this principled approach, we show that existing defenses are not as successful as previously reported. This work provides a foundation for evaluating future defenses and guiding their development. Our code and data are available at: https://github.com/PIEval123/PIEval.

effectiveness, large language model, machine learning, (17 more...)

2505.18333

Genre: Research Report > New Finding (0.68)

Industry: Information Technology > Security & Privacy (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.75)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.47)

Dan, Jonathan, Shahbazinia, Amirhossein, Kechris, Christodoulos, Atienza, David

SzCORE as a benchmark: report from the seizure detection challenge at the 2025 AI in Epilepsy and Neurological Disorders Conference

Reliable automatic seizure detection from long-term EEG remains a challenge, as current machine learning models often fail to generalize across patients or clinical settings. Manual EEG review remains the clinical standard, underscoring the need for robust models and standardized evaluation. To rigorously assess algorithm performance, we organized a challenge using a private dataset of continuous EEG recordings from 65 subjects (4,360 hours). Expert neurophysiologists annotated the data, providing ground truth for seizure events. Participants were required to detect seizure onset and duration, with evaluation based on event-based metrics, including sensitivity, precision, F1-score, and false positives per day. The SzCORE framework ensured standardized evaluation. The primary ranking criterion was the event-based F1-score, reflecting clinical relevance by balancing sensitivity and false positives. The challenge received 30 submissions from 19 teams, with 28 algorithms evaluated. Results revealed wide variability in performance, with a top F1-score of 43% (sensitivity 37%, precision 45%), highlighting the ongoing difficulty of seizure detection. The challenge also revealed a gap between reported performance and real-world evaluation, emphasizing the importance of rigorous benchmarking. Compared to previous challenges and commercial systems, the best-performing algorithm in this contest showed improved performance. Importantly, the challenge platform now supports continuous benchmarking, enabling reproducible research, integration of new datasets, and clinical evaluation of seizure detection algorithms using a standardized framework.

algorithm, artificial intelligence, machine learning, (16 more...)

2505.18191

Country: Europe (0.46)

Genre: Research Report (1.00)

Industry: Health & Medicine > Therapeutic Area > Neurology > Epilepsy (0.85)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.72)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Generating Realistic Multi-Beat ECG Signals

Pöhl, Paul, Schlegel, Viktor, Li, Hao, Bharath, Anil

Generating synthetic ECG data has numerous applications in healthcare, from educational purposes to simulating scenarios and forecasting trends. While recent diffusion models excel at generating short ECG segments, they struggle with longer sequences needed for many clinical applications. This paper proposes a novel three-layer synthesis framework for generating realistic long-form ECG signals. We first generate high-fidelity single beats using a diffusion model, then synthesize inter-beat features preserving critical temporal dependencies, and finally assemble beats into coherent long sequences using feature-guided matching. Our comprehensive evaluation demonstrates that the resulting synthetic ECGs maintain both beat-level morphological fidelity and clinically relevant inter-beat relationships. In arrhythmia classification tasks, our long-form synthetic ECGs significantly outperform end-to-end long-form ECG generation using the diffusion model, highlighting their potential for increasing utility for downstream applications. The approach enables generation of unprecedented multi-minute ECG sequences while preserving essential diagnostic characteristics.

artificial intelligence, diffusion model, machine learning, (14 more...)

2505.18189

Country: Europe > United Kingdom (0.28)

Genre: Research Report (1.00)

Industry:

Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)
Health & Medicine > Diagnostic Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Interpretable Multi-Task PINN for Emotion Recognition and EDA Prediction

Mandal, Nischal

Understanding and predicting human emotional and physiological states using wearable sensors has critical applications in stress monitoring, mental health assessment, and affective computing. In this study, we present a novel Multi - Task Physics - Informed Neural Network (PINN) that simultaneously performs Electrodermal Activity (EDA) prediction and emotion classification using the publicly available WESAD dataset. Our model integrates psychological self - reports (PANAS and SAM) with a physics - inspired differential formulation of EDA dynamics, enforcing biophysically grounded constraints through a custom loss that balances data - driven learning and physiological interpretability. The architecture supports dual outputs -- regression for EDA and classification for emotional states -- trained under a unified multi - task framework. Evaluated via 5 - fold cross - validation, the proposed method achieves an average EDA RMSE of 0.0362, Pearson correlation (r) of 0.9919, and F1 - score of 94.08%, outperforming both classical baselines (e.g., SVR, XGBoost) and ablated variants such as emotion - only and EDA - only models. Comparative ablation and multi - task experiments show that including both physics constraints and emotion prediction enhances generalization, reduces overfitting, and leads to physiologically consistent outputs. Moreover, the learned physical parameters -- decay rate (α), emotion influence weights (β), and temporal scaling (γ) -- remain interpretable and stable across folds, confirming the alignment between the model's latent representation and known stress - response theory. This is the first work to introduce a multi - task PINN architecture for wearable affective computing, bridging black - box deep learning and domain knowledge. Our framework lays the groundwork for interpretable, multimodal, and deployable systems in healthcare and human - computer interaction.

artificial intelligence, machine learning, physics, (16 more...)

2505.18169

Genre: Research Report > New Finding (0.34)

Industry:

Health & Medicine > Consumer Health (0.87)
Health & Medicine > Therapeutic Area > Psychiatry/Psychology > Mental Health (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)