engGNN: A Dual-Graph Neural Network for Omics-Based Disease Classification and Feature Selection

Yang, Tiantian, Wang, Yuxuan, Zhou, Zhenwei, Liu, Ching-Ti

arXiv.org Machine Learning

Omics data, such as transcriptomics, proteomics, and metabolomics, provide critical insights into disease mechanisms and clinical outcomes. However, their high dimensionality, small sample sizes, and intricate biological networks pose major challenges for reliable prediction and meaningful interpretation. Graph Neural Networks (GNNs) offer a promising way to integrate prior knowledge by encoding feature relationships as graphs. Yet, existing methods typically rely solely on either an externally curated feature graph or a data-driven generated one, which limits their ability to capture complementary information. To address this, we propose the external and generated Graph Neural Network (engGNN), a dual-graph framework that jointly leverages both external known biological networks and data-driven generated graphs. Specifically, engGNN constructs a biologically informed undirected feature graph from established network databases and complements it with a directed feature graph derived from tree-ensemble models. This dual-graph design produces more comprehensive embeddings, thereby improving predictive performance and interpretability. Through extensive simulations and real-world applications to gene expression data, engGNN consistently outperforms state-of-the-art baselines. Beyond classification, engGNN provides interpretable feature importance scores that facilitate biologically meaningful discoveries, such as pathway enrichment analysis. Taken together, these results highlight engGNN as a robust, flexible, and interpretable framework for disease classification and biomarker discovery in high-dimensional omics contexts.
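The dual-graph idea, merging an externally curated undirected feature graph with a data-driven directed one, can be sketched generically. This is an illustration only, not the authors' implementation: the gene names, edge sets, and the simple union merge rule are all assumptions.

```python
# Minimal sketch of combining an external undirected feature graph with
# a generated directed feature graph into one directed edge set.

def undirected_to_directed(edges):
    """Expand each undirected edge (a, b) into both directed arcs."""
    out = set()
    for a, b in edges:
        out.add((a, b))
        out.add((b, a))
    return out

def merge_graphs(external_undirected, generated_directed):
    """Union the two edge sets so each feature node sees neighbors
    from both prior knowledge and the data-driven graph."""
    return undirected_to_directed(external_undirected) | set(generated_directed)

# Toy edges: "external" from a curated network database, "generated"
# from, e.g., tree-ensemble split co-occurrence (both invented here).
external = {("TP53", "MDM2"), ("EGFR", "KRAS")}
generated = {("EGFR", "TP53")}
merged = merge_graphs(external, generated)
```

Keeping direction on the generated edges preserves the asymmetric feature relationships that a tree ensemble can expose, while the external edges contribute both directions.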


Benchmarking data encoding methods in Quantum Machine Learning

Zang, Orlane, Barrué, Grégoire, Quertier, Tony

arXiv.org Artificial Intelligence

Quantum Machine Learning (QML) is a research area that focuses on the development of Machine Learning (ML) algorithms that can be executed by a quantum computer. By taking advantage of quantum phenomena such as superposition and entanglement, the incorporation of quantum computing in ML aims to leverage the power of the quantum computer to improve existing classical ML algorithms [1]. The process of manipulating the states of qubits by arbitrarily changing the gate parameters to obtain a desired result is closely related to the training process of machine learning algorithms. For any specific problem, a QML algorithm can be designed as a quantum circuit with a sequence of different quantum gate operations [2][3]. However, Noisy Intermediate-Scale Quantum (NISQ) computers are limited in resources and subject to sources of error, such as noise induced by each quantum operation [4][5]. This makes it difficult to develop QML algorithms that perform as well as or better than conventional ones. Hence, developing QML algorithms with good performance, despite today's limited resources, has become a major challenge. To achieve this, it is important to examine several aspects of a QML task, such as the data encoding step. Quantum encoding involves the conversion of classical information into quantum states, enabling QML algorithms to operate efficiently.
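As one concrete example of a quantum encoding scheme (a generic illustration, not a method specific to this benchmark), amplitude encoding maps a classical vector onto the amplitudes of a quantum state, which requires normalizing it to unit length so that the squared amplitudes form a probability distribution:

```python
import math

def amplitude_encode(x):
    """Normalize a classical vector so it can serve as the amplitude
    vector of a quantum state (sum of squared amplitudes must be 1).
    A vector of length 2**n fits the state space of n qubits."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm == 0:
        raise ValueError("cannot encode the zero vector")
    return [v / norm for v in x]

# Four amplitudes fit two qubits; squared amplitudes give the
# measurement probabilities (here 9/25 and 16/25).
state = amplitude_encode([3.0, 0.0, 4.0, 0.0])
```

Amplitude encoding packs 2^n classical values into n qubits, which is exactly the kind of resource trade-off a NISQ-era benchmark of encoding methods has to weigh against circuit depth.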


Llama-based source code vulnerability detection: Prompt engineering vs Fine tuning

Ouchebara, Dyna Soumhane, Dupont, Stéphane

arXiv.org Artificial Intelligence

The significant increase in software production, driven by the acceleration of development cycles over the past two decades, has led to a steady rise in software vulnerabilities, as shown by statistics published yearly by the CVE program. Automating the source code vulnerability detection (CVD) process has thus become essential, and several methods have been proposed, ranging from well-established program analysis techniques to more recent AI-based methods. Our research investigates Large Language Models (LLMs), which are considered among the most performant AI models to date, for the CVD task. The objective is to study their performance and apply different state-of-the-art techniques to enhance their effectiveness for this task. We explore various fine-tuning and prompt engineering settings. In particular, we propose a novel approach for fine-tuning LLMs, which we call Double Fine-tuning, and also test the understudied Test-Time fine-tuning approach. We leverage the recent open-source Llama-3.1 8B, with source code samples extracted from the BigVul and PrimeVul datasets. Our conclusions highlight the importance of fine-tuning for this task, the performance of Double Fine-tuning, and the potential of Llama models for CVD. Though prompting proved ineffective, Retrieval-Augmented Generation (RAG) performed relatively well as an example selection technique. Overall, some of our research questions have been answered, while others remain open, leaving several directions for future work. The code repository is available at: https://github.com/DynaSoumhaneOuchebara/Llama-based-vulnerability-detection.


Cytoplasmic Strings Analysis in Human Embryo Time-Lapse Videos using Deep Learning Framework

Sohail, Anabia, Alansari, Mohamad, Abughali, Ahmed, Chehab, Asmaa, Ahmed, Abdelfatah, Velayudhan, Divya, Javed, Sajid, Marzouqi, Hasan Al, Al-Sumaiti, Ameena Saad, Kashir, Junaid, Werghi, Naoufel

arXiv.org Artificial Intelligence

Infertility is a major global health issue, and while in-vitro fertilization has improved treatment outcomes, embryo selection remains a critical bottleneck. Time-lapse imaging (TLI) enables continuous, non-invasive monitoring of embryo development, yet most automated assessment methods rely solely on conventional morphokinetic features and overlook emerging biomarkers. Cytoplasmic Strings (CS), thin filamentous structures connecting the inner cell mass and trophectoderm in expanded blastocysts, have been associated with faster blastocyst formation, higher blastocyst grades, and improved viability. However, CS assessment currently depends on manual visual inspection, which is labor-intensive, subjective, and hampered by the subtle visual appearance of CS structures. In this work, we present, to the best of our knowledge, the first computational framework for CS analysis in human IVF embryos. We first design a human-in-the-loop annotation pipeline to curate a biologically validated CS dataset from TLI videos, comprising 13,568 frames with highly sparse CS-positive instances. Building on this dataset, we propose a two-stage deep learning framework that (i) classifies CS presence at the frame level and (ii) localizes CS regions in positive cases. To address severe imbalance and feature uncertainty, we introduce the Novel Uncertainty-aware Contractive Embedding (NUCE) loss, which couples confidence-aware reweighting with an embedding contraction term to form compact, well-separated class clusters. NUCE consistently improves F1-score across five transformer backbones, while RF-DETR-based localization achieves state-of-the-art (SOTA) detection performance for thin, low-contrast CS structures. The source code will be made publicly available at: https://github.com/HamadYA/CS_Detection.


FedLAD: A Modular and Adaptive Testbed for Federated Log Anomaly Detection

Liao, Yihan, Keung, Jacky, Mao, Zhenyu, Zhang, Jingyu, Li, Jialong

arXiv.org Artificial Intelligence

Log-based anomaly detection (LAD) is critical for ensuring the reliability of large-scale distributed systems. However, most existing LAD approaches assume centralized training, which is often impractical due to privacy constraints and the decentralized nature of system logs. While federated learning (FL) offers a promising alternative, there is a lack of dedicated testbeds tailored to the needs of LAD in federated settings. To address this, we present FedLAD, a unified platform for training and evaluating LAD models under FL constraints. FedLAD supports plug-and-play integration of diverse LAD models, benchmark datasets, and aggregation strategies, while offering runtime support for validation logging (self-monitoring), parameter tuning (self-configuration), and adaptive strategy control (self-adaptation). By enabling reproducible and scalable experimentation, FedLAD bridges the gap between FL frameworks and LAD requirements, providing a solid foundation for future research. Project code is publicly available at: https://github.com/AA-cityu/FedLAD.
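Federated training of a LAD model relies on an aggregation strategy; the canonical one is FedAvg, a sample-weighted average of client parameters. The sketch below is a generic FedAvg illustration, not FedLAD's actual code, and the parameter vectors and client sizes are invented:

```python
def fedavg(client_params, client_sizes):
    """Aggregate client parameter vectors with a weighted average,
    where each client's weight is its local sample count (FedAvg)."""
    total = sum(client_sizes)
    dim = len(client_params[0])
    agg = [0.0] * dim
    for params, n in zip(client_params, client_sizes):
        w = n / total
        for i, p in enumerate(params):
            agg[i] += w * p
    return agg

# Two clients holding different amounts of local log data: the larger
# client (30 samples) pulls the global model toward its parameters.
global_params = fedavg([[1.0, 2.0], [3.0, 4.0]], [10, 30])
# -> [2.5, 3.5]
```

A plug-and-play testbed like FedLAD would let this strategy be swapped for alternatives without touching the LAD model or the dataset loaders.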


RGE-GCN: Recursive Gene Elimination with Graph Convolutional Networks for RNA-seq based Early Cancer Detection

Shende, Shreyas, Narayanan, Varsha, Fenn, Vishal, Huang, Yiran, Goksuluk, Dincer, Choudhary, Gaurav, Agraz, Melih, Xu, Mengjia

arXiv.org Artificial Intelligence

Early detection of cancer plays a key role in improving survival rates, but identifying reliable biomarkers from RNA-seq data is still a major challenge. The data are high-dimensional, and conventional statistical methods often fail to capture the complex relationships between genes. In this study, we introduce RGE-GCN (Recursive Gene Elimination with Graph Convolutional Networks), a framework that combines feature selection and classification in a single pipeline. Our approach builds a graph from gene expression profiles, uses a Graph Convolutional Network to classify cancer versus normal samples, and applies Integrated Gradients to highlight the most informative genes. By recursively removing less relevant genes, the model converges to a compact set of biomarkers that are both interpretable and predictive. We evaluated RGE-GCN on synthetic data as well as real-world RNA-seq cohorts of lung, kidney, and cervical cancers. Across all datasets, the method consistently achieved higher accuracy and F1-scores than standard tools such as DESeq2, edgeR, and limma-voom. Importantly, the selected genes aligned with well-known cancer pathways including PI3K-AKT, MAPK, SUMOylation, and immune regulation. These results suggest that RGE-GCN shows promise as a generalizable approach for RNA-seq based early cancer detection and biomarker discovery (https://rce-gcn.streamlit.app/).
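The recursive elimination loop at the heart of such a pipeline is simple to sketch. Here a toy score dictionary stands in for Integrated Gradients attributions, and the drop fraction and stopping size are arbitrary illustrative choices, not the paper's settings:

```python
def recursive_elimination(genes, score_fn, drop_frac=0.5, target_size=2):
    """Repeatedly score the surviving genes and discard the
    lowest-scoring fraction until a compact set remains."""
    while len(genes) > target_size:
        scored = sorted(genes, key=score_fn, reverse=True)
        keep = max(target_size, int(len(scored) * (1 - drop_frac)))
        genes = scored[:keep]
    return genes

# Toy attribution scores standing in for Integrated Gradients output.
scores = {"G1": 0.9, "G2": 0.1, "G3": 0.7, "G4": 0.2, "G5": 0.05, "G6": 0.6}
selected = recursive_elimination(list(scores), scores.get)
# -> ["G1", "G3"]
```

In the full method the scores would be recomputed by retraining the GCN after each elimination round, since attributions shift as the feature set shrinks.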


BEACON: A Unified Behavioral-Tactical Framework for Explainable Cybercrime Analysis with Large Language Models

Sachdeva, Arush, Saravanan, Rajendraprasad, Sarkar, Gargi, Vemuri, Kavita, Shukla, Sandeep Kumar

arXiv.org Artificial Intelligence

Cybercrime has emerged as one of the most pervasive and economically destructive consequences of global digitalization. Contemporary online fraud and deception-based crimes now account for unprecedented financial losses worldwide, exceeding trillions of United States dollars (USD) annually (Morgan, 2016), while also inflicting severe psychological, social, and reputational harm on victims. Unlike classical cyberattacks targeting systems and networks, modern cybercrime increasingly exploits human vulnerabilities rather than purely technical weaknesses, relying on deception, persuasion, impersonation, emotional coercion, and trust manipulation as primary attack vectors (Holt, 2019; Yao, Zheng, Wu, Wu, Gao, Wang and Yang, 2025; Sarkar and Shukla, 2023; Sarkar, Singh, Kumar and Shukla, 2023). Existing cybersecurity frameworks, such as the Cyber Kill Chain and the MITRE ATT&CK framework, provide powerful abstractions for understanding technically sophisticated cyberattacks targeting enterprise systems and critical infrastructure (MITRE Corporation, 2025b,a). However, these models are fundamentally system-centric: they describe how adversaries compromise digital infrastructure, escalate privileges, and exfiltrate data. In contrast, cybercrime, particularly scams, fraud, impersonation, and extortion, primarily targets individual decision-making processes (Louderback and Antonaccio, 2017), often without exploiting any software vulnerability at all. Consequently, the investigative needs of cybercrime differ substantially from those of traditional cyberattacks.


Enhancing Dimensionality Prediction in Hybrid Metal Halides via Feature Engineering and Class-Imbalance Mitigation

Karabin, Mariia, Armstrong, Isaac, Beck, Leo, Apanel, Paulina, Eisenbach, Markus, Mitzi, David B., Terletska, Hanna, Heinz, Hendrik

arXiv.org Artificial Intelligence

We present a machine learning framework for predicting the structural dimensionality of hybrid metal halides (HMHs), including organic-inorganic perovskites, using a combination of chemically informed feature engineering and advanced class-imbalance handling techniques. The dataset, consisting of 494 HMH structures, is highly imbalanced across dimensionality classes (0D, 1D, 2D, 3D), posing significant challenges to predictive modeling. This dataset was augmented to 1336 structures via the Synthetic Minority Oversampling Technique (SMOTE) to mitigate the effects of the class imbalance. We developed interaction-based descriptors and integrated them into a multi-stage workflow that combines feature selection, model stacking, and performance optimization to improve dimensionality prediction accuracy. Our approach significantly improves F1-scores for underrepresented classes, achieving robust cross-validation performance across all dimensionalities.
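SMOTE generates synthetic minority-class samples by interpolating between a minority point and one of its minority-class neighbors. The core step can be sketched as follows; the feature vectors are toy values and the neighbor is supplied directly rather than found by k-nearest-neighbor search, so this is an illustration of the principle, not the exact augmentation applied to the HMH dataset:

```python
import random

def smote_sample(x, neighbor, rng):
    """Create one synthetic point on the line segment between a
    minority sample and a minority-class neighbor (core SMOTE step)."""
    gap = rng.random()  # interpolation factor in [0, 1)
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbor)]

rng = random.Random(0)
minority = [[1.0, 2.0], [1.5, 2.5]]
synthetic = smote_sample(minority[0], minority[1], rng)
# Each coordinate of the synthetic point lies between the two parents'.
```

Because the synthetic points sit inside the convex hull of existing minority samples, SMOTE densifies the minority region without simply duplicating rows, which is why it tends to lift F1 on underrepresented classes.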


Exposing Pink Slime Journalism: Linguistic Signatures and Robust Detection Against LLM-Generated Threats

Shahriar, Sadat, Ayoobi, Navid, Mukherjee, Arjun, Musharrat, Mostafa, Vamsi, Sai Vishnu

arXiv.org Artificial Intelligence

The local news landscape, a vital source of reliable information for 28 million Americans, faces a growing threat from Pink Slime Journalism: low-quality, auto-generated articles that mimic legitimate local reporting. Detecting these deceptive articles requires a fine-grained analysis of their linguistic, stylistic, and lexical characteristics. In this work, we conduct a comprehensive study to uncover the distinguishing patterns of Pink Slime content and propose detection strategies based on these insights. Beyond traditional generation methods, we highlight a new adversarial vector: modifications through large language models (LLMs). Our findings reveal that even consumer-accessible LLMs can significantly undermine existing detection systems, reducing their performance by up to 40% in F1-score. To counter this threat, we introduce a robust learning framework specifically designed to resist LLM-based adversarial attacks and adapt to the evolving landscape of automated pink slime journalism, achieving an improvement of up to 27%.
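The F1-score in which the 40% degradation and 27% recovery are measured is the harmonic mean of precision and recall; computed from raw counts it is (a generic formula, with example counts that are not from this paper):

```python
def f1_score(tp, fp, fn):
    """F1 = harmonic mean of precision and recall, from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: 80 true positives, 20 false positives, 20 false negatives
# gives precision = recall = 0.8, so F1 = 0.8; a 40% relative drop
# would bring such a detector down to 0.48.
score = f1_score(80, 20, 20)
```

Because F1 penalizes whichever of precision or recall collapses, it is a natural metric for adversarial settings where an attacker tries to push detections into either false negatives or false positives.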