 malicious url


A Graph-Attentive LSTM Model for Malicious URL Detection

Hossain, Md. Ifthekhar, Arafat, Kazi Abdullah Al, Shepard, Bryce, Craig, Kayd, Parvez, Imtiaz

arXiv.org Artificial Intelligence

Malicious URLs pose significant security risks: they facilitate phishing attacks, distribute malware, and enable attackers to deface websites. Blacklist-based detection methods fail to identify new or obfuscated URLs because they depend on pre-existing patterns. This work presents a hybrid deep learning model named GNN-GAT-LSTM that combines Graph Neural Networks (GNNs) with Graph Attention Networks (GATs) and Long Short-Term Memory (LSTM) networks. The proposed architecture extracts both structural and sequential patterns from the data. The model transforms each URL into a graph in which characters become nodes connected by edges, and it represents node features with one-hot encoding. The model was trained and tested on a collection of 651,191 URLs classified into benign, phishing, defacement, and malware categories. Preprocessing included feature engineering and data balancing to address class imbalance and improve model learning. The GNN-GAT-LSTM model achieved a test accuracy of 0.9806 and a weighted F1-score of 0.9804. It showed high precision and recall across most classes, particularly for benign and defacement URLs. Overall, the model provides an efficient and scalable system for detecting malicious URLs and shows strong potential for real-world cybersecurity applications.
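
The character-to-graph step described in this abstract can be illustrated with a minimal sketch. The abstract does not say how edges are chosen or which alphabet is used, so the choices here are assumptions: edges link neighbouring character positions, and `VOCAB` is a small hypothetical alphabet. The function name `url_to_graph` is also illustrative and does not come from the paper.

```python
from typing import List, Tuple

def url_to_graph(url: str, vocab: str) -> Tuple[List[List[int]], List[Tuple[int, int]]]:
    """Return (node_features, edges) for a URL.

    Each character position is a node whose feature vector is a one-hot
    encoding over `vocab`. Edges link neighbouring positions, which is an
    assumption; the abstract does not specify the edge rule.
    """
    index = {ch: i for i, ch in enumerate(vocab)}
    features = []
    for ch in url:
        row = [0] * len(vocab)
        if ch in index:  # characters outside the vocabulary stay all-zero
            row[index[ch]] = 1
        features.append(row)
    edges = [(i, i + 1) for i in range(len(url) - 1)]
    return features, edges

# Hypothetical alphabet chosen for illustration only.
VOCAB = "abcdefghijklmnopqrstuvwxyz0123456789:/.-_?=&%"
features, edges = url_to_graph("http://a.io", VOCAB)
```

A graph library would typically consume `features` as the node-feature matrix and `edges` as the edge list; an attention layer and an LSTM would then operate on that structure.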


Scam2Prompt: A Scalable Framework for Auditing Malicious Scam Endpoints in Production LLMs

Chen, Zhiyang, Saba, Tara, Deng, Xun, Si, Xujie, Long, Fan

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have become critical to modern software development, but their reliance on uncurated web-scale datasets for training introduces a significant security risk: the absorption and reproduction of malicious content. To systematically evaluate this risk, we introduce Scam2Prompt, a scalable automated auditing framework that identifies the underlying intent of a scam site and then synthesizes innocuous, developer-style prompts that mirror this intent, allowing us to test whether an LLM will generate malicious code in response to these innocuous prompts. In a large-scale study of four production LLMs (GPT-4o, GPT-4o-mini, Llama-4-Scout, and DeepSeek-V3), we found that Scam2Prompt's innocuous prompts triggered malicious URL generation in 4.24% of cases. To test the persistence of this security risk, we constructed Innoc2Scam-bench, a benchmark of 1,559 innocuous prompts that consistently elicited malicious code from all four initial LLMs. When applied to seven additional production LLMs released in 2025, we found the vulnerability is not only present but severe, with malicious code generation rates ranging from 12.7% to 43.8%. Furthermore, existing safety measures like state-of-the-art guardrails proved insufficient to prevent this behavior, with an overall detection rate of less than 0.3%.


From Past to Present: A Survey of Malicious URL Detection Techniques, Datasets and Code Repositories

Tian, Ye, Yu, Yanqiu, Sun, Jianguo, Wang, Yanbin

arXiv.org Artificial Intelligence

Malicious URLs persistently threaten the cybersecurity ecosystem, either by deceiving users into divulging private data or by distributing harmful payloads to infiltrate host systems. Gaining timely insights into the current state of this ongoing battle holds significant importance. However, existing reviews exhibit four critical gaps: 1) their reliance on algorithm-centric taxonomies obscures understanding of how detection approaches exploit specific modal information channels; 2) they fail to incorporate pivotal LLM/Transformer-based defenses; 3) they collect no open-source implementations to facilitate benchmarking; 4) their dataset coverage is insufficient. This paper presents a comprehensive review of malicious URL detection technologies, systematically analyzing methods from traditional blacklisting to advanced deep learning approaches (e.g., Transformers, GNNs, and LLMs). Unlike prior surveys, we propose a novel modality-based taxonomy that categorizes existing works according to their primary data modalities (URL, HTML, Visual, etc.). This hierarchical classification enables both rigorous technical analysis and clear understanding of multimodal information utilization. Furthermore, to establish a profile of accessible datasets and address the lack of standardized benchmarking (where current studies often lack proper baseline comparisons), we curate and analyze: 1) publicly available datasets (2016-2024), and 2) open-source implementations from published works (2013-2025). Then, we outline essential design principles and architectural frameworks for product-level implementations. The review concludes by examining emerging challenges and proposing actionable directions for future research. We maintain a GitHub repository for the ongoing curation of datasets and open-source implementations: https://github.com/sevenolu7/Malicious-URL-Detection-Open-Source/tree/master.


Browser Extension for Fake URL Detection

Malik, Latesh G., Shambharkar, Rohini, Morey, Shivam, Kanpate, Shubhlak, Raut, Vedika

arXiv.org Artificial Intelligence

In recent years, cyberattacks have increased in number, and their intensity and potential to harm users have increased significantly as well. In an ever-advancing world, users find it difficult to keep up with the latest developments in technology, which can leave them vulnerable to attacks. Avoiding such situations requires tools that deter these attacks, and machine learning models are among the best options for this purpose. This paper presents a browser extension that uses machine learning models to enhance online security by integrating three crucial functionalities: malicious URL detection, spam email detection, and network log analysis. The proposed solution uses an LGBM classifier to classify phishing websites. This model was trained on a dataset with 87 features and achieved an accuracy of 96.5%, a precision of 96.8%, and an F1 score of 96.49%. The model for spam email detection uses the Multinomial NB algorithm and was trained on a dataset of over 5500 messages. It achieved an accuracy of 97.09% with a precision of 100%. The results demonstrate the effectiveness of using machine learning models for cybersecurity.


Hybrid Machine Learning Approach for Real-Time Malicious URL Detection Using SOM-RMO and RBFN with Tabu Search Optimization

T, Swetha, M, Seshaiah, KL, Hemalatha, BH, ManjunathaKumar, SVN, Murthy

arXiv.org Artificial Intelligence

The proliferation of malicious URLs has become a significant threat to internet security, encompassing SPAM, phishing, malware, and defacement attacks. Traditional detection methods struggle to keep pace with the evolving nature of these threats. Detecting malicious URLs in real-time requires advanced techniques capable of handling large datasets and identifying novel attack patterns. The challenge lies in developing a robust model that combines efficient feature extraction with accurate classification. We propose a hybrid machine learning approach combining Self-Organizing Map based Radial Movement Optimization (SOM-RMO) for feature extraction and Radial Basis Function Network (RBFN) based Tabu Search for classification. SOM-RMO effectively reduces dimensionality and highlights significant features, while RBFN, optimized with Tabu Search, classifies URLs with high precision. The proposed model demonstrates superior performance in detecting various malicious URL attacks. On a benchmark dataset, our approach achieved an accuracy of 96.5%, precision of 95.2%, recall of 94.8%, and an F1-score of 95.0%, outperforming traditional methods significantly.
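
The reported precision, recall, and F1-score can be checked against one another. The abstract does not state how the metrics were averaged across attack types, so this check is only a sketch that assumes the standard definition of F1 as the harmonic mean of precision $P$ and recall $R$:

```latex
F_1 = \frac{2PR}{P + R}
    = \frac{2 \times 0.952 \times 0.948}{0.952 + 0.948}
    = \frac{1.804992}{1.900}
    \approx 0.950
```

Under that assumption, the reported F1-score of 95.0% is consistent with the reported precision of 95.2% and recall of 94.8%.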


Malicious URL Detection using optimized Hist Gradient Boosting Classifier based on grid search method

Maftoun, Mohammad, Shadkam, Nima, Komamardakhi, Seyedeh Somayeh Salehi, Mansor, Zulkefli, Joloudari, Javad Hassannataj

arXiv.org Artificial Intelligence

Trusting the accuracy of data entered on online platforms can be difficult because malicious websites may gather information for unlawful purposes. The presence of such sites makes it challenging to analyze each website individually and hard to list all malicious Uniform Resource Locators (URLs) on a blacklist efficiently. This ongoing challenge emphasizes the crucial need for strong security measures to safeguard against potential threats and unauthorized data collection. To detect the risk posed by malicious websites, we propose using Machine Learning (ML)-based techniques. To this end, we applied several ML techniques, namely Hist Gradient Boosting Classifier (HGBC), K-Nearest Neighbor (KNN), Logistic Regression (LR), Decision Tree (DT), Random Forest (RF), Multi-Layer Perceptron (MLP), Light Gradient Boosting Machine (LGBM), and Support Vector Machine (SVM), to a dataset of benign and malicious websites. The dataset contains 1781 records of malicious and benign website data with 13 features. First, we investigated missing value imputation on the dataset. Then, we normalized the data by scaling it to the range of zero to one. Next, we used the Synthetic Minority Oversampling Technique (SMOTE) to balance the training data, since the dataset was unbalanced. After that, we applied the ML algorithms to the balanced training set. All algorithms were optimized using grid search. Finally, the models were evaluated on accuracy, precision, recall, F1 score, and Area Under the Curve (AUC). The results demonstrated that the HGBC classifier performs best on these metrics compared to the other classifiers.
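
Two of the preprocessing steps named in this abstract, scaling to the range zero to one and SMOTE-style oversampling, can be sketched in plain Python. This is a simplified stand-in and not the authors' code or the imbalanced-learn implementation: the helper names `min_max_scale` and `smote_like`, the neighbour count `k`, and the toy data are all assumptions made for illustration.

```python
import random
from typing import List

def min_max_scale(rows: List[List[float]]) -> List[List[float]]:
    """Scale each column to [0, 1]; constant columns map to 0.0."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [max(c) - min(c) for c in cols]
    return [
        [(v - lo) / span if span else 0.0 for v, lo, span in zip(row, lows, spans)]
        for row in rows
    ]

def smote_like(minority: List[List[float]], n_new: int, k: int = 2, seed: int = 0) -> List[List[float]]:
    """Create n_new synthetic points, each interpolated between a random
    minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Toy data: the first three rows stand in for the minority class.
X = [[1.0, 200.0], [2.0, 400.0], [3.0, 600.0], [10.0, 1000.0]]
scaled = min_max_scale(X)
minority = scaled[:3]
new_points = smote_like(minority, n_new=2)
```

As in the abstract, oversampling of this kind would be applied to the training split only, so that synthetic points never leak into the evaluation data.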


Mitigating Label Flipping Attacks in Malicious URL Detectors Using Ensemble Trees

Nowroozi, Ehsan, Jadalla, Nada, Ghelichkhani, Samaneh, Jolfaei, Alireza

arXiv.org Artificial Intelligence

Malicious URLs provide adversarial opportunities across various industries, including transportation, healthcare, energy, and banking, which could be detrimental to business operations. Consequently, the detection of these URLs is of crucial importance; however, current Machine Learning (ML) models are susceptible to backdoor attacks. These attacks involve manipulating a small percentage of training data labels, such as Label Flipping (LF), which changes benign labels to malicious ones and vice versa. This manipulation results in misclassification and leads to incorrect model behavior. Therefore, integrating defense mechanisms into the architecture of ML models becomes imperative to fortify them against potential attacks. The focus of this study is on backdoor attacks in the context of URL detection using ensemble trees. By illuminating the motivations behind such attacks, highlighting the roles of attackers, and emphasizing the critical importance of effective defense strategies, this paper contributes to the ongoing efforts to fortify ML models against adversarial threats in network security. We propose an innovative alarm system that detects the presence of poisoned labels and a defense mechanism designed to uncover the original class labels, with the aim of mitigating backdoor attacks on ensemble tree classifiers. We conducted a case study using the Alexa and Phishing Site URL datasets and showed that LF attacks can be addressed using our proposed defense mechanism. Our experimental results show that the LF attack achieved an Attack Success Rate (ASR) of 50-65% within 2-5%, and the innovative defense method successfully detected poisoned labels with an accuracy of up to 100%.
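
The label flipping manipulation described in this abstract can be illustrated with a short sketch. The function name `flip_labels`, the 0/1 label encoding, and the toy label set are assumptions. The 5% rate is illustrative and reads the abstract's "2-5%" as the fraction of flipped labels, which the abstract does not state explicitly.

```python
import random
from typing import List

def flip_labels(labels: List[int], rate: float, seed: int = 0) -> List[int]:
    """Return a copy of binary labels (0 = benign, 1 = malicious) in which
    round(rate * len(labels)) randomly chosen entries are flipped."""
    rng = random.Random(seed)
    n_flip = round(rate * len(labels))
    poisoned = list(labels)
    for i in rng.sample(range(len(labels)), n_flip):
        poisoned[i] = 1 - poisoned[i]  # benign becomes malicious and vice versa
    return poisoned

clean = [0] * 60 + [1] * 40          # toy training labels
poisoned = flip_labels(clean, rate=0.05)
changed = sum(a != b for a, b in zip(clean, poisoned))
```

Training one classifier on `clean` and another on `poisoned`, then comparing their predictions on the same held-out URLs, is one simple way to observe the effect of such an attack.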


Detection of Malicious Websites Using Machine Learning Techniques

Oshingbesan, Adebayo, Ekoh, Courage, Okobi, Chukwuemeka, Munezero, Aime, Richard, Kagame

arXiv.org Artificial Intelligence

In detecting malicious websites, a common approach is the use of blacklists, which are not exhaustive and cannot generalize to new malicious sites. Detecting newly encountered malicious websites automatically will help reduce vulnerability to this form of attack. In this study, we explored the use of ten machine learning models to classify malicious websites based on lexical features and to understand how they generalize across datasets. Specifically, we trained, validated, and tested these models on different datasets and then carried out a cross-dataset analysis. From our analysis, we found that K-Nearest Neighbor is the only model that performs consistently well across datasets. Other models such as Random Forest, Decision Trees, Logistic Regression, and Support Vector Machines also consistently outperform a baseline model that predicts every link as malicious, across all metrics and datasets. We also found no evidence that any subset of lexical features generalizes across models or datasets. This research should be relevant to cybersecurity professionals and academic researchers, as it could form the basis for real-life detection systems or further research.
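
The two ingredients this abstract relies on, lexical features and an all-malicious baseline, can be sketched as follows. The abstract does not list the features used in the study, so the feature set below is a hypothetical example of commonly used lexical signals. The function names and the example URL are likewise illustrative.

```python
from typing import Dict, List, Tuple
from urllib.parse import urlparse

def lexical_features(url: str) -> Dict[str, int]:
    """Compute a few string-level features without fetching the page."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots": url.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "num_hyphens": url.count("-"),
        "has_at_symbol": int("@" in url),
        "path_depth": len([seg for seg in parsed.path.split("/") if seg]),
    }

def all_malicious_baseline(urls: List[str]) -> List[int]:
    """Baseline that labels every URL as malicious (1)."""
    return [1 for _ in urls]

def precision_recall(y_true: List[int], y_pred: List[int]) -> Tuple[float, float]:
    """Precision and recall for the malicious class (label 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

feats = lexical_features("http://login-secure.example.com/a/b/index.php?id=42")
y_true = [1, 0, 0, 1]                       # toy ground truth
y_pred = all_malicious_baseline(["u1", "u2", "u3", "u4"])
precision, recall = precision_recall(y_true, y_pred)
```

The baseline always reaches perfect recall on the malicious class, while its precision equals the share of malicious URLs in the data, which is why a useful model has to beat it on precision-sensitive metrics.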