MH-1M: A 1.34 Million-Sample Comprehensive Multi-Feature Android Malware Dataset for Machine Learning, Deep Learning, Large Language Models, and Threat Intelligence Research
Braganca, Hendrio, Kreutz, Diego, Rocha, Vanderson, Assolin, Joner, Feitosa, and Eduardo
–arXiv.org Artificial Intelligence
Abstract--We present MH-1M, one of the most comprehensive and up-to-date datasets for advanced Android malware research. The dataset comprises 1,340,515 applications, encompassing a wide range of features and extensive metadata. T o ensure accurate malware classification, we employ the VirusT otal API, integrating multiple detection engines for comprehensive and reliable assessment. Our GitHub, Figshare, and Harvard Dataverse repositories provide open access to the processed dataset and its extensive supplementary metadata, totaling more than 400 GB of data and including the outputs of the feature extraction pipeline as well as the corresponding VirusT otal reports. Our findings underscore the MH-1M dataset's invaluable role in understanding the evolving landscape of malware. The pervasive spread of Android malware poses a significant challenge for cybersecurity research. This challenge stems mainly from the open-source nature and affordability of Android platforms, which grant users access to a large market of free applications. At the same time, malware continually evolves, adapting its tactics to execute more sophisticated and frequent attacks. Such attacks often result in data destruction, information theft, and several other cybercrimes [1], [2], [3]. Machine learning (ML) algorithms have been widely used to uncover malware and have demonstrated remarkable effectiveness in detection systems, leveraging their discriminative capabilities to identify new variants of malicious applications [4], [5], [6]. To mitigate these risks, researchers have developed a variety of methods for detecting Android malware, establishing machine learning as a central focus of contemporary mobile security research [7], [8], [9]. However, the effectiveness of ML models is highly dependent on the quality of the datasets used for training. Many existing datasets suffer from limitations such as outdated data, inadequate representation, and a limited number of samples and features, making them unsuitable for modern malware detection [10], [2], [11], [12]. These issues raise concerns about the reliability of reported performance metrics and can potentially lead to misleading conclusions [2]. A growing body of research in Android malware detection strongly supports the notion that increasing the number of discriminative features can significantly improve classification performance [13], [14], [15]. We present in Table I an overview of widely used Android malware datasets from recent years.
arXiv.org Artificial Intelligence
Nov-4-2025
- Country:
- North America > United States (0.04)
- South America > Brazil
- Rio Grande do Sul > Porto Alegre (0.04)
- Genre:
- Research Report > New Finding (1.00)
- Industry:
- Information Technology > Security & Privacy (1.00)
- Technology: