malware detector
Certifiably robust malware detectors by design
Gimenez, Pierre-Francois, Sivaprasad, Sarath, Fritz, Mario
Malware analysis involves analyzing suspicious software to detect malicious payloads. Static malware analysis, which does not require software execution, relies increasingly on machine learning techniques to achieve scalability. Although such techniques obtain very high detection accuracy, they can be easily evaded with adversarial examples where a few modifications of the sample can dupe the detector without modifying the behavior of the software. Unlike other domains, such as computer vision, creating an adversarial example of malware without altering its functionality requires specific transformations. We propose a new model architecture for certifiably robust malware detection by design. In addition, we show that every robust detector can be decomposed into a specific structure, which can be applied to learn empirically robust malware detectors, even on fragile features. Our framework ERDALT is based on this structure. We compare and validate these approaches with machine-learning-based malware detection methods, allowing for robust detection with limited reduction of detection performance.
- Europe > Spain > Basque Country > Biscay Province > Bilbao (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- Europe > Germany (0.04)
- Europe > France > Brittany > Ille-et-Vilaine > Rennes (0.04)
Empirical Quantification of Spurious Correlations in Malware Detection
Perasso, Bianca, Lozza, Ludovico, Ponte, Andrea, Demetrio, Luca, Oneto, Luca, Roli, Fabio
End-to-end deep learning exhibits unmatched performance for detecting malware, but such an achievement is reached by exploiting spurious correlations -- features with high relevance at inference time, but known to be useless through domain knowledge. While previous work highlighted that deep networks mainly focus on metadata, none investigated the phenomenon further, without quantifying their impact on the decision. In this work, we deepen our understanding of how spurious correlation affects deep learning for malware detection by highlighting how much models rely on empty spaces left by the compiler, which diminishes the relevance of the compiled code. Through our seminal analysis on a small-scale balanced dataset, we introduce a ranking of two end-to-end models to better understand which is more suitable to be put in production.
- Europe > Italy > Sardinia > Cagliari (0.04)
- Europe > Italy > Friuli Venezia Giulia > Trieste Province > Trieste (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
Learning Temporal Invariance in Android Malware Detectors
Zheng, Xinran, Yang, Shuo, Ngai, Edith C. H., Jana, Suman, Cavallaro, Lorenzo
Learning-based Android malware detectors degrade over time due to natural distribution drift caused by malware variants and new families. This paper systematically investigates the challenges classifiers trained with empirical risk minimization (ERM) face against such distribution shifts and attributes their shortcomings to their inability to learn stable discriminative features. Invariant learning theory offers a promising solution by encouraging models to generate stable representations crossing environments that expose the instability of the training set. However, the lack of prior environment labels, the diversity of drift factors, and low-quality representations caused by diverse families make this task challenging. To address these issues, we propose TIF, the first temporal invariant training framework for malware detection, which aims to enhance the ability of detectors to learn stable representations across time. TIF organizes environments based on application observation dates to reveal temporal drift, integrating specialized multi-proxy contrastive learning and invariant gradient alignment to generate and align environments with high-quality, stable representations. TIF can be seamlessly integrated into any learning-based detector. Experiments on a decade-long dataset show that TIF excels, particularly in early deployment stages, addressing real-world needs and outperforming state-of-the-art methods.
- Research Report > New Finding (0.92)
- Research Report > Promising Solution (0.68)
- Information Technology > Security & Privacy (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.67)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)
DexRay: A Simple, yet Effective Deep Learning Approach to Android Malware Detection based on Image Representation of Bytecode
Daoudi, Nadia, Samhi, Jordan, Kabore, Abdoul Kader, Allix, Kevin, Bissyandé, Tegawendé F., Klein, Jacques
Computer vision has witnessed several advances in recent years, with unprecedented performance provided by deep representation learning research. Image formats thus appear attractive to other fields such as malware detection, where deep learning on images alleviates the need for comprehensively hand-crafted features generalising to different malware variants. We postulate that this research direction could become the next frontier in Android malware detection, and therefore requires a clear roadmap to ensure that new approaches indeed bring novel contributions. We contribute with a first building block by developing and assessing a baseline pipeline for image-based malware detection with straightforward steps. We propose DexRay, which converts the bytecode of the app DEX files into grey-scale "vector" images and feeds them to a 1-dimensional Convolutional Neural Network model. We view DexRay as foundational due to the exceedingly basic nature of the design choices, allowing to infer what could be a minimal performance that can be obtained with image-based learning in malware detection. The performance of DexRay evaluated on over 158k apps demonstrates that, while simple, our approach is effective with a high detection rate (F1-score= 0.96). Finally, we investigate the impact of time decay and image-resizing on the performance of DexRay and assess its resilience to obfuscation. This work-in-progress paper contributes to the domain of Deep Learning based Malware detection by providing a sound, simple, yet effective approach (with available artefacts) that can be the basis to scope the many profound questions that will need to be investigated to fully develop this domain.
- North America > United States > New York > New York County > New York City (0.05)
- Asia > Middle East > Jordan (0.04)
- North America > United States > California > Santa Clara County > Santa Clara (0.04)
- (3 more...)
SCORE: Syntactic Code Representations for Static Script Malware Detection
Erdemir, Ecenaz, Park, Kyuhong, Morais, Michael J., Gao, Vianne R., Marschalek, Marion, Fan, Yi
As businesses increasingly adopt cloud technologies, they also need to be aware of new security challenges, such as server-side script attacks, to ensure the integrity of their systems and data. These scripts can steal data, compromise credentials, and disrupt operations. Unlike executables with standardized formats (e.g., ELF, PE), scripts are plaintext files with diverse syntax, making them harder to detect using traditional methods. As a result, more sophisticated approaches are needed to protect cloud infrastructures from these evolving threats. In this paper, we propose novel feature extraction and deep learning (DL)-based approaches for static script malware detection, targeting server-side threats. We extract features from plain-text code using two techniques: syntactic code highlighting (SCH) and abstract syntax tree (AST) construction. SCH leverages complex regexes to parse syntactic elements of code, such as keywords, variable names, etc. ASTs generate a hierarchical representation of a program's syntactic structure. We then propose a sequential and a graph-based model that exploits these feature representations to detect script malware. We evaluate our approach on more than 400K server-side scripts in Bash, Python and Perl. We use a balanced dataset of 90K scripts for training, validation, and testing, with the remaining from 400K reserved for further analysis. Experiments show that our method achieves a true positive rate (TPR) up to 81% higher than leading signature-based antivirus solutions, while maintaining a low false positive rate (FPR) of 0.17%. Moreover, our approach outperforms various neural network-based detectors, demonstrating its effectiveness in learning code maliciousness for accurate detection of script malware.
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- North America > United States > District of Columbia > Washington (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Assessing the Impact of Packing on Machine Learning-Based Malware Detection and Classification Systems
Gibert, Daniel, Totosis, Nikolaos, Patsakis, Constantinos, Zizzo, Giulio, Le, Quan
The proliferation of malware, particularly through the use of packing, presents a significant challenge to static analysis and signature-based malware detection techniques. The application of packing to the original executable code renders extracting meaningful features and signatures challenging. To deal with the increasing amount of malware in the wild, researchers and anti-malware companies started harnessing machine learning capabilities with very promising results. However, little is known about the effects of packing on static machine learning-based malware detection and classification systems. This work addresses this gap by investigating the impact of packing on the performance of static machine learning-based models used for malware detection and classification, with a particular focus on those using visualisation techniques. To this end, we present a comprehensive analysis of various packing techniques and their effects on the performance of machine learning-based detectors and classifiers. Our findings highlight the limitations of current static detection and classification systems and underscore the need to be proactive to effectively counteract the evolving tactics of malware authors.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
- Europe > Greece > Attica > Athens (0.04)
- (5 more...)
Improving Adversarial Robustness in Android Malware Detection by Reducing the Impact of Spurious Correlations
Bostani, Hamid, Zhao, Zhengyu, Moonsamy, Veelasha
Machine learning (ML) has demonstrated significant advancements in Android malware detection (AMD); however, the resilience of ML against realistic evasion attacks remains a major obstacle for AMD. One of the primary factors contributing to this challenge is the scarcity of reliable generalizations. Malware classifiers with limited generalizability tend to overfit spurious correlations derived from biased features. Consequently, adversarial examples (AEs), generated by evasion attacks, can modify these features to evade detection. In this study, we propose a domain adaptation technique to improve the generalizability of AMD by aligning the distribution of malware samples and AEs. Specifically, we utilize meaningful feature dependencies, reflecting domain constraints in the feature space, to establish a robust feature space. Training on the proposed robust feature space enables malware classifiers to learn from predefined patterns associated with app functionality rather than from individual features. This approach helps mitigate spurious correlations inherent in the initial feature space. Our experiments conducted on DREBIN, a renowned Android malware detector, demonstrate that our approach surpasses the state-of-the-art defense, Sec-SVM, when facing realistic evasion attacks. In particular, our defense can improve adversarial robustness by up to 55% against realistic evasion attacks compared to Sec-SVM.
- Europe > Germany (0.04)
- Asia > China > Shaanxi Province > Xi'an (0.04)
- North America > United States > Massachusetts > Suffolk County > Boston (0.04)
- (3 more...)
- Information Technology > Security & Privacy (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)
On the Robustness of Malware Detectors to Adversarial Samples
Salman, Muhammad, Zhao, Benjamin Zi Hao, Asghar, Hassan Jameel, Ikram, Muhammad, Kaushik, Sidharth, Kaafar, Mohamed Ali
Adversarial examples add imperceptible alterations to inputs with the objective to induce misclassification in machine learning models. They have been demonstrated to pose significant challenges in domains like image classification, with results showing that an adversarially perturbed image to evade detection against one classifier is most likely transferable to other classifiers. Adversarial examples have also been studied in malware analysis. Unlike images, program binaries cannot be arbitrarily perturbed without rendering them non-functional. Due to the difficulty of crafting adversarial program binaries, there is no consensus on the transferability of adversarially perturbed programs to different detectors. In this work, we explore the robustness of malware detectors against adversarially perturbed malware. We investigate the transferability of adversarial attacks developed against one detector, against other machine learning-based malware detectors, and code similarity techniques, specifically, locality sensitive hashing-based detectors. Our analysis reveals that adversarial program binaries crafted for one detector are generally less effective against others. We also evaluate an ensemble of detectors and show that they can potentially mitigate the impact of adversarial program binaries. Finally, we demonstrate that substantial program changes made to evade detection may result in the transformation technique being identified, implying that the adversary must make minimal changes to the program binary.
SLIFER: Investigating Performance and Robustness of Malware Detection Pipelines
Ponte, Andrea, Trizna, Dmitrijs, Demetrio, Luca, Biggio, Battista, Ogbu, Ivan Tesfai, Roli, Fabio
As a result of decades of research, Windows malware detection is approached through a plethora of techniques. However, there is an ongoing mismatch between academia -- which pursues an optimal performances in terms of detection rate and low false alarms -- and the requirements of real-world scenarios. In particular, academia focuses on combining static and dynamic analysis within a single or ensemble of models, falling into several pitfalls like (i) firing dynamic analysis without considering the computational burden it requires; (ii) discarding impossible-to-analyse samples; and (iii) analysing robustness against adversarial attacks without considering that malware detectors are complemented with more non-machine-learning components. Thus, in this paper we propose SLIFER, a novel Windows malware detection pipeline sequentially leveraging both static and dynamic analysis, interrupting computations as soon as one module triggers an alarm, requiring dynamic analysis only when needed. Contrary to the state of the art, we investigate how to deal with samples resistance to analysis, showing how much they impact performances, concluding that it is better to flag them as legitimate to not drastically increase false alarms. Lastly, we perform a robustness evaluation of SLIFER leveraging content-injections attacks, and we show that, counter-intuitively, attacks are blocked more by YARA rules than dynamic analysis due to byte artifacts created while optimizing the adversarial strategy.
- North America > United States > Utah > Salt Lake County > Salt Lake City (0.05)
- Europe > Italy > Liguria > Genoa (0.04)
- Europe > Czechia > Prague (0.04)
- (4 more...)
- Information Technology > Security & Privacy (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.95)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.94)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
A New Formulation for Zeroth-Order Optimization of Adversarial EXEmples in Malware Detection
Rando, Marco, Demetrio, Luca, Rosasco, Lorenzo, Roli, Fabio
Machine learning malware detectors are vulnerable to adversarial EXEmples, i.e. carefully-crafted Windows programs tailored to evade detection. Unlike other adversarial problems, attacks in this context must be functionality-preserving, a constraint which is challenging to address. As a consequence heuristic algorithms are typically used, that inject new content, either randomly-picked or harvested from legitimate programs. In this paper, we show how learning malware detectors can be cast within a zeroth-order optimization framework which allows to incorporate functionality-preserving manipulations. This permits the deployment of sound and efficient gradient-free optimization algorithms, which come with theoretical guarantees and allow for minimal hyper-parameters tuning. As a by-product, we propose and study ZEXE, a novel zero-order attack against Windows malware detection. Compared to state-of-the-art techniques, ZEXE provides drastic improvement in the evasion rate, while reducing to less than one third the size of the injected content.
- Asia > Middle East > Jordan (0.04)
- North America > United States > New York (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- (2 more...)
- Research Report > New Finding (0.46)
- Research Report > Promising Solution (0.34)
- Information Technology > Security & Privacy (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.70)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)