Avast-CTU Public CAPE Dataset
Bosansky, Branislav; Kouba, Dominik; Manhal, Ondrej; Sick, Thorsten; Lisy, Viliam; Kroustek, Jakub; Somol, Petr
arXiv.org Artificial Intelligence
There are many methods for detecting malicious samples (e.g., see surveys [19, 13]). Broadly speaking, they fall into two main categories: (1) detection based on static features of the sample and (2) detection based on behavioral analysis. Static features are typically derived from the whole sample (e.g., by treating it as an image [9]) and/or from properties of its most important parts (e.g., by examining in detail the header of a Windows portable executable (PE) file) [18]. Behavioral analysis consists of executing the sample (or simulating its execution) and logging the performed actions in order to determine whether they have characteristics of malicious behavior [13]. The main advantage of the static approach is computational efficiency, since extracting static features from the file itself can be much faster than (simulated) execution. Its main disadvantage is the inability to reliably distinguish malicious samples from benign ones when the sample is encrypted and/or when a clean file has been altered in a minor way to exhibit malicious behavior. Methods relying on behavioral analysis can discover malicious behavior even in encrypted samples; however, they require significantly more resources to run or simulate the instructions of the analyzed sample. In either case, the growing number of new, previously unseen samples makes the use of automated decision/classification methods from artificial intelligence (AI) and machine learning (ML) inevitable in the malware-detection domain. Framing malware detection as an AI/ML problem reveals interesting and unique properties of the domain that are less prevalent in other domains.
Sep-6-2022