Cheney, James
Hack Me If You Can: Aggregating AutoEncoders for Countering Persistent Access Threats Within Highly Imbalanced Data
Benabderrahmane, Sidahmed, Hoang, Ngoc, Valtchev, Petko, Cheney, James, Rahwan, Talal
Advanced Persistent Threats (APTs) are sophisticated, targeted cyberattacks designed to gain unauthorized access to systems and remain undetected for extended periods. To evade detection, APT cyberattacks deceive defense layers with breaches and exploits, thereby complicating exposure by traditional anomaly detection-based security methods. The challenge of detecting APTs with machine learning is compounded by the rarity of relevant datasets and the significant imbalance in the data, which makes the detection process highly burdensome. We present AE-APT, a deep learning-based tool for APT detection that features a family of AutoEncoder methods ranging from a basic one to a Transformer-based one. We evaluated our tool on a suite of provenance trace databases produced by the DARPA Transparent Computing program, where APT-like attacks constitute as little as 0.004% of the data. The datasets span multiple operating systems, including Android, Linux, BSD, and Windows, and cover two attack scenarios. The outcomes showed that AE-APT has significantly higher detection rates compared to its competitors, indicating superior performance in detecting and ranking anomalies.
A Rule Mining-Based Advanced Persistent Threats Detection System
Benabderrahmane, Sidahmed, Berrada, Ghita, Cheney, James, Valtchev, Petko
Advanced persistent threats (APTs) are stealthy cyber-attacks that aim to steal valuable information from target organizations and tend to persist over time. Blocking all APTs is impossible, security experts caution, hence the importance of research on early detection and damage limitation. Whole-system provenance tracking and provenance trace mining are considered promising as they can help find causal relationships between activities and flag suspicious event sequences as they occur. We introduce an unsupervised method that exploits OS-independent features reflecting process activity to detect realistic APT-like attacks from provenance traces. Anomalous processes are ranked using both frequent and rare event associations learned from traces. Results are then presented as implications which, since interpretable, help leverage causality in explaining the detected anomalies. When evaluated on datasets from the DARPA Transparent Computing program, our method outperformed competing approaches.
Categorical anomaly detection in heterogeneous data using minimum description length clustering
Cheney, James, Gombau, Xavier, Berrada, Ghita, Benabderrahmane, Sidahmed
Anomaly detection algorithms have been proposed for categorical data based on the minimum description length (MDL) principle. However, they can be ineffective when detecting anomalies in heterogeneous datasets representing a mixture of different sources, such as security scenarios in which system and user processes have distinct behavior patterns. We propose a meta-algorithm for enhancing any MDL-based anomaly detection model to deal with heterogeneous data by fitting a mixture model to the data, via a variant of k-means clustering. Our experimental results show that using a discrete mixture model provides competitive performance relative to two previous anomaly detection algorithms, while mixtures of more sophisticated models yield further gains, on both synthetic datasets and realistic datasets from a security scenario.
Two examples of anomaly detection based on MDL have been studied and shown to perform well: the OC3 algorithm [21] based on an itemset mining technique called Krimp [26], and the CompreX algorithm [2]. Broadly speaking, both take a similar approach: first, a model H of the data that compresses it well is found using a heuristic search, balancing the model complexity L(H) (number of bits required to compress the model structure/parameters) against the data complexity L(X | H) (number of bits required to compress the data given the model). Once such a model H is found, we assign to each object x ∈ X a score corresponding to the object's compressed size L(x | H) given the selected model. Intuitively, if the model accurately characterizes the data as a whole, records that are representative will compress well, yielding a low anomaly score.
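The compression-based scoring idea described above can be illustrated with a minimal sketch. It is not OC3, CompreX, or the mixture meta-algorithm: as a stated assumption, the model H here is just one independent value-frequency table per attribute, and the names `fit_model`, `code_length`, and `rank_anomalies` are hypothetical. Each record x is scored by an estimate of L(x | H), the number of bits needed to encode it under that model.

```python
import math
from collections import Counter

def fit_model(records):
    """Build a deliberately simple model H: one frequency table per attribute.

    Assumption: attributes are treated as independent, which real MDL-based
    detectors do not do.
    """
    n_attrs = len(records[0])
    tables = [Counter(r[j] for r in records) for j in range(n_attrs)]
    return tables, len(records)

def code_length(record, model):
    """Estimate L(x | H) in bits as the sum of -log2 p(value) over attributes."""
    tables, n = model
    bits = 0.0
    for value, table in zip(record, tables):
        # Add-one smoothing, reserving one extra slot for unseen values.
        p = (table[value] + 1) / (n + len(table) + 1)
        bits += -math.log2(p)
    return bits

def rank_anomalies(records):
    """Return (score, record) pairs, highest code length (most anomalous) first."""
    model = fit_model(records)
    scored = [(code_length(r, model), r) for r in records]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Toy categorical data: eight typical records and one unusual one.
records = [("user", "read", "file")] * 8 + [("system", "write", "socket")]
ranking = rank_anomalies(records)
```

In this toy data the rare record needs more bits than the frequent ones, so it is ranked first. A heterogeneous dataset would, under the same assumptions, call for fitting one such model per cluster and scoring each record against its own cluster's model.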