The identification of anomalies in temporal data is a core component of numerous research areas such as intrusion detection, fault prevention, genomics and fraud detection. This article provides an experimental comparison of novelty detection methods applied to discrete sequences. The objective of this study is to identify which state-of-the-art methods are efficient and appropriate candidates for a given use case. Our recommendations rely on extensive novelty detection experiments based on a variety of public datasets in addition to novel industrial datasets. We also perform thorough scalability and memory usage tests, which yield new insights into the methods' performance, a key selection criterion for problems relying on large volumes of data and for applications subject to strict response time constraints.
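To make the task concrete, the following is a minimal sketch of novelty scoring on discrete sequences. It assumes a smoothed first-order Markov chain as the scorer, which is an illustrative choice and not necessarily one of the methods compared in the study; the function names, the toy alphabet and the sequences are all hypothetical.

```python
import math
from collections import defaultdict


def fit_transitions(sequences, alphabet, alpha=1.0):
    """Estimate first-order transition probabilities with additive smoothing.

    `alpha` is a hypothetical smoothing constant; it keeps unseen
    transitions from receiving zero probability.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1.0
    probs = {}
    for prev in alphabet:
        total = sum(counts[prev].values()) + alpha * len(alphabet)
        probs[prev] = {
            cur: (counts[prev][cur] + alpha) / total for cur in alphabet
        }
    return probs


def novelty_score(seq, probs):
    """Average negative log-likelihood per transition (higher = more novel).

    Symbols outside the training alphabet are not handled in this sketch.
    """
    pairs = list(zip(seq, seq[1:]))
    if not pairs:
        return 0.0
    return -sum(math.log(probs[p][c]) for p, c in pairs) / len(pairs)


# Toy training set: sequences that cycle through a -> b -> c.
train = ["abcabcabc", "abcabc", "bcabca"]
model = fit_transitions(train, alphabet="abc")

normal_score = novelty_score("abcabc", model)  # follows the learned cycle
novel_score = novelty_score("acbacb", model)   # reverses the cycle
print(normal_score, novel_score)
```

A sequence is then reported as novel when its score exceeds a threshold chosen on held-out normal data. The training and scoring costs of such a model grow with the alphabet size and the sequence length, which is the kind of scalability and memory behaviour the abstract refers to.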
User and entity behaviour analytics (UEBA) is an area of artificial intelligence that detects hostile actions (e.g. attacks, fraud, influence, poisoning) from the unusual nature of observed events, without relying on signatures. A UEBA process usually involves two phases, learning and inference. Available intrusion detection systems (IDS) still suffer from several shortcomings, including over-simplification of problems, under-exploitation of the potential of AI, insufficient consideration of the temporality of events, and imperfect management of the memory cycle of behaviours. In addition, while an alert generated by a signature-based IDS can refer to the signature on which the detection is based, IDS in the UEBA domain produce results, often associated with a score, that are harder to explain. Our unsupervised approach enriches this process with a third phase that correlates events (incongruities, weak signals) presumed to be linked together, with the benefit of reducing both false positives and false negatives. We also seek to avoid the so-called "boiled frog" bias inherent in continuous learning. Our first results, on both synthetic and real data, are promising and explainable.
Recent advancements in Artificial Intelligence (AI) have brought new capabilities to behavioural analysis (UEBA) for cyber-security, which consists in detecting hostile actions based on the unusual nature of events observed on the Information System. In our previous work (presented at C&ESAR 2018 and FIC 2019), we combined deep neural network auto-encoders for anomaly detection with graph-based event correlation to address major limitations in UEBA systems. This resulted in reduced false positive and false negative rates and improved alert explainability, while maintaining real-time performance and scalability. However, we did not address the natural evolution of behaviours through time, also known as concept drift. To maintain effective detection capabilities, an anomaly-based detection system must be continually trained, which opens a door to an adversary who can conduct the so-called "frog-boiling" attack by progressively instilling unnoticed attack traces into the behavioural models until the complete attack is considered normal. In this paper, we present a solution that effectively mitigates this attack by improving the detection process and efficiently leveraging human expertise. We also present preliminary work on an adversarial AI conducting deception attacks, which will eventually be used to help assess and improve the defense system. These defensive and offensive AIs implement joint, continual and active learning, a necessary step in assessing, validating and certifying AI-based defensive solutions.
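The frog-boiling attack described above can be illustrated with a deliberately simple sketch. It assumes a one-dimensional detector that flags values far from a running baseline and keeps learning from every value it accepts; this toy detector, its parameters and the numeric values are hypothetical and stand in for the auto-encoder models of the paper. It shows only the attack, not the mitigation the paper proposes.

```python
class ContinualDetector:
    """Toy anomaly detector that is continually retrained (hypothetical)."""

    def __init__(self, baseline, tolerance, rate):
        self.baseline = baseline    # current notion of "normal"
        self.tolerance = tolerance  # maximum accepted deviation
        self.rate = rate            # learning rate of the baseline update

    def observe(self, value):
        """Return True if `value` is flagged; otherwise learn from it."""
        if abs(value - self.baseline) > self.tolerance:
            return True
        # Accepted values shift the baseline: this is the attack surface.
        self.baseline += self.rate * (value - self.baseline)
        return False


def run(values, detector):
    return [detector.observe(v) for v in values]


target = 50.0  # behaviour the attacker ultimately wants accepted

# Abrupt attack: jumping straight to the target is flagged every time.
abrupt = ContinualDetector(baseline=10.0, tolerance=5.0, rate=0.5)
abrupt_flags = run([target] * 5, abrupt)

# Frog-boiling attack: a slow ramp keeps each step within the tolerance,
# so the baseline follows the attacker up to the target.
gradual = ContinualDetector(baseline=10.0, tolerance=5.0, rate=0.5)
ramp = [10.0 + 2.0 * k for k in range(1, 21)]  # 12.0, 14.0, ..., 50.0
gradual_flags = run(ramp + [target], gradual)

print(any(abrupt_flags), any(gradual_flags), round(gradual.baseline, 2))
```

In this sketch the abrupt jump is rejected and leaves the model untouched, whereas the ramp is never flagged and drags the baseline to within the tolerance of the target. This is why continual training needs an additional safeguard, such as the human expertise the abstract mentions.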
This article describes different models based on Bayesian networks (BNs) that capture expertise in the diagnosis of brain tumors. Bayesian networks are well suited to representing the uncertainty inherent in the diagnosis of these tumors. In our work, we first tested several network structures: structures derived from the reasoning performed by doctors on the one hand, and structures generated automatically on the other. This step aims to find the structure that best increases diagnostic accuracy. The learning algorithms used are MWST-EM, SEM and SEM + T. To estimate the parameters of the Bayesian network from an incomplete database, we propose an extension of the EM algorithm that adds a priori knowledge in the form of thresholds calculated by the first phase of the RBE algorithm. The very encouraging results obtained are discussed at the end of the paper.
Data-stream clustering is an ever-expanding subdomain of knowledge extraction. Most past and present research effort aims at scaling up efficiently to huge data repositories. Our approach focuses on qualitative improvement, mainly for the detection of "weak signals" and the precise tracking of topical evolutions in the framework of information watch, although scalability is intrinsically guaranteed in a possibly distributed implementation. Our GERMEN algorithm exhaustively picks up the whole set of density peaks of the data at time t by identifying the local perturbations induced by the current document vector, such as changing cluster borders or new or vanishing clusters. Optimality follows from the uniqueness 1) of the density landscape for any value of our zoom parameter, and 2) of the cluster allocation produced by our border propagation rule. This results in a rigorous independence from the order in which the data are presented and from any initialization parameter. As a first step, we present here only the assessment of a static view resulting from one year of the CNRS/INIST Pascal database in the field of geotechnics.