Gaussier, Eric
On the Fly Detection of Root Causes from Observed Data with Application to IT Systems
Zan, Lei, Assaad, Charles K., Devijver, Emilie, Gaussier, Eric
This paper introduces a new structural causal model tailored for representing threshold-based IT systems and presents a new algorithm designed to rapidly detect root causes of anomalies in such systems. When root causes are not causally related, the method is proven to be correct; while an extension is proposed based on the intervention of an agent to relax this assumption. Our algorithm and its agent-based extension leverage causal discovery from offline data and engage in subgraph traversal when encountering new anomalies in online data. Our extensive experiments demonstrate the superior performance of our methods, even when applied to data generated from alternative structural causal models or real IT monitoring data.
Identifiability of total effects from abstractions of time series causal graphs
Assaad, Charles K., Devijver, Emilie, Gaussier, Eric, Gössler, Gregor, Meynaoui, Anouar
We study the problem of identifiability of the total effect of an intervention from observational time series only given an abstraction of the causal graph of the system. Specifically, we consider two types of abstractions: the extended summary causal graph which conflates all lagged causal relations but distinguishes between lagged and instantaneous relations; and the summary causal graph which does not give any indication about the lag between causal relations. We show that the total effect is always identifiable in extended summary causal graphs and we provide necessary and sufficient graphical conditions for identifiability in summary causal graphs. Furthermore, we provide adjustment sets allowing to estimate the total effect whenever it is identifiable.
Case Studies of Causal Discovery from IT Monitoring Time Series
Aït-Bachir, Ali, Assaad, Charles K., de Bignicourt, Christophe, Devijver, Emilie, Ferreira, Simon, Gaussier, Eric, Mohanna, Hosein, Zan, Lei
Information technology (IT) systems are vital for modern businesses, handling data storage, communication, and process automation. Monitoring these systems is crucial for their proper functioning and efficiency, as it allows collecting extensive observational time series data for analysis. The interest in causal discovery is growing in IT monitoring systems as knowing causal relations between different components of the IT system helps in reducing downtime, enhancing system performance and identifying root causes of anomalies and incidents. It also allows proactive prediction of future issues through historical data analysis. Despite its potential benefits, applying causal discovery algorithms on IT monitoring data poses challenges, due to the complexity of the data. For instance, IT monitoring data often contains misaligned time series, sleeping time series, timestamp errors and missing values. This paper presents case studies on applying causal discovery algorithms to different IT monitoring datasets, highlighting benefits and ongoing challenges.
Hybrids of Constraint-based and Noise-based Algorithms for Causal Discovery from Time Series
Assaad, Charles K., Bystrova, Daria, Arbel, Julyan, Devijver, Emilie, Gaussier, Eric, Thuiller, Wilfried
Constraint-based and noise-based methods have been proposed to discover summary causal graphs from observational time series under strong assumptions which can be violated or impossible to verify in real applications. Recently, a hybrid method (Assaad et al, 2021) that combines these two approaches, proved to be robust to assumption violation. However, this method assumes that the summary causal graph is acyclic, but cycles are common in many applications. For example, in ecological communities, there may be cyclic relationships between predator and prey populations, creating feedback loops. Therefore, this paper presents two new frameworks for hybrids of constraint-based and noise-based methods that can discover summary causal graphs that may or may not contain cycles. For each framework, we provide two hybrid algorithms which are experimentally tested on simulated data, realistic ecological data, and real data from various applications. Experiments show that our hybrid approaches are robust and yield good results over most datasets.
Entropy-based Discovery of Summary Causal Graphs in Time Series
Assaad, Karim, Devijver, Emilie, Gaussier, Eric, Ait-Bachir, Ali
We address in this study the problem of learning a summary causal graph on time series with potentially different sampling rates. To do so, we first propose a new temporal mutual information measure defined on a window-based representation of time series. We then show how this measure relates to an entropy reduction principle that can be seen as a special case of the Probabilistic Raising Principle. We finally combine these two ingredients in a PC-like algorithm to construct the summary causal graph. This algorithm is evaluated on several datasets that shows both its efficacy and efficiency.
Terminology-based Text Embedding for Computing Document Similarities on Technical Content
Mirisaee, Hamid, Gaussier, Eric, Lagnier, Cedric, Guerraz, Agnes
We propose in this paper a new, hybrid document embedding approach in order to address the problem of document similarities with respect to the technical content. To do so, we employ a state-of-the-art graph techniques to first extract the keyphrases (composite keywords) of documents and, then, use them to score the sentences. Using the ranked sentences, we propose two approaches to embed documents and show their performances with respect to two baselines. With domain expert annotations, we illustrate that the proposed methods can find more relevant documents and outperform the baselines up to 27% in terms of NDCG.
Implicit Discourse Relation Classification with Syntax-Aware Contextualized Word Representations
Popa, Diana Nicoleta (Université Grenoble Alpes) | Perez, Julien (Naver Labs Europe) | Henderson, James (Idiap Research Institute) | Gaussier, Eric (Université Grenoble Alpes)
Automatically identifying implicit discourse relations requires an in-depth semantic understanding of the text fragments involved in such relations. While early work investigated the usefulness of different classes of input features, current state-of-the-art models mostly rely on standard pre-trained word embeddings to model the arguments of a discourse relation. In this paper, we introduce a method to compute contextualized representations of words, leveraging information from the sentence dependency parse, to improve argument representation. The resulting token embeddings encode the structure of the sentence from a dependency point of view in their representations. Experimental results show that the proposed representations achieve state-of-the-art results when input to standard neural network architectures, surpassing complex models that use additional data and consider the interaction between arguments.
On Inductive Abilities of Latent Factor Models for Relational Learning
Trouillon, Théo, Gaussier, Eric, Dance, Christopher R., Bouchard, Guillaume
Latent factor models are increasingly popular for modeling multi-relational knowledge graphs. By their vectorial nature, it is not only hard to interpret why this class of models works so well, but also to understand where they fail and how they might be improved. We conduct an experimental survey of state-of-the-art models, not towards a purely comparative end, but as a means to get insight about their inductive abilities. To assess the strengths and weaknesses of each model, we create simple tasks that exhibit first, atomic properties of binary relations, and then, common inter-relational inference through synthetic genealogies. Based on these experimental results, we propose new research directions to improve on existing models.
Dealing with Uncertain Inputs in Regression Trees
Tami, Myriam, Clausel, Marianne, Devijver, Emilie, Dulac, Adrien, Gaussier, Eric, Janaqi, Stefan, Chebre, Meriam
Tree-based ensemble methods, as Random Forests and Gradient Boosted Trees, have been successfully used for regression in many applications and research studies. Furthermore, these methods have been extended in order to deal with uncertainty in the output variable, using for example a quantile loss in Random Forests (Meinshausen, 2006). To the best of our knowledge, no extension has been provided yet for dealing with uncertainties in the input variables, even though such uncertainties are common in practical situations. We propose here such an extension by showing how standard regression trees optimizing a quadratic loss can be adapted and learned while taking into account the uncertainties in the input. By doing so, one no longer assumes that an observation lies into a single region of the regression tree, but rather that it belongs to each region with a certain probability. Experiments conducted on several data sets illustrate the good behavior of the proposed extension.
Deep $k$-Means: Jointly Clustering with $k$-Means and Learning Representations
Fard, Maziar Moradi, Thonet, Thibaut, Gaussier, Eric
We study in this paper the problem of jointly clustering and learning representations. As several previous studies have shown, learning representations that are both faithful to the data to be clustered and adapted to the clustering algorithm can lead to better clustering performance, all the more so that the two tasks are performed jointly. We propose here such an approach for $k$-Means clustering based on a continuous reparametrization of the objective function that leads to a truly joint solution. The behavior of our approach is illustrated on various datasets showing its efficacy in learning representations for objects while clustering them.