Clustering
Clustering-based Identification of Precursors of Extreme Events in Chaotic Systems
Golyska, Urszula, Doan, Nguyen Anh Khoa
Abrupt and rapid high-amplitude changes in a dynamical system's states known as extreme event appear in many processes occurring in nature, such as drastic climate patterns, rogue waves, or avalanches. These events often entail catastrophic effects, therefore their description and prediction is of great importance. However, because of their chaotic nature, their modelling represents a great challenge up to this day. The applicability of a data-driven modularity-based clustering technique to identify precursors of rare and extreme events in chaotic systems is here explored. The proposed identification framework based on clustering of system states, probability transition matrices and state space tessellation was developed and tested on two different chaotic systems that exhibit extreme events: the Moehliss-Faisst-Eckhardt model of self-sustained turbulence and the 2D Kolmogorov flow. Both exhibit extreme events in the form of bursts in kinetic energy and dissipation. It is shown that the proposed framework provides a way to identify pathways towards extreme events and predict their occurrence from a probabilistic standpoint. The clustering algorithm correctly identifies the precursor states leading to extreme events and allows for a statistical description of the system's states and its precursors to extreme events.
A Survey on Safety-Critical Driving Scenario Generation -- A Methodological Perspective
Ding, Wenhao, Xu, Chejian, Arief, Mansur, Lin, Haohong, Li, Bo, Zhao, Ding
Autonomous driving systems have witnessed a significant development during the past years thanks to the advance in machine learning-enabled sensing and decision-making algorithms. One critical challenge for their massive deployment in the real world is their safety evaluation. Most existing driving systems are still trained and evaluated on naturalistic scenarios collected from daily life or heuristically-generated adversarial ones. However, the large population of cars, in general, leads to an extremely low collision rate, indicating that the safety-critical scenarios are rare in the collected real-world data. Thus, methods to artificially generate scenarios become crucial to measure the risk and reduce the cost. In this survey, we focus on the algorithms of safety-critical scenario generation in autonomous driving. We first provide a comprehensive taxonomy of existing algorithms by dividing them into three categories: data-driven generation, adversarial generation, and knowledge-based generation. Then, we discuss useful tools for scenario generation, including simulation platforms and packages. Finally, we extend our discussion to five main challenges of current works -- fidelity, efficiency, diversity, transferability, controllability -- and research opportunities lighted up by these challenges.
Unsupervised Framework for Evaluating and Explaining Structural Node Embeddings of Graphs
Dehghan, Ashkan, Siuta, Kinga, Skorupka, Agata, Betlen, Andrei, Miller, David, Kaminski, Bogumil, Pralat, Pawel
An embedding is a mapping from a set of nodes of a network into a real vector space. Embeddings can have various aims like capturing the underlying graph topology and structure, node-to-node relationship, or other relevant information about the graph, its subgraphs or nodes themselves. A practical challenge with using embeddings is that there are many available variants to choose from. Selecting a small set of most promising embeddings from the long list of possible options for a given task is challenging and often requires domain expertise. Embeddings can be categorized into two main types: classical embeddings and structural embeddings. Classical embeddings focus on learning both local and global proximity of nodes, while structural embeddings learn information specifically about the local structure of nodes' neighbourhood. For classical node embeddings there exists a framework which helps data scientists to identify (in an unsupervised way) a few embeddings that are worth further investigation. Unfortunately, no such framework exists for structural embeddings. In this paper we propose a framework for unsupervised ranking of structural graph embeddings. The proposed framework, apart from assigning an aggregate quality score for a structural embedding, additionally gives a data scientist insights into properties of this embedding. It produces information which predefined node features the embedding learns, how well it learns them, and which dimensions in the embedded space represent the predefined node features. Using this information the user gets a level of explainability to an otherwise complex black-box embedding algorithm.
Voter Coalitions and democracy in Decentralized Finance: Evidence from MakerDAO
Sun, Xiaotong, Chen, Xi, Stasinakis, Charalampos, Sermpinis, Georgios
Decentralized Autonomous Organization (DAO) provides a decentralized governance solution through blockchain, where decision-making process relies on on-chain voting and follows majority rule. This paper focuses on MakerDAO, and we find three voter coalitions after applying clustering algorithm to voting history. The emergence of a dominant voter coalition is a signal of governance centralization in DAO, and voter coalitions have complicated influence on Maker protocol, which is governed by MakerDAO. This paper presents empirical evidence of multicoalition democracy in DAO and further contributes to the contemporary debate on whether decentralized governance is possible.
Instance-Optimal Cluster Recovery in the Labeled Stochastic Block Model
Ariu, Kaito, Proutiere, Alexandre, Yun, Se-Young
Community detection or clustering refers to the task of gathering similar items into a few groups from the data that, most often, correspond to observations of pair-wise interactions between items Newman and Girvan [2004]. A benchmark commonly used to assess the performance of clustering algorithms is the celebrated Stochastic Block Model (SBM) Holland et al. [1983], where pair-wise interactions are represented by a random graph. In this graph, the vertices correspond to items, and the presence of an edge between two items indicates their interaction. The SBM has been extensively studied over the last two decades; for a recent survey, see Abbe [2018]. However, it provides a relatively simplistic view of how items may interact. In real applications, interactions can be of different types (e.g., represented by ratings in recommender systems or a level of proximity between users in a social network). To capture this richer information about item interactions, the Labeled Stochastic Block Model (LSBM), proposed and analyzed in Heimlicher et al. [2012], Lelarge et al. [2013], Yun and Proutiere [2016], describes interactions by labels drawn from an arbitrary collection. The objective of this paper is to devise a clustering algorithm that, based on the observation of these labels, reconstructs the clusters of items while minimizing the expected number of misclassified items. In the following, we formally introduce LSBMs and outline our results.
A Cloud-based Machine Learning Pipeline for the Efficient Extraction of Insights from Customer Reviews
Lakatos, Robert, Bogacsovics, Gergo, Harangi, Balazs, Lakatos, Istvan, Tiba, Attila, Toth, Janos, Szabo, Marianna, Hajdu, Andras
Abstract-- The efficiency of natural language processing the text may be considered noise, depending on how we view has improved dramatically with the advent of the data and what we think is relevant. This in turn makes machine learning models, particularly neural networkbased it difficult to solve this task: it is not enough to find some solutions. However, some tasks are still challenging, specific information in texts, but we also have to decide what especially when considering specific domains. In this information we need based on the texts. Our application is an end-to-end methods integrated into a pipeline. For topic modeling, system that runs on a cloud-based infrastructure and can be our composite model uses transformer-based neural networks used as a service by small and medium-sized businesses. Furthermore, our some aspects, outperforms currently available approaches. Our environment, we needed a model that could identify topics approach is validated and compared with other stateof-the-art in texts and provide a way to determine which topics are relevant, methods using publicly available datasets for given our analysis goals. We focused on these tools Users of social platforms, forums, and online stores generate because we found them to be the most appropriate for our a significant amount of textual data. One of the most goals based on the available literature. It was also important useful applications of machine learning-based text processing to us that these methods and complex models have a stable is to find words and phrases that describe the content of implementation, are verifiably executable in our cloud-based these texts.
Principled and Efficient Motif Finding for Structure Learning of Lifted Graphical Models
Feldstein, Jonathan, Phillips, Dominic, Tsamoura, Efthymia
Structure learning is a core problem in AI central to the fields of neuro-symbolic AI and statistical relational learning. It consists in automatically learning a logical theory from data. The basis for structure learning is mining repeating patterns in the data, known as structural motifs. Finding these patterns reduces the exponential search space and therefore guides the learning of formulas. Despite the importance of motif learning, it is still not well understood. We present the first principled approach for mining structural motifs in lifted graphical models, languages that blend first-order logic with probabilistic models, which uses a stochastic process to measure the similarity of entities in the data. Our first contribution is an algorithm, which depends on two intuitive hyperparameters: one controlling the uncertainty in the entity similarity measure, and one controlling the softness of the resulting rules. Our second contribution is a preprocessing step where we perform hierarchical clustering on the data to reduce the search space to the most relevant data. Our third contribution is to introduce an O(n ln n) (in the size of the entities in the data) algorithm for clustering structurally-related data. We evaluate our approach using standard benchmarks and show that we outperform state-of-the-art structure learning approaches by up to 6% in terms of accuracy and up to 80% in terms of runtime.
Multi-dimensional concept discovery (MCD): A unifying framework with completeness guarantees
Vielhaben, Johanna, Blücher, Stefan, Strodthoff, Nils
For the trustworthy application of XAI, in particular for high-stake decisions, a more global model understanding is required. To this end, concept-based methods have been proposed, which are however not guaranteed to be bound to the actual model reasoning. To circumvent this problem, we propose Multi-dimensional Concept Discovery (MCD) as an extension of previous approaches that fulfills a completeness relation on the level of concepts. Our method starts from general linear subspaces as concepts and does neither require reinforcing concept interpretability nor re-training of model parts. We propose sparse subspace clustering to discover improved concepts and fully leverage the potential of multi-dimensional subspaces. MCD offers two complementary analysis tools for concepts in input space: (1) concept activation maps, that show where a concept is expressed within a sample, allowing for concept characterization through prototypical samples, and (2) concept relevance heatmaps, that decompose the model decision into concept contributions. Both tools together enable a detailed global understanding of the model reasoning, which is guaranteed to relate to the model via a completeness relation. Thus, MCD paves the way towards more trustworthy concept-based XAI. We empirically demonstrate the superiority of MCD against more constrained concept definitions.
High-dimensional Clustering onto Hamiltonian Cycle
Huang, Tianyi, Cheng, Shenghui, Li, Stan Z., Zhang, Zhengjun
Clustering aims to group unlabelled samples based on their similarities. It has become a significant tool for the analysis of high-dimensional data. However, most of the clustering methods merely generate pseudo labels and thus are unable to simultaneously present the similarities between different clusters and outliers. This paper proposes a new framework called High-dimensional Clustering onto Hamiltonian Cycle (HCHC) to solve the above problems. First, HCHC combines global structure with local structure in one objective function for deep clustering, improving the labels as relative probabilities, to mine the similarities between different clusters while keeping the local structure in each cluster. Then, the anchors of different clusters are sorted on the optimal Hamiltonian cycle generated by the cluster similarities and mapped on the circumference of a circle. Finally, a sample with a higher probability of a cluster will be mapped closer to the corresponding anchor. In this way, our framework allows us to appreciate three aspects visually and simultaneously - clusters (formed by samples with high probabilities), cluster similarities (represented as circular distances), and outliers (recognized as dots far away from all clusters). The experiments illustrate the superiority of HCHC.
Adversarially robust clustering with optimality guarantees
Jana, Soham, Yang, Kun, Kulkarni, Sanjeev
We consider the problem of clustering data points coming from sub-Gaussian mixtures. Existing methods that provably achieve the optimal mislabeling error, such as the Lloyd algorithm, are usually vulnerable to outliers. In contrast, clustering methods seemingly robust to adversarial perturbations are not known to satisfy the optimal statistical guarantees. We propose a simple algorithm that obtains the optimal mislabeling rate even when we allow adversarial outliers to be present. Our algorithm achieves the optimal error rate in constant iterations when a weak initialization condition is satisfied. In the absence of outliers, in fixed dimensions, our theoretical guarantees are similar to that of the Lloyd algorithm. Extensive experiments on various simulated data sets are conducted to support the theoretical guarantees of our method.