Clustering
Customizable Reference Runtime Monitoring of Neural Networks using Resolution Boxes
Wu, Changshun, Falcone, Yliès, Bensalem, Saddek
We present an approach for the runtime verification of classification systems via data abstraction. Data abstraction relies on the notion of box with a resolution. Boxbased abstraction consists in representing a set of values by its minimal and maximal values in each dimension. We augment boxes with a notion of resolution; this allows to define the notion of clustering coverage, which is intuitively a quantitative metric over boxes that indicates the quality of the abstraction. This allows studying the effect of different clustering parameters on the constructed boxes and estimating an interval of sub-optimal parameters. Moreover, we show how to automatically construct monitors that make use of both the correct and incorrect behaviors of a classification system. This allows checking the size of the monitor abstractions and analysing the separability of the network. Monitors are obtained by combining the sub-monitors of each class of the system placed at some selected layers. Our experiments demonstrate the effectiveness of our clustering coverage estimation and show how to assess the effectiveness and precision of monitors according to the selected clustering parameter and the chosen monitored layers.
Supporting DNN Safety Analysis and Retraining through Heatmap-based Unsupervised Learning
Fahmy, Hazem, Pastore, Fabrizio, Bagherzadeh, Mojtaba, Briand, Lionel
Deep neural networks (DNNs) are increasingly important in safety-critical systems, for example in their perception layer to analyze images. Unfortunately, there is a lack of methods to ensure the functional safety of DNN-based components. We observe three major challenges with existing practices regarding DNNs in safety-critical systems: (1) scenarios that are underrepresented in the test set may lead to serious safety violation risks, but may, however, remain unnoticed; (2) characterizing such high-risk scenarios is critical for safety analysis; (3) retraining DNNs to address these risks is poorly supported when causes of violations are difficult to determine. To address these problems in the context of DNNs analyzing images, we propose HUDD, an approach that automatically supports the identification of root causes for DNN errors. HUDD identifies root causes by applying a clustering algorithm to heatmaps capturing the relevance of every DNN neuron on the DNN outcome. Also, HUDD retrains DNNs with images that are automatically selected based on their relatedness to the identified image clusters. We evaluated HUDD with DNNs from the automotive domain. HUDD was able to identify all the distinct root causes of DNN errors, thus supporting safety analysis. Also, our retraining approach has shown to be more effective at improving DNN accuracy than existing approaches.
Knowledge Triggering, Extraction and Storage via Human-Robot Verbal Interaction
Grassi, Lucrezia, Recchiuto, Carmine Tommaso, Sgorbissa, Antonio
This article describes a novel approach to expand in run-time the knowledge base of an Artificial Conversational Agent. A technique for automatic knowledge extraction from the user's sentence and four methods to insert the new acquired concepts in the knowledge base have been developed and integrated into a system that has already been tested for knowledge-based conversation between a social humanoid robot and residents of care homes. The run-time addition of new knowledge allows overcoming some limitations that affect most robots and chatbots: the incapability of engaging the user for a long time due to the restricted number of conversation topics. The insertion in the knowledge base of new concepts recognized in the user's sentence is expected to result in a wider range of topics that can be covered during an interaction, making the conversation less repetitive. Two experiments are presented to assess the performance of the knowledge extraction technique, and the efficiency of the developed insertion methods when adding several concepts in the Ontology.
Understanding and Accelerating EM Algorithm's Convergence by Fair Competition Principle and Rate-Verisimilitude Function
Why can the Expectation-Maximization (EM) algorithm for mixture models converge? Why can different initial parameters cause various convergence difficulties? The Q-L synchronization theory explains that the observed data log-likelihood L and the complete data log-likelihood Q are positively correlated; we can achieve maximum L by maximizing Q. According to this theory, the Deterministic Annealing EM (DAEM) algorithm's authors make great efforts to eliminate locally maximal Q for avoiding L's local convergence. However, this paper proves that in some cases, Q may and should decrease for L to increase; slow or local convergence exists only because of small samples and unfair competition. This paper uses marriage competition to explain different convergence difficulties and proposes the Fair Competition Principle (FCP) with an initialization map for improving initializations. It uses the rate-verisimilitude function, extended from the rate-distortion function, to explain the convergence of the EM and improved EM algorithms. This convergence proof adopts variational and iterative methods that Shannon et al. used for analyzing rate-distortion functions. The initialization map can vastly save both algorithms' running times for binary Gaussian mixtures. The FCP and the initialization map are useful for complicated mixtures but not sufficient; we need further studies for specific methods.
Skeleton Clustering: Dimension-Free Density-based Clustering
We introduce a density-based clustering method called skeleton clustering that can detect clusters in multivariate and even high-dimensional data with irregular shapes. To bypass the curse of dimensionality, we propose surrogate density measures that are less dependent on the dimension but have intuitive geometric interpretations. The clustering framework constructs a concise representation of the given data as an intermediate step and can be thought of as a combination of prototype methods, density-based clustering, and hierarchical clustering. We show by theoretical analysis and empirical studies that the skeleton clustering leads to reliable clusters in multivariate and high-dimensional scenarios.
Modeling Classroom Occupancy using Data of WiFi Infrastructure in a University Campus
Mohottige, Iresha Pasquel, Gharakheili, Hassan Habibi, Sivaraman, Vijay, Moors, Tim
Universities worldwide are experiencing a surge in enrollments, therefore campus estate managers are seeking continuous data on attendance patterns to optimize the usage of classroom space. As a result, there is an increasing trend to measure classrooms attendance by employing various sensing technologies, among which pervasive WiFi infrastructure is seen as a low cost method. In a dense campus environment, the number of connected WiFi users does not well estimate room occupancy since connection counts are polluted by adjoining rooms, outdoor walkways, and network load balancing. In this paper, we develop machine learning based models to infer classroom occupancy from WiFi sensing infrastructure. Our contributions are three-fold: (1) We analyze metadata from a dense and dynamic wireless network comprising of thousands of access points (APs) to draw insights into coverage of APs, behavior of WiFi connected users, and challenges of estimating room occupancy; (2) We propose a method to automatically map APs to classrooms using unsupervised clustering algorithms; and (3) We model classroom occupancy using a combination of classification and regression methods of varying algorithms. We achieve 84.6% accuracy in mapping APs to classrooms while the accuracy of our estimation for room occupancy is comparable to beam counter sensors with a symmetric Mean Absolute Percentage Error (sMAPE) of 13.10%.
Machine Learning A-Z : Hands-On Python & R In Data Science
Learn to create Machine Learning Algorithms in Python and R from two Data Science experts. And in this section we're talking about the K means clustering algorithm. And in this tutorial we're going to talk about the intuition behind Kamins. So Kamins is a algorithm that allows you to closter your data and as we will see it's a very convenient tool for discovering categories of groups in your data set that you wouldn't have otherwise thought of yourself. And in this section or in this specific tutorial we'll learn how to understand k means on an intuitive level and we'll see an example of Hardwick's.
Heterogeneous Tensor Mixture Models in High Dimensions
Cai, Biao, Zhang, Jingfei, Sun, Will Wei
We consider the problem of jointly modeling and clustering populations of tensors by introducing a flexible high-dimensional tensor mixture model with heterogeneous covariances. The proposed mixture model exploits the intrinsic structures of tensor data, and is assumed to have means that are low-rank and internally sparse as well as heterogeneous covariances that are separable and conditionally sparse. We develop an efficient high-dimensional expectation-conditional-maximization (HECM) algorithm that breaks the challenging optimization in the M-step into several simpler conditional optimization problems, each of which is convex, admits regularization and has closed-form updating formulas. We show that the proposed HECM algorithm, with an appropriate initialization, converges geometrically to a neighborhood that is within statistical precision of the true parameter. Such a theoretical analysis is highly nontrivial due to the dual non-convexity arising from both the EM-type estimation and the non-convex objective function in the M-step. The efficacy of our proposed method is demonstrated through simulation studies and an application to an autism spectrum disorder study, where our analysis identifies important brain regions for diagnosis.
Variational Co-embedding Learning for Attributed Network Clustering
Yang, Shuiqiao, Verma, Sunny, Cai, Borui, Jiang, Jiaojiao, Yu, Kun, Chen, Fang, Yu, Shui
Abstract--Recent works for attributed network clustering utilize graph convolution to obtain node embeddings and simultaneously perform clustering assignments on the embedding space. It is effective since graph convolution combines the structural and attributive information for node embedding learning. However, a major limitation of such works is that the graph convolution only incorporates the attribute information from the local neighborhood of nodes but fails to exploit the mutual affinities between nodes and attributes. In this regard, we propose a variational co-embedding learning model for attributed network clustering (VCLANC). VCLANC is composed of dual variational auto-encoders to simultaneously embed nodes and attributes. Relying on this, the mutual affinity information between nodes and attributes could be reconstructed from the embedding space and served as extra self-supervised knowledge for representation learning. At the same time, trainable Gaussian mixture model is used as priors to infer the node clustering assignments. T o strengthen the performance of the inferred clusters, we use a mutual distance loss on the centers of the Gaussian priors and a clustering assignment hardening loss on the node embeddings. Experimental results on four real-world attributed network datasets demonstrate the effectiveness of the proposed VCLANC for attributed network clustering. Finding accurate communities or clusters in an attributed network is critical to understand the complex network structures for many downstream applications like group recommendation, user-targeted online advertising, and disease protein discovery [1], [2], [3]. Though the clustering for attributed network brings many important applications, but it also poses significant new challenges. Firstly, as the attributed network includes not only structural connections but also attribute values, it is difficult to naturally combine and leverage the two types of information in the process of clustering [4]. Furthermore, it is usually hard to find supervision information to guide the cluster discovery in attributed network [5], [6], [7], [8]. T o handle the challenges, many network embedding and graph neural network related methods have been developed recently for node representation learning to improve the accuracy of downstream applications like graph classification, link prediction and graph clustering.
Exact and Approximate Hierarchical Clustering Using A*
Greenberg, Craig S., Macaluso, Sebastian, Monath, Nicholas, Dubey, Avinava, Flaherty, Patrick, Zaheer, Manzil, Ahmed, Amr, Cranmer, Kyle, McCallum, Andrew
Hierarchical clustering is a critical task in numerous domains. Many approaches are based on heuristics and the properties of the resulting clusterings are studied post hoc. However, in several applications, there is a natural cost function that can be used to characterize the quality of the clustering. In those cases, hierarchical clustering can be seen as a combinatorial optimization problem. To that end, we introduce a new approach based on A* search. We overcome the prohibitively large search space by combining A* with a novel \emph{trellis} data structure. This combination results in an exact algorithm that scales beyond previous state of the art, from a search space with $10^{12}$ trees to $10^{15}$ trees, and an approximate algorithm that improves over baselines, even in enormous search spaces that contain more than $10^{1000}$ trees. We empirically demonstrate that our method achieves substantially higher quality results than baselines for a particle physics use case and other clustering benchmarks. We describe how our method provides significantly improved theoretical bounds on the time and space complexity of A* for clustering.