Clustering
Graph Spectral Feature Learning for Mixed Data of Categorical and Numerical Type
Sahoo, Saswata, Chakraborty, Souradip
Feature learning in the presence of mixed variable types, numerical and categorical, is an important issue for related modeling problems. For simple neighborhood queries in a mixed data space, standard practice is to consider numerical and categorical variables separately and combine them using suitable distance functions. Alternatives, such as kernel learning or principal component analysis, do not explicitly consider the inter-dependence structure among the mixed-type variables. In this work, we propose a novel strategy to explicitly model the probabilistic dependence structure among the mixed-type variables by an undirected graph. Spectral decomposition of the graph Laplacian provides the desired feature transformation. The eigenspectrum of the transformed feature space shows increased separability and more prominent clusterability among the observations. The main novelty of our paper lies in capturing interactions of the mixed feature types in an unsupervised framework using a graphical model. We numerically validate the implications of the feature learning strategy
Internal Audit Applications of AI: It Doesn't Have to Be Complicated to Be Effective - The Protiviti View
For many internal auditors, artificial intelligence (AI) may seem like a daunting topic to tackle -- but that shouldn't stop them from considering how they can apply it to their work. Tools exist that give auditors powerful, straightforward techniques to enhance their work. With an increased focus and urgency around the use of data to support internal audit activities, the time for next-generation pursuits, such as the use of AI, is now. Following up on a previous blog post discussing the basics of AI for auditors, here we offer our thoughts on how internal audit organizations can get started with AI methods, such as machine learning (ML), to increase efficiency and coverage, better assign resources to areas that matter most, deliver more insight and even help identify leading indicators of risk. We also offer a specific example of ML applied to internal audit.
Machine Learning Doesn't Have to Be Complex
ML is an application of AI in which the system itself is designed with the ability to learn and improve from experience.
Stochastic Sparse Subspace Clustering
Chen, Ying, Li, Chun-Guang, You, Chong
State-of-the-art subspace clustering methods are based on the self-expressive model, which represents each data point as a linear combination of other data points. By enforcing such a representation to be sparse, sparse subspace clustering is guaranteed to produce a subspace-preserving data affinity where two points are connected only if they are from the same subspace. However, data points from the same subspace may not be well-connected, leading to the issue of over-segmentation. We introduce dropout to address over-segmentation, based on randomly dropping out data points in the self-expressive model. In particular, we show that dropout is equivalent to adding a squared $\ell_2$ norm regularization on the representation coefficients, and therefore induces denser solutions. Then, we reformulate the optimization problem as a consensus problem over a set of small-scale subproblems. This leads to a scalable and flexible sparse subspace clustering approach, termed Stochastic Sparse Subspace Clustering, which can effectively handle large-scale datasets. Extensive experiments on synthetic data and real-world datasets validate the efficiency and effectiveness of our proposal.
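The claimed equivalence between dropout and squared $\ell_2$ regularization can be checked with a short expectation computation. The following is only a sketch under assumptions that the abstract does not state: the symbols $x_j$ (the point being represented), $c_i$ (its coefficients), $\delta$ (the dropout rate), and the rescaled Bernoulli variables $\xi_i$ are introduced here for illustration.

```latex
% Assumed: independent \xi_i with P(\xi_i = 0) = \delta and
% P(\xi_i = 1/(1-\delta)) = 1 - \delta, so that
% E[\xi_i] = 1 and Var(\xi_i) = \delta / (1 - \delta).
\mathbb{E}_{\xi}\Bigl\| x_j - \sum_{i \neq j} \xi_i c_i x_i \Bigr\|_2^2
  = \Bigl\| x_j - \sum_{i \neq j} c_i x_i \Bigr\|_2^2
  + \frac{\delta}{1-\delta} \sum_{i \neq j} \| x_i \|_2^2 \, c_i^2
```

The cross term vanishes because each $\xi_i - 1$ has zero mean, and independence leaves only the per-coefficient variances. If the data points are normalized to unit $\ell_2$ norm, the last term reduces to $\frac{\delta}{1-\delta}\|c\|_2^2$, a squared $\ell_2$ penalty whose weight grows with the dropout rate.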
Integrated Time Series Summarization and Prediction Algorithm and its Application to COVID-19 Data Mining
This paper proposes a simple method to extract from a set of multiple related time series a compressed representation for each time series based on statistics for the entire set of all time series. This is achieved by a hierarchical algorithm that first generates an alphabet of shapelets based on the segmentation of centroids for clustered data, before labels of these shapelets are assigned to the segmentation of each single time series via nearest neighbor search using unconstrained dynamic time warping as the distance measure to deal with non-uniform time series lengths. Thereby, a sequence of labels is assigned to each time series. Completion of the last label sequence permits prediction of individual time series. The proposed method is evaluated on two global COVID-19 datasets, first, for the number of daily net cases (daily new infections minus daily recoveries), and, second, for the number of daily deaths attributed to COVID-19 as of April 27, 2020. The first dataset involves 249 time series for different countries, each of length 96. The second dataset involves 264 time series, each of length 96. Based on detected anomalies in available data, a decentralized exit strategy from lockdowns is advocated.
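The label-assignment step described above (nearest neighbor search under unconstrained dynamic time warping) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the absolute-difference local cost, and the toy shapelet alphabet are all assumptions made here.

```python
from math import inf


def dtw_distance(a, b):
    """Unconstrained DTW distance between two numeric sequences of any lengths."""
    n, m = len(a), len(b)
    # cost[i][j] is the minimal accumulated cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])  # assumed local cost: absolute difference
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def nearest_shapelet_label(segment, alphabet):
    """Return the label of the shapelet closest to `segment` under DTW."""
    return min(alphabet, key=lambda label: dtw_distance(segment, alphabet[label]))


# Hypothetical alphabet of shapelets and segments of unequal lengths.
alphabet = {"rise": [0, 1, 2, 3], "flat": [1, 1, 1], "fall": [3, 2, 1, 0]}
segments = [[0, 0, 1, 2, 3, 3], [1, 1, 1, 1], [3, 1, 0]]
labels = [nearest_shapelet_label(s, alphabet) for s in segments]
print(labels)
```

Because DTW warps the time axis, segments and shapelets do not need to have the same length, which is the property the abstract relies on.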
A Benchmark Study on Time Series Clustering
Javed, Ali, Lee, Byung Suk, Rizzo, Dona M.
This paper presents the first time series clustering benchmark utilizing all time series datasets currently available in the University of California Riverside (UCR) archive -- the state-of-the-art repository of time series data. Specifically, the benchmark examines eight popular clustering methods representing three categories of clustering algorithms (partitional, hierarchical and density-based) and three types of distance measures (Euclidean, dynamic time warping, and shape-based). We lay out six restrictions with special attention to making the benchmark as unbiased as possible. A phased evaluation approach was then designed for summarizing dataset-level assessment metrics and discussing the results. The benchmark study presented can be a useful reference for the research community on its own, and the dataset-level assessment metrics reported may be used for designing evaluation frameworks to answer different research questions.
A Neuromorphic Paradigm for Online Unsupervised Clustering
A computational paradigm based on neuroscientific concepts is proposed and shown to be capable of online unsupervised clustering. Because it is an online method, it is readily amenable to streaming real-time applications and is capable of dynamically adjusting to macro-level input changes. All operations, both training and inference, are localized and efficient. The paradigm is implemented as a cognitive column that incorporates five key elements: 1) temporal coding, 2) an excitatory neuron model for inference, 3) winner-take-all inhibition, 4) a column architecture that combines excitation and inhibition, 5) localized training via spike-timing-dependent plasticity (STDP). These elements are described and discussed, and a prototype column is given. The prototype column is simulated with a semi-synthetic benchmark and is shown to have performance characteristics on par with classic k-means. Simulations reveal the inner operation and capabilities of the column with emphasis on excitatory neuron response functions and STDP implementations.
Target specific mining of COVID-19 scholarly articles using one-class approach
Sonbhadra, Sanjay Kumar, Agarwal, Sonali, Nagabhushan, P.
In recent years, several research articles have been published on coronavirus-caused diseases such as severe acute respiratory syndrome (SARS), Middle East respiratory syndrome (MERS) and COVID-19. Given the large number of research articles, extracting the best-suited ones is time-consuming and impractical to do manually. The objective of this paper is to extract the activity and trends of coronavirus-related research articles using machine learning approaches. The COVID-19 open research dataset (CORD-19) is used for experiments, and several target tasks, along with explanations, are defined for classification based on domain knowledge. Clustering techniques are used to create clusters of the available articles, and the task assignment is then performed using parallel one-class support vector machines (OCSVMs). Experiments with original and reduced features validate the performance of the approach. The results indicate that the k-means clustering algorithm, followed by parallel OCSVMs, outperforms other methods for both the original and the reduced feature space.
Concept Drift Detection via Equal Intensity k-means Space Partitioning
Liu, Anjin, Lu, Jie, Zhang, Guangquan
Data streams pose additional challenges to statistical classification tasks because the distributions of the training and target samples may differ as time passes. Such distribution change in streaming data is called concept drift. Numerous histogram-based distribution change detection methods have been proposed to detect drift. Most histograms are built on grid-based or tree-based space partitioning algorithms, which makes the space partitions arbitrary and unexplainable and may cause drift blind spots. There is a need to improve the drift detection accuracy of histogram-based methods in the unsupervised setting. To address this problem, we propose a cluster-based histogram, called equal intensity k-means space partitioning (EI-kMeans). In addition, a heuristic method to improve the sensitivity of drift detection is introduced. The fundamental idea of improving the sensitivity is to minimize the risk of creating partitions in distribution offset regions. Pearson's chi-square test is used as the statistical hypothesis test so that the test statistics remain independent of the sample distribution. The number of bins and their shapes, which strongly influence the ability to detect drift, are determined dynamically from the sample based on an asymptotic constraint in the chi-square test. Accordingly, three algorithms are developed to implement concept drift detection: a greedy centroids initialization algorithm, a cluster amplify-shrink algorithm, and a drift detection algorithm. For drift adaptation, we recommend retraining the learner if a drift is detected. The results of experiments on synthetic and real-world datasets demonstrate the advantages of EI-kMeans and show its efficacy in detecting concept drift.
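The general idea of a cluster-based histogram compared with Pearson's chi-square test can be sketched as below. This is not EI-kMeans itself: the centroids are fixed by hand (no greedy initialization, no amplify-shrink step, no equal-intensity constraint), and all function names and sample values are hypothetical.

```python
def nearest_centroid(point, centroids):
    """Index of the centroid closest to `point` in squared Euclidean distance."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])),
    )


def histogram(sample, centroids):
    """Count how many points of `sample` fall into each centroid's partition."""
    counts = [0] * len(centroids)
    for point in sample:
        counts[nearest_centroid(point, centroids)] += 1
    return counts


def chi_square_statistic(counts_a, counts_b):
    """Pearson chi-square statistic for a 2 x K contingency table of bin counts."""
    total_a, total_b = sum(counts_a), sum(counts_b)
    if total_a == 0 or total_b == 0:
        raise ValueError("both samples must be non-empty")
    grand = total_a + total_b
    stat = 0.0
    for obs_a, obs_b in zip(counts_a, counts_b):
        column = obs_a + obs_b
        if column == 0:
            continue  # an empty bin contributes nothing
        exp_a = total_a * column / grand
        exp_b = total_b * column / grand
        stat += (obs_a - exp_a) ** 2 / exp_a + (obs_b - exp_b) ** 2 / exp_b
    return stat


# Hypothetical centroids and samples, for illustration only.
centroids = [(0.0, 0.0), (5.0, 5.0)]
reference = [(0, 0), (0, 1), (5, 5), (5, 4)]
drifted = [(5, 5), (5, 4), (4, 5), (6, 5)]

ref_counts = histogram(reference, centroids)
new_counts = histogram(drifted, centroids)
print(ref_counts, new_counts, chi_square_statistic(ref_counts, new_counts))
```

The statistic would be compared against a chi-square distribution with K - 1 degrees of freedom. That approximation is only reliable when expected bin counts are not too small, which is presumably why the abstract ties the number of bins to an asymptotic constraint of the test.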
Non-Exhaustive, Overlapping Co-Clustering: An Extended Analysis
Whang, Joyce Jiyoung, Dhillon, Inderjit S.
The goal of co-clustering is to simultaneously identify a clustering of rows as well as columns of a two-dimensional data matrix. A number of co-clustering techniques have been proposed, including information-theoretic co-clustering and the minimum sum-squared residue co-clustering method. However, most existing co-clustering algorithms are designed to find pairwise disjoint and exhaustive co-clusters, while many real-world datasets contain not only a large overlap between co-clusters but also outliers which should not belong to any co-cluster. In this paper, we formulate the problem of Non-Exhaustive, Overlapping Co-Clustering, where both row and column clusters are allowed to overlap with each other and outliers for each dimension of the data matrix are not assigned to any cluster. To solve this problem, we propose intuitive objective functions and develop an efficient iterative algorithm which we call the NEO-CC algorithm. We theoretically show that the NEO-CC algorithm monotonically decreases the proposed objective functions. Experimental results show that the NEO-CC algorithm is able to effectively capture the underlying co-clustering structure of real-world data, and thus outperforms state-of-the-art clustering and co-clustering methods. This manuscript includes an extended analysis of [21].
Chronnet: a network-based model for spatiotemporal data analysis
Ferreira, Leonardo N., Vega-Oliveros, Didier A., Cotacallapa, Moshe, Cardoso, Manoel F., Quiles, Marcos G., Zhao, Liang, Macau, Elbert E. N.
The amount and size of spatiotemporal data sets from different domains have been rapidly increasing in recent years, which demands the development of robust and fast methods to analyze and extract information from them. In this paper, we propose a network-based model for spatiotemporal data analysis called chronnet. It consists of dividing a geometrical space into grid cells represented by nodes connected chronologically. The main goal of this model is to represent consecutive recurrent events between cells with strong links in the network. This representation permits the use of network science and graph mining tools to extract information from spatiotemporal data. The chronnet construction process is fast, which makes it suitable for large data sets. In this paper, we describe how to use our model considering artificial and real data. For this purpose, we propose an artificial spatiotemporal data set generator to show how chronnets capture not just simple statistics, but also frequent patterns, spatial changes, outliers, and spatiotemporal clusters. Additionally, we analyze a real-world data set composed of global fire detections, in which we describe the frequency of fire events, outlier fire detections, and the seasonal activity, using a single chronnet.
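The construction described above (grid cells as nodes, chronological links, stronger links for recurrent consecutive events) can be sketched as follows. The abstract does not specify edge direction, self-loops, or time windows, so the choices here (directed edges between strictly consecutive events, self-loops kept, square cells) and the function name `build_chronnet` are assumptions for illustration.

```python
from collections import Counter
from math import floor


def build_chronnet(events, cell_size):
    """Build a weighted edge list from events given as (time, x, y) tuples.

    Returns a Counter mapping (source_cell, target_cell) to the number of
    times an event in target_cell directly followed one in source_cell.
    """
    ordered = sorted(events, key=lambda event: event[0])  # chronological order
    # Map each event to the grid cell that contains it.
    cells = [(floor(x / cell_size), floor(y / cell_size)) for _, x, y in ordered]
    edges = Counter()
    for source, target in zip(cells, cells[1:]):
        edges[(source, target)] += 1  # recurrent transitions accumulate weight
    return edges


# Hypothetical events alternating between two neighboring cells.
events = [
    (3, 0.5, 0.5), (1, 0.2, 0.1), (2, 1.5, 0.3),
    (4, 1.1, 0.9), (5, 0.9, 0.4), (6, 1.7, 0.2),
]
edges = build_chronnet(events, cell_size=1.0)
for (source, target), weight in sorted(edges.items()):
    print(source, "->", target, "weight", weight)
```

Construction is a single sort followed by one pass over the events, which is consistent with the abstract's claim that building the network is fast; the resulting weighted edge list can be handed to any graph library for further analysis.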