AITopics | Clustering


Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)


An Analysis of Classical Multidimensional Scaling

Little, Anna, Xie, Yuying, Sun, Qiang

arXiv.org Machine Learning, Jan-15-2019

Classical multidimensional scaling is an important tool for dimension reduction in many applications. Yet few theoretical results characterizing its statistical performance exist. In this paper, we provide a theoretical framework for analyzing the quality of embedded samples produced by classical multidimensional scaling. This lays the foundation for various downstream statistical analyses. As an application, we study its performance in the setting of clustering noisy data. Our results provide scaling conditions on the sample size, ambient dimensionality, between-class distance, and noise level under which classical multidimensional scaling followed by a clustering algorithm can recover the cluster labels of all samples with high probability. Numerical simulations confirm that these scaling conditions are sharp in low, moderate, and high dimensional regimes. Applications to both human RNAseq data and natural language data lend strong support to the methodology and theory.

multidimensional, probability, theorem 3, (15 more...)

arXiv.org Machine Learning

1812.11954

Country:

North America > Canada > Ontario > Toronto (0.14)
Europe > Poland (0.04)
North America > United States > New York (0.04)
(2 more...)

Genre: Research Report > New Finding (0.48)

Industry: Health & Medicine > Therapeutic Area > Oncology (1.00)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.66)
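
The abstract above describes classical multidimensional scaling followed by a clustering algorithm. The following is a minimal sketch of that pipeline, not the authors' implementation: it keeps only the leading eigenpair (a one-dimensional embedding), finds it by power iteration, and uses a hypothetical "split at the largest gap" rule in place of a real clustering algorithm. The function names `classical_mds_1d` and `split_at_largest_gap` are illustrative assumptions.

```python
import math


def classical_mds_1d(dist, iters=500):
    """Embed points on a line from a symmetric distance matrix.

    Classical MDS restricted to the leading eigenpair (an assumption made
    here to keep the sketch dependency-free).
    """
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    # Double centering: B = -1/2 * J * D^2 * J with J = I - (1/n) * ones.
    # D^2 is symmetric, so column means equal row means.
    b = [
        [-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean) for j in range(n)]
        for i in range(n)
    ]
    # Power iteration for the leading eigenvector of B.
    v = [float(i + 1) for i in range(n)]
    for _ in range(iters):
        w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return [0.0] * n
        v = [x / norm for x in w]
    # Rayleigh quotient gives the eigenvalue; coordinates are sqrt(lambda) * v.
    eigenvalue = sum(v[i] * sum(b[i][j] * v[j] for j in range(n)) for i in range(n))
    scale = math.sqrt(max(eigenvalue, 0.0))
    return [scale * x for x in v]


def split_at_largest_gap(coords):
    """Toy two-cluster rule: cut the sorted 1-D embedding at its widest gap."""
    order = sorted(range(len(coords)), key=lambda i: coords[i])
    gaps = [coords[order[k + 1]] - coords[order[k]] for k in range(len(order) - 1)]
    cut = gaps.index(max(gaps))
    labels = [0] * len(coords)
    for idx in order[cut + 1:]:
        labels[idx] = 1
    return labels


# Illustrative data (made up): two well-separated groups on a line.
points = [0.0, 0.2, 0.1, 5.0, 5.1, 4.9]
dist = [[abs(p - q) for q in points] for p in points]
coords = classical_mds_1d(dist)
labels = split_at_largest_gap(coords)
```

In practice a full eigendecomposition and a standard clustering method such as k-means would replace the two simplifications used here.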


A Data-Driven Approach for Discovery of Heat Load Patterns in District Heating

Calikus, Ece, Nowaczyk, Slawomir, Sant'Anna, Anita, Gadd, Henrik, Werner, Sven

arXiv.org Machine Learning, Jan-14-2019

Understanding the heat use of customers is crucial for effective district heating (DH) operations and management. Unfortunately, existing knowledge about customers and their heat load behaviors is scarce, and very few studies have focused on this aspect. The deployment of smart meters offers a unique opportunity for researchers and DH utilities to analyze large-scale data and discover both typical and atypical patterns in the network. Heat load pattern discovery is a challenging task in DH systems, since a comprehensive analysis needs to involve many customers. Most past studies have relied on analysis of a small number of buildings that were not shown to be representative examples. Therefore, the knowledge discovered in such studies does not generalize to the entire network. In this work, we propose a data-driven approach that enables automatic discovery of heat load patterns in a complete district heating network. Our method clusters the buildings into different groups based on the characteristics of their load profiles, extracts the representative patterns for each of them, and detects abnormal profiles, i.e., the ones deviating from the expected behavior. We present the first comprehensive analysis of heat load patterns by conducting a case study on all the buildings, in six customer categories, connected to two district heating networks in the south of Sweden. Our method captured fifteen typical patterns among the heat load profiles of all buildings in our dataset. The results show that control strategies are not enough to explain the variability in heat load behaviors. In conclusion, we demonstrate that the proposed approach has great potential to develop knowledge about customers and their heat use habits in practice by automatically analyzing their typical and atypical profiles at large scale.

heat load pattern, heat load profile, load profile, (12 more...)

arXiv.org Machine Learning

1901.04863

Country:

Europe > Sweden > Skåne County > Helsingborg (0.04)
Europe > Sweden > Halland County > Halmstad (0.04)
Europe > Sweden > Skåne County > Ängelholm (0.04)
(3 more...)

Genre: Research Report > New Finding (0.46)

Industry: Energy > Renewable (1.00)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)


Synthetic Data Generation: A must-have skill for new data scientists

#artificialintelligence, Jan-12-2019, 03:57:53 GMT

Data is the new oil, and truth be told, only a few big players have the strongest hold on that currency. The Googles and Facebooks of this world are so generous with their latest machine learning algorithms and packages (they give those away freely) because the entry barrier to the world of algorithms is pretty low right now. Open source has come a long way from being christened evil by the likes of Steve Ballmer to being an integral part of Microsoft. And plenty of open source initiatives are propelling the vehicles of data science, digital analytics, and machine learning. Standing in 2018, we can safely say that algorithms, programming frameworks, and machine learning packages (or even tutorials and courses on how to learn these techniques) are not the scarce resource; high-quality data is.

artificial intelligence, dataset, machine learning, (13 more...)

#artificialintelligence

Industry: Education (0.52)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.31)


Are Clusterings of Multiple Data Views Independent?

Gao, Lucy L., Bien, Jacob, Witten, Daniela

arXiv.org Machine Learning, Jan-12-2019

In the Pioneer 100 (P100) Wellness Project (Price and others, 2017), multiple types of data are collected on a single set of healthy participants at multiple timepoints in order to characterize and optimize wellness. One way to do this is to identify clusters, or subgroups, among the participants, and then to tailor personalized health recommendations to each subgroup. It is tempting to cluster the participants using all of the data types and timepoints, in order to fully exploit the available information. However, clustering the participants based on multiple data views implicitly assumes that a single underlying clustering of the participants is shared across all data views. If this assumption does not hold, then clustering the participants using multiple data views may lead to spurious results. In this paper, we seek to evaluate the assumption that there is some underlying relationship among the clusterings from the different data views, by asking the question: are the clusters within each data view dependent or independent? We develop a new test for answering this question, which we then apply to clinical, proteomic, and metabolomic data, across two distinct timepoints, from the P100 study. We find that while the subgroups of the participants defined with respect to any single data type seem to be dependent across time, the clustering among the participants based on one data type (e.g. proteomic data) appears not to be associated with the clustering based on another data type (e.g. clinical data).

independence, likelihood ratio test, pseudo likelihood ratio test, (15 more...)

arXiv.org Machine Learning

1901.03905

Country:

North America > United States > California (0.14)
North America > United States > New York (0.04)
North America > Canada (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report > Experimental Study (0.67)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Therapeutic Area > Oncology (0.46)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Biomedical Informatics > Translational Bioinformatics (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.47)


An MBO scheme for clustering and semi-supervised clustering of signed networks

Cucuringu, Mihai, Pizzoferrato, Andrea, van Gennip, Yves

arXiv.org Machine Learning, Jan-10-2019

We introduce a principled method for the signed clustering problem, where the goal is to partition a graph whose edge weights take both positive and negative values, such that edges within the same cluster are mostly positive, while edges spanning across clusters are mostly negative. Our method relies on a graph-based diffuse interface model formulation utilizing the Ginzburg-Landau functional, based on an adaptation of the classic numerical Merriman-Bence-Osher (MBO) scheme for minimizing such graph-based functionals. The proposed objective function aims to minimize the total weight of inter-cluster positively-weighted edges, while maximizing the total weight of the inter-cluster negatively-weighted edges. Our method scales to large sparse networks, and can be easily adjusted to incorporate labelled data information, as is often the case in the context of semi-supervised learning. We tested our method on a number of both synthetic stochastic block models and real-world data sets (including financial correlation matrices), and obtained promising results that compare favourably against a number of state-of-the-art approaches from the recent literature.

graph, matrix, mbo scheme, (15 more...)

arXiv.org Machine Learning

1901.03091

Country:

Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
Europe > United Kingdom > England > Nottinghamshire > Nottingham (0.14)
Europe > United Kingdom > England > Greater London > London (0.04)
(7 more...)

Genre: Research Report > Promising Solution (0.34)

Industry: Banking & Finance > Trading (0.92)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.88)
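
The two quantities named in the abstract above (total weight of positively weighted edges between clusters, to be minimized, and total weight of negatively weighted edges between clusters, to be maximized) can be evaluated for any candidate partition. The sketch below only scores a given labelling; it does not implement the MBO scheme or the Ginzburg-Landau functional. The function name `signed_cut_scores` and the toy graph are assumptions made for illustration.

```python
def signed_cut_scores(edges, labels):
    """Score a partition of a signed graph.

    edges  : iterable of (i, j, weight) with positive or negative weights
    labels : labels[i] is the cluster of node i

    Returns (pos_between, neg_between), where pos_between is the total
    weight of positive edges that cross clusters and neg_between is the
    total absolute weight of negative edges that cross clusters.
    """
    pos_between = 0.0
    neg_between = 0.0
    for i, j, weight in edges:
        if labels[i] == labels[j]:
            continue  # edge lies inside a cluster, so it is not a cut edge
        if weight > 0:
            pos_between += weight
        elif weight < 0:
            neg_between += -weight
    return pos_between, neg_between


# Toy signed graph (made up): nodes 0-2 are mutually friendly, node 3 is not.
edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, -2.0), (0, 3, -1.0), (1, 3, 0.5)]

good = signed_cut_scores(edges, [0, 0, 0, 1])  # (0.5, 3.0)
bad = signed_cut_scores(edges, [0, 1, 0, 1])   # (2.0, 3.0)
```

A partition is better under this criterion when the first score is lower and the second is higher; here the first labelling cuts less positive weight for the same negative weight.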


Dynamic Visualization and Fast Computation for Convex Clustering via Algorithmic Regularization

Weylandt, Michael, Nagorski, John, Allen, Genevera I.

arXiv.org Machine Learning, Jan-10-2019

Convex clustering is a promising new approach to the classical problem of clustering, combining strong performance in empirical studies with rigorous theoretical foundations. Despite these advantages, convex clustering has not been widely adopted, due to its computationally intensive nature and its lack of compelling visualizations. To address these impediments, we introduce Algorithmic Regularization, an innovative technique for obtaining high-quality estimates of regularization paths using an iterative one-step approximation scheme. We justify our approach with a novel theoretical result, guaranteeing global convergence of the approximate path to the exact solution under easily-checked non-data-dependent assumptions. The application of algorithmic regularization to convex clustering yields the Convex Clustering via Algorithmic Regularization Paths (CARP) algorithm for computing the clustering solution path. On example data sets from genomics and text analysis, CARP delivers over a 100-fold speed-up over existing methods, while attaining a finer approximation grid than standard methods. Furthermore, CARP enables improved visualization of clustering solutions: the fine solution grid returned by CARP can be used to construct a convex clustering-based dendrogram, as well as forming the basis of a dynamic path-wise visualization based on modern web technologies. Our methods are implemented in the open-source R package clustRviz, available at https://github.com/DataSlingers/clustRviz.

algorithm, clustering, convex, (14 more...)

arXiv.org Machine Learning

1901.01477

Country:

North America > United States > Washington > King County > Bellevue (0.04)
North America > Canada > Quebec > Montreal (0.04)
Europe > France > Provence-Alpes-Côte d'Azur > Alpes-Maritimes > Nice (0.04)
(9 more...)

Genre: Research Report > New Finding (0.46)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Government > Regional Government > North America Government > United States Government (0.92)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.95)
Information Technology > Data Science (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.93)


Spectral Clustering via Ensemble Deep Autoencoder Learning (SC-EDAE)

Affeldt, Severine, Labiod, Lazhar, Nadif, Mohamed

arXiv.org Machine Learning, Jan-8-2019

Abstract: Recently, a number of works have studied clustering strategies that combine classical clustering algorithms and deep learning methods. These approaches follow either a sequential way, where a deep representation is learned using a deep autoencoder before obtaining clusters with k-means, or a simultaneous way, where the deep representation and clusters are learned jointly by optimizing a single objective function. Both strategies improve clustering performance; however, the robustness of these approaches is impeded by several deep autoencoder setting issues, among which are the weight initialization, the width and number of layers, and the number of epochs. To alleviate the impact of such hyperparameter settings on the clustering performance, we propose a new model which combines the strengths of spectral clustering and deep autoencoders in an ensemble learning framework. Extensive experiments on various benchmark datasets demonstrate the potential and robustness of our approach compared to state-of-the-art deep clustering methods.

I. INTRODUCTION

Learning from large amounts of data is a very challenging task. Several dimensionality reduction and clustering techniques that are well studied in the literature aim to learn a suitable and simplified data representation from the original dataset; see for instance [1-3]. While many approaches have been proposed to address the dimensionality reduction and clustering tasks, deep learning-based methods have recently demonstrated promising results.

autoencoder, dataset, matrix, (16 more...)

arXiv.org Machine Learning

1901.02291

Country:

Oceania > Australia > New South Wales > Sydney (0.04)
North America > United States > New York (0.04)
Asia > Middle East > Jordan (0.04)
Asia > Japan > Honshū > Kansai > Kyoto Prefecture > Kyoto (0.04)

Genre: Research Report (0.64)

Industry: Government > Regional Government > North America Government > United States Government (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)


Understanding partition comparison indices based on counting object pairs

Warrens, Matthijs J., van der Hoef, Hanneke

arXiv.org Machine Learning, Jan-7-2019

For example, in unsupervised machine learning, to evaluate the performance of a clustering method, researchers typically assess agreement between a reference standard partition that purports to represent the true cluster structure of the objects (gold standard), and a trial partition produced by the method that is being evaluated (Wallace 1983; Halkidi, Batiskis and Vazirgiannis 2002; Jain 2010). High agreement between the two partitions may indicate good recovery of the true cluster structure. Agreement between partitions can be assessed with so-called external validity indices (Albatineh, Niewiadomska-Bugaj and Mihalko 2006; Brun et al. 2007; Warrens 2008a, 2008b; Pfitzner et al. 2009). External validity indices can be roughly categorized into three approaches, namely 1) counting object pairs, 2) information theory (Vinh, Epps and Bailey 2010; Lei et al. 2016), and 3) matching sets (Rezaei and Fränti 2016). Most external validity indices are of the pair-counting approach, which is based on counting pairs of objects placed in identical and different clusters.

agreement, partition, rand index, (14 more...)

arXiv.org Machine Learning

1901.01777

Country:

North America > United States > New York (0.04)
North America > United States > Massachusetts > Suffolk County > Boston (0.04)
North America > United States > Kansas (0.04)
(7 more...)

Genre: Research Report (0.40)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
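
The pair-counting approach described in the excerpt above can be illustrated with the Rand index, which appears among this entry's keywords. The sketch below is a plain illustration rather than code from the paper: it counts object pairs by whether each partition places them together or apart, then reports the fraction of pairs on which the two partitions agree. The function names and the two toy partitions are assumptions.

```python
from itertools import combinations


def pair_counts(reference, trial):
    """Count object pairs by co-membership in two partitions.

    Returns (n11, n10, n01, n00):
      n11: together in both partitions
      n10: together in reference, apart in trial
      n01: apart in reference, together in trial
      n00: apart in both partitions
    """
    n11 = n10 = n01 = n00 = 0
    for i, j in combinations(range(len(reference)), 2):
        same_ref = reference[i] == reference[j]
        same_trial = trial[i] == trial[j]
        if same_ref and same_trial:
            n11 += 1
        elif same_ref:
            n10 += 1
        elif same_trial:
            n01 += 1
        else:
            n00 += 1
    return n11, n10, n01, n00


def rand_index(reference, trial):
    """Fraction of object pairs on which the two partitions agree."""
    n11, n10, n01, n00 = pair_counts(reference, trial)
    total = n11 + n10 + n01 + n00
    return (n11 + n00) / total if total else 1.0


# Toy partitions of four objects (made up for illustration).
reference = [0, 0, 1, 1]
trial = [0, 0, 0, 1]
print(pair_counts(reference, trial))  # (1, 1, 2, 2)
print(rand_index(reference, trial))   # 0.5
```

Cluster labels are compared only for equality, so the index is unaffected by how the clusters are named in either partition.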


Self-Expressive Subspace Clustering to Recognize Motion Dynamics of a Multi-Joint Coordination for Chronic Ankle Instability

Qian, Shaodi, Yen, Sheng-Che, Folmar, Eric, Chou, Chun-An

arXiv.org Machine Learning, Jan-6-2019

Ankle sprains and instability are major public health concerns. Up to 70% of individuals do not fully recover from a single ankle sprain and eventually develop chronic ankle instability (CAI). The diagnosis of CAI has been mainly based on self-report rather than objective biomechanical measures. The goal of this study is to quantitatively recognize the motion pattern of a multi-joint coordination using biosensor data from bilateral hip, knee, and ankle joints, and further distinguish between CAI and healthy cohorts. We propose an analytic framework, where a nonlinear subspace clustering method is developed to learn the motion dynamic patterns from an inter-connected network of multiple joints. A support vector machine model is trained with leave-one-subject-out cross-validation to validate the learned measures compared to traditional statistical measures. The computational results showed >70% classification accuracy on average based on the dataset of 48 subjects (25 with CAI and 23 normal controls) examined in our designed experiment. It is found that CAI can be observed from other joints (e.g., hips) significantly, which reflects the fact that there are interactions in the multi-joint coordination system. The developed method presents a potential to support decisions with motion patterns during diagnosis, treatment, and rehabilitation of gait abnormality caused by physical injury (e.g., ankle sprains in this study) or even central nervous system disorders.

chronic ankle instability, instability, subspace, (14 more...)

arXiv.org Machine Learning

1901.01558

Country:

North America > United States > Massachusetts > Suffolk County > Boston (0.04)
North America > United States > New York (0.04)
Europe > Sweden (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Health & Medicine > Therapeutic Area > Neurology (0.88)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.87)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.67)
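
The leave-one-subject-out cross-validation mentioned in the abstract above holds out all samples of one subject per fold, so no subject contributes to both training and testing. The sketch below shows only the splitting step, under the assumption that each sample carries a subject identifier; the support vector machine itself is omitted. The function name `leave_one_subject_out` and the identifiers are hypothetical.

```python
def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) per subject.

    subject_ids[k] is the subject that sample k belongs to.
    """
    for held_out in sorted(set(subject_ids)):
        test_idx = [k for k, s in enumerate(subject_ids) if s == held_out]
        train_idx = [k for k, s in enumerate(subject_ids) if s != held_out]
        yield held_out, train_idx, test_idx


# Hypothetical sample-to-subject assignment, for illustration only.
subject_ids = ["s1", "s1", "s2", "s3", "s3"]
splits = list(leave_one_subject_out(subject_ids))
# First fold: subject "s1" held out, samples 0 and 1 form the test set.
```

A classifier would be fitted on the training indices of each fold and scored on the held-out subject, with the per-fold scores averaged afterwards.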


Combining Unsupervised and Supervised Learning for Asset Class Failure Prediction in Power Systems

Dong, Ming, Grumbach, L. S.

arXiv.org Machine Learning, Jan-5-2019

Abstract: In power systems, an asset class is a group of power equipment that has the same function and shares similar electrical or mechanical characteristics. Predicting failures for different asset classes is critical for electric utilities in developing cost-effective asset management strategies. Previously, the physical-age-based Weibull distribution has been widely used for failure prediction. However, this mathematical model cannot incorporate asset condition data such as inspection or testing results. As a result, the prediction cannot be very specific or accurate for individual assets. To solve this important problem, this paper proposes a novel and comprehensive data-driven approach based on asset condition data: K-means clustering as an unsupervised learning method is used to analyze the inner structure of historical asset condition data and produce the asset conditional ages; logistic regression as a supervised learning method takes in both asset physical ages and conditional ages to classify and predict asset statuses. Furthermore, an index called average aging rate is defined to quantify, track, and estimate the relationship between asset physical age and conditional age. This approach was applied to an urban distribution system in West Canada to predict medium-voltage cable failures. Case studies and a comparison with the standard Weibull distribution are provided. The proposed approach demonstrates superior performance and practicality for predicting asset class failures in power systems.

I. INTRODUCTION

Today, more and more electric utilities are mandated by regulators to develop cost-effective long-term asset management strategies to reduce overall cost while maintaining system reliability [1-2]. Sophisticated and optimal asset management strategies can only be established based on the accurate prediction of asset failures in the future.

asset condition data, condition data, conditional age, (13 more...)

arXiv.org Machine Learning

1901.01985

Country:

North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.14)
North America > United States > New York (0.04)
North America > United States > Texas > Dallas County > Dallas (0.04)
(4 more...)

Genre: Research Report > New Finding (0.35)

Industry: Energy > Power Industry > Utilities (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.67)
