AITopics | Clustering

Collaborating Authors

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

Query-oriented text summarization based on hypergraph transversals

Van Lierde, Hadrien, Chow, Tommy W. S.

arXiv.org Artificial IntelligenceFeb-2-2019

Existing graph- and hypergraph-based algorithms for document summarization represent the sentences of a corpus as the nodes of a graph or a hypergraph in which the edges represent relationships of lexical similarities between sentences. Each sentence of the corpus is then scored individually, using popular node ranking algorithms, and a summary is produced by extracting highly scored sentences. This approach fails to select a subset of jointly relevant sentences and it may produce redundant summaries that are missing important topics of the corpus. To alleviate this issue, a new hypergraph-based summarizer is proposed in this paper, in which each node is a sentence and each hyperedge is a theme, namely a group of sentences sharing a topic. Themes are weighted in terms of their prominence in the corpus and their relevance to a user-defined query. It is further shown that the problem of identifying a subset of sentences covering the relevant themes of the corpus is equivalent to that of finding a hypergraph transversal in our theme-based hypergraph. Two extensions of the notion of hypergraph transversal are proposed for the purpose of summarization, and polynomial time algorithms building on the theory of submodular functions are proposed for solving the associated discrete optimization problems. The worst-case time complexity of the proposed algorithms is squared in the number of terms, which makes it cheaper than the existing hypergraph-based methods. A thorough comparative analysis with related models on DUC benchmark datasets demonstrates the effectiveness of our approach, which outperforms existing graph- or hypergraph-based methods by at least 6% of ROUGE-SU4 score.

data mining, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

1902.00672

Country: Asia > China (0.28)

Genre:

Research Report (0.50)
Workflow (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.93)

Add feedback

A Spatial-Temporal Decomposition Based Deep Neural Network for Time Series Forecasting

Asadi, Reza, Regan, Amelia

arXiv.org Machine LearningFeb-1-2019

Spatial time series forecasting problems arise in a broad range of applications, such as environmental and transportation problems. These problems are challenging because of the existence of specific spatial, short-term and long-term patterns, and the curse of dimensionality. In this paper, we propose a deep neural network framework for large-scale spatial time series forecasting problems. We explicitly designed the neural network architecture for capturing various types of patterns. In preprocessing, a time series decomposition method is applied to separately feed short-term, long-term and spatial patterns into different components of a neural network. A fuzzy clustering method finds cluster of neighboring time series based on similarity of time series residuals; as they can be meaningful short-term patterns for spatial time series. In neural network architecture, each kernel of a multi-kernel convolution layer is applied to a cluster of time series to extract short-term features in neighboring areas. The output of convolution layer is concatenated by trends and followed by convolution-LSTM layer to capture long-term patterns in larger regional areas. To make a robust prediction when faced with missing data, an unsupervised pretrained denoising autoencoder reconstructs the output of the model in a fine-tuning step. The experimental results illustrate the model outperforms baseline and state of the art models in a traffic flow prediction dataset.

neural network, sensor, time sery, (12 more...)

arXiv.org Machine Learning

1902.00636

Country:

North America > Trinidad and Tobago > Trinidad > Arima > Arima (0.04)
North America > United States > California > Orange County > Irvine (0.04)
Asia (0.04)

Genre: Research Report (0.84)

Industry:

Energy (0.93)
Transportation > Infrastructure & Services (0.47)
Consumer Products & Services > Travel (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Normalized Wasserstein Distance for Mixture Distributions with Applications in Adversarial Learning and Domain Adaptation

Balaji, Yogesh, Chellappa, Rama, Feizi, Soheil

arXiv.org Machine LearningFeb-1-2019

Understanding proper distance measures between distributions is at the core of several learning tasks such as generative models, domain adaptation, clustering, etc. In this work, we focus on {\it mixture distributions} that arise naturally in several application domains where the data contains different sub-populations. For mixture distributions, established distance measures such as the Wasserstein distance do not take into account imbalanced mixture proportions. Thus, even if two mixture distributions have identical mixture components but different mixture proportions, the Wasserstein distance between them will be large. This often leads to undesired results in distance-based learning methods for mixture distributions. In this paper, we resolve this issue by introducing {\it Normalized Wasserstein} distance. The key idea is to introduce mixture proportions as optimization variables, effectively normalizing mixture proportions in the Wasserstein formulation. Using the proposed normalized Wasserstein distance, instead of the vanilla one, leads to significant gains working with mixture distributions with imbalanced mixture proportions. We demonstrate effectiveness of the proposed distance in GANs, domain adaptation, adversarial clustering and hypothesis testing over mixture of Gaussians, MNIST, CIFAR-10, CelebA and VISDA datasets.

dataset, proportion, wasserstein distance, (16 more...)

arXiv.org Machine Learning

1902.00415

Country:

North America > United States > Maryland > Prince George's County > College Park (0.04)
Asia > Middle East > Jordan (0.04)
North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
(2 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Add feedback

A Novel Initial Clusters Generation Method for K-means-based Clustering Algorithms for Mixed Datasets

Ahmad, Amir, Khan, Shehroz S.

arXiv.org Machine LearningJan-31-2019

Mixed datasets consist of numeric and categorical attributes. Various K-means-based clustering algorithms have been developed to cluster these datasets. Generally, these clustering algorithms use random initial clusters which in turn produce different clustering results in different runs. A few cluster initialisation methods have been developed to compute initial clusters, however, they are either computationally expensive or they do not create the same clustering results in different runs. In this paper, we propose a novel approach to find initial clusters for K-means-based clustering algorithms for mixed datasets. The proposed approach is based on the observation that some data points in datasets remain in the same clusters created by K-means-based clustering algorithm irrespective of the choice of initial clusters. It is proposed that individual attribute information can be used to create initial clusters. A K-means-based clustering algorithm is run many times, in each run one of the attributes is used to create initial clusters. The clustering results of various runs are combined to produce a clustering result. This clustering result is used as initial clusters for a K-means-based clustering algorithm. Experiments with various categorical and mixed datasets showed that the proposed clustering approach produced accurate and consistent results.

algorithm, dataset, initial cluster method, (13 more...)

arXiv.org Machine Learning

doi: 10.13140/RG.2.2.21979.62244

1902.00127

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
North America > Canada > Ontario > Toronto (0.04)
Europe > Ireland (0.04)
(7 more...)

Genre: Research Report > New Finding (0.67)

Industry: Health & Medicine (0.47)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Generalized Dirichlet-process-means for f-separable distortion measures

Kobayashi, Masahiro, Watanabe, Kazuho

arXiv.org Machine LearningJan-31-2019

DP-means clustering was obtained as an extension of K-means clustering. While it is implemented with a simple and efficient algorithm, it can estimate the number of clusters simultaneously. However, DP-means is specifically designed for the average distortion measure. Therefore, it is vulnerable to outliers in data, and it can cause large maximum distortion in clusters. In this work, we extend the objective function of the DP-means to f-separable distortion measures and propose a unified learning algorithm to overcome the above problems by the selection of the function f. Furthermore, the influence function of the estimated cluster center is analyzed to evaluate the robustness against outliers. We show the effectiveness of the generalized method by numerical experiments using real datasets.

distortion, distortion measure, objective function, (15 more...)

arXiv.org Machine Learning

1901.11331

Country: Asia > Japan (0.04)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.69)

Add feedback

7 Steps to Mastering Basic Machine Learning with Python -- 2019 Edition

#artificialintelligenceJan-30-2019, 01:37:51 GMT

Then read Michael J. Garbade's Understanding K-means Clustering in Machine Learning and implement k-means for yourself. Then take a look at Gabriel Pierobon's DBSCAN clustering for data shapes k-means can't handle well (in Python) to implement a density-based clustering model. Now that we have sampled around, let's switch gears back to classification and check out a more complex algorithm.

artificial intelligence, machine learning, python, (13 more...)

#artificialintelligence

Genre: Instructional Material > Course Syllabus & Notes (0.49)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

catch22: CAnonical Time-series CHaracteristics

Lubba, Carl H, Sethi, Sarab S, Knaute, Philip, Schultz, Simon R, Fulcher, Ben D, Jones, Nick S

arXiv.org Machine LearningJan-30-2019

Capturing the dynamical properties of time series concisely as interpretable feature vectors can enable efficient clustering and classification for time-series applications across science and industry. Selecting an appropriate feature-based representation of time series for a given application can be achieved through systematic comparison across a comprehensive time-series feature library, such as those in the hctsa toolbox. However, this approach is computationally expensive and involves evaluating many similar features, limiting the widespread adoption of feature-based representations of time series for real-world applications. In this work, we introduce a method to infer small sets of time-series features that (i) exhibit strong classification performance across a given collection of time-series problems, and (ii) are minimally redundant. Applying our method to a set of 93 time-series classification datasets (containing over 147000 time series) and using a filtered version of the hctsa feature library (4791 features), we introduce a generically useful set of 22 CAnonical Time-series CHaracteristics, catch22. This dimensionality reduction, from 4791 to 22, is associated with an approximately 1000-fold reduction in computation time and near linear scaling with time-series length, despite an average reduction in classification accuracy of just 7%. catch22 captures a diverse and interpretable signature of time series in terms of their properties, including linear and non-linear autocorrelation, successive differences, value distributions and outliers, and fluctuation scaling properties. We provide an efficient implementation of catch22, accessible from many programming environments, that facilitates feature-based time-series analysis for scientific, industrial, financial and medical applications using a common language of interpretable time-series properties.

accuracy, dataset, time sery, (13 more...)

arXiv.org Machine Learning

1901.102

Country:

Asia > Middle East > UAE (0.04)
Oceania > Australia (0.04)
North America > United States > Massachusetts > Middlesex County > Natick (0.04)
(3 more...)

Genre: Research Report (1.00)

Industry: Health & Medicine (0.66)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.35)

Add feedback

Splitting Methods for Convex Bi-Clustering and Co-Clustering

Weylandt, Michael

arXiv.org Machine LearningJan-30-2019

Co-Clustering, the problem of simultaneously identifying clusters across multiple aspects of a data set, is a natural generalization of clustering to higher-order structured data. Recent convex formulations of bi-clustering and tensor co-clustering, which shrink estimated centroids together using a convex fusion penalty, allow for global optimality guarantees and precise theoretical analysis, but their computational properties have been less well studied. In this note, we present three efficient operator-splitting methods for the convex co-clustering problem: a standard two-block ADMM, a Generalized ADMM which avoids an expensive tensor Sylvester equation in the primal update, and a three-block ADMM based on the operator splitting scheme of Davis and Yin. Theoretical complexity analysis suggests, and experimental evidence confirms, that the Generalized ADMM is far more efficient for large problems.

admm, col, generalized admm, (14 more...)

arXiv.org Machine Learning

1901.06075

Country:

North America > United States > Texas > Harris County > Houston (0.04)
North America > United States > Georgia > Fulton County > Atlanta (0.04)
North America > Canada > Quebec > Montreal (0.04)
Europe > France > Provence-Alpes-Côte d'Azur > Alpes-Maritimes > Nice (0.04)

Genre: Research Report (0.82)

Industry: Health & Medicine (0.95)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.35)

Add feedback

Clustering with Jointly Learned Nonlinear Transforms Over Discriminating Min-Max Similarity/Dissimilarity Assignment

Kostadinov, Dimche, Razeghi, Behrooz, Holotyak, Taras, Voloshynovskiy, Slava

arXiv.org Machine LearningJan-30-2019

This paper presents a novel clustering concept that is based on jointly learned nonlinear transforms (NTs) with priors on the information loss and the discrimination. We introduce a clustering principle that is based on evaluation of a parametric min-max measure for the discriminative prior. The decomposition of the prior measure allows to break down the assignment into two steps. In the first step, we apply NTs to a data point in order to produce candidate NT representations. In the second step, we preform the actual assignment by evaluating the parametric measure over the candidate NT representations. Numerical experiments on image clustering task validate the potential of the proposed approach. The evaluation shows advantages in comparison to the state-of-the-art clustering methods.

assignment, nt representation, representation, (15 more...)

arXiv.org Machine Learning

1901.1076

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
North America > United States > New York > New York County > New York City (0.04)
Europe > Switzerland > Geneva > Geneva (0.04)
(4 more...)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.66)

Add feedback

Predictive Maintenance in Photovoltaic Plants with a Big Data Approach

Betti, Alessandro, Trovato, Maria Luisa Lo, Leonardi, Fabio Salvatore, Leotta, Giuseppe, Ruffini, Fabrizio, Lanzetta, Ciro

arXiv.org Machine LearningJan-29-2019

Fault prediction is offered at two different levels based on a data-driven approach: (a) generic fault/status prediction and (b) specific fault class prediction, implemented by means of two different machine learning based modules built on an unsupervised clustering algorithm and a Pattern Recognition Neural Network, respectively. Model has been assessed on a park of six photovoltaic (PV) plants up to 10 MW and on more than one hundred inverter modules of three different technology brands. The results indicate that the proposed method is effective in (a) predicting incipient generic faults up to 7 days in advance with sensitivity up to 95% and (b) anticipating damage of specific fault classes with times ranging from few hours up to 7 days. The model is easily deployable for online monitoring of anomalies on new PV plants and technologies, requiring only the availability of historical SCADA and fault data, fault taxonomy and inverter electrical datasheet. Keywords: Data Mining, Fault Prediction, Inverter Module, Key Performance Indicator, Lost Production 1 INTRODUCTION The provision of a Preventive Maintenance strategy is emerging nowadays as an essential field to keep high technical and economic performances of solar PV plants over time [1]. Analytical monitoring systems have been installed therefore worldwide to timely detect possible malfunctions through the assessment of PV system performances [2-3].

classification metric, pv plant, sensitivity, (14 more...)

arXiv.org Machine Learning

doi: 10.4229/EUPVSEC20172017-6DP.2.4

1901.10855

Country:

Europe > Greece (0.05)
Europe > Romania (0.05)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
Europe > Italy > Lazio > Rome (0.04)

Genre: Research Report (0.82)

Industry: Energy > Renewable > Solar (1.00)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.54)

Add feedback