AITopics | Clustering

Collaborating Authors

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

Inverted-File k-Means Clustering: Performance Analysis

Aoyama, Kazuo, Saito, Kazumi, Ikeda, Tetsuo

arXiv.org Machine LearningFeb-20-2020

This paper presents an inverted-file k-means clustering algorithm (IVF) suitable for a large-scale sparse data set with potentially numerous classes. Given such a data set, IVF efficiently works at high-speed and with low memory consumption, which keeps the same solution as a standard Lloyd's algorithm. The high performance arises from two distinct data representations. One is a sparse expression for both the object and mean feature vectors. The other is an inverted-file data structure for a set of the mean feature vectors. To confirm the effect of these representations, we design three algorithms using distinct data structures and expressions for comparison. We experimentally demonstrate that IVF achieves better performance than the designed algorithms when they are applied to large-scale real document data sets in a modern computer system equipped with superscalar out-of-order processors and a deep hierarchical memory system. We also introduce a simple yet practical clock-cycle per instruction (CPI) model for speed-performance analysis. Analytical results reveal that IVF suppresses three performance degradation factors: the numbers of cache misses, branch mispredictions, and the completed instructions.

algorithm, feature vector, instruction, (15 more...)

arXiv.org Machine Learning

2002.09094

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
Asia > Japan > Honshū > Kantō > Kanagawa Prefecture (0.04)
Asia > Japan > Honshū > Chūbu > Shizuoka Prefecture > Shizuoka (0.04)
(6 more...)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Embedding Graph Auto-Encoder with Joint Clustering via Adjacency Sharing

Li, Xuelong, Zhang, Hongyuan, Zhang, Rui

arXiv.org Machine LearningFeb-20-2020

Graph convolution networks have attracted many attentions and several graph auto-encoder based clustering models are developed for attributed graph clustering. However, most existing approaches separate clustering and optimization of graph auto-encoder into two individual steps. In this paper, we propose a graph convolution network based clustering model, namely, Embedding Graph Auto-Encoder with JOint Clustering via Adjacency Sharing (\textit{EGAE-JOCAS}). As for the embedded model, we develop a novel joint clustering method, which combines relaxed k-means and spectral clustering and is applicable for the learned embedding. The proposed joint clustering shares the same adjacency within graph convolution layers. Two parts are optimized simultaneously through performing SGD and taking close-form solutions alternatively to ensure a rapid convergence. Moreover, our model is free to incorporate any mechanisms (e.g., attention) into graph auto-encoder. Extensive experiments are conducted to prove the superiority of EGAE-JOCAS. Sufficient theoretical analyses are provided to support the results.

egae-joca, latexit latexit sha1, qvepcji1z kzx jts1bwx8mpn6bywzekaiujet, (14 more...)

arXiv.org Machine Learning

2002.08643

Country: Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)

Add feedback

A Scalable Framework for Sparse Clustering Without Shrinkage

Zhang, Zhiyue, Lange, Kenneth, Xu, Jason

arXiv.org Machine LearningFeb-19-2020

Clustering, a fundamental activity in unsupervised learning, is notoriously difficult when the feature space is high-dimensional. Fortunately, in many realistic scenarios, only a handful of features are relevant in distinguishing clusters. This has motivated the development of sparse clustering techniques that typically rely on k-means within outer algorithms of high computational complexity. Current techniques also require careful tuning of shrinkage parameters, further limiting their scalability. In this paper, we propose a novel framework for sparse k-means clustering that is intuitive, simple to implement, and competitive with state-of-the-art algorithms. We show that our algorithm enjoys consistency and convergence guarantees. Our core method readily generalizes to several task-specific algorithms such as clustering on subsets of attributes and in partially observed data settings. We showcase these contributions via simulated experiments and benchmark datasets, as well as a case study on mouse protein expression.

algorithm, informative feature, skfr, (15 more...)

arXiv.org Machine Learning

2002.08541

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.14)
North America > United States > California > Los Angeles County > Long Beach (0.04)

Genre: Research Report (1.00)

Industry: Health & Medicine (0.88)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Entrywise convergence of iterative methods for eigenproblems

Charisopoulos, Vasileios, Benson, Austin R., Damle, Anil

arXiv.org Machine LearningFeb-19-2020

Several problems in machine learning, statistics, and other fields rely on computing eigenvectors. For large scale problems, the computation of these eigenvectors is typically performed via iterative schemes such as subspace iteration or Krylov methods. While there is classical and comprehensive analysis for subspace convergence guarantees with respect to the spectral norm, in many modern applications other notions of subspace distance are more appropriate. Recent theoretical work has focused on perturbations of subspaces measured in the $\ell_{2 \to \infty}$ norm, but does not consider the actual computation of eigenvectors. Here we address the convergence of subspace iteration when distances are measured in the $\ell_{2 \to \infty}$ norm and provide deterministic bounds. We complement our analysis with a practical stopping criterion and demonstrate its applicability via numerical experiments. Our results show that one can get comparable performance on downstream tasks while requiring fewer iterations, thereby saving substantial computational time.

dist 2, eigenvector, matrix, (17 more...)

arXiv.org Machine Learning

2002.08491

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
North America > United States > New York > Tompkins County > Ithaca (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Europe > France > Occitanie > Haute-Garonne > Toulouse (0.04)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Add feedback

A Fixed point view: A Model-Based Clustering Framework

Ding, Jianhao, Han, Lansheng

arXiv.org Machine LearningFeb-19-2020

However, not all of the data are representative and meaningful, so the analysis and disposal of large-scale data occupies an increasingly important position in scientific research and social life [1]. Cluster analysis is an important unsupervised learning method in machine learning. Its basic idea is grouping a set of objects into clusters, in a way that objects in the same cluster share more similarity than those from separated clusters, in terms of distances of a certain space. In the evolution of clustering, due to the differences of data types and clustering strategies, cluster analysis can be divided into two main branches, namely, traditional clustering algorithms and modern clustering algorithms. Traditional clustering algorithms include clustering algorithm based on partition, density, model, fuzzy theory and so on [2, 3].

algorithm, contraction map, iteration, (17 more...)

arXiv.org Machine Learning

2002.08032

Country:

North America > United States > Massachusetts > Suffolk County > Boston (0.04)
North America > United States > Virginia > Arlington County > Arlington (0.04)
Asia > China > Hubei Province > Wuhan (0.04)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Hierarchical Correlation Clustering and Tree Preserving Embedding

Chehreghani, Morteza Haghir

arXiv.org Machine LearningFeb-18-2020

We propose a hierarchical correlation clustering method that extends the well-known correlation clustering to produce hierarchical clusters. We then investigate embedding the respective hierarchy to be used for (tree preserving) embedding and feature extraction. We study the connection of such an embedding to single linkage embedding and minimax distances, and in particular study minimax distances for correlation clustering. Finally, we demonstrate the performance of our methods on several UCI and 20 newsgroup datasets.

correlation, data mining, machine learning, (16 more...)

arXiv.org Machine Learning

2002.07756

Country:

Asia > Afghanistan > Parwan Province > Charikar (0.04)
North America > United States > Kansas (0.04)
Europe > Sweden > Vaestra Goetaland > Gothenburg (0.04)
(2 more...)

Genre: Research Report (0.82)

Industry: Health & Medicine > Therapeutic Area (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Network Clustering Via Kernel-ARMA Modeling and the Grassmannian The Brain-Network Case

Ye, Cong, Slavakis, Konstantinos, Patil, Pratik V., Nakuci, Johan, Muldoon, Sarah F., Medaglia, John

arXiv.org Machine LearningFeb-18-2020

Background Network clustering is the task of assigning nodes to groups via user-defined (statistical) "similarities" among nodal time series (signals), and is ubiquitous across a plethora of disciplines such as computer vision [1], wireless-sensor [2], social [3] and brain networks [4]. In brain networks, the choice of scale and type of data determine how networks are built. At the microscopic level, network nodes might be neurons, and edges could represent anatomical connections such as synapses (structural connectivity), or statistical relationships between firing patterns of neurons (functional connectivity). Similarly, at the macroscopic level, nodes can represent brain regions. At this scale, in structural networks, edges might represent long range anatomical connections between brain regions or, in functional networks, statistical relationships between regional brain dynamics recorded via functional Magnetic Resonance Imaging (fMRI) or encephalopathy (EEG). Here, we are interested in functional brain networks in which network nodes represent brain regions whose activity can be represented by a time series describing the dynamic evolution of brain activity.[5];

community detection, grassmannian, time sery, (12 more...)

arXiv.org Machine Learning

2002.09943

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
North America > United States > Pennsylvania (0.04)
North America > United States > New York > Erie County > Buffalo (0.04)
Asia > Middle East > Israel (0.04)

Genre: Research Report (0.82)

Industry:

Health & Medicine > Therapeutic Area > Neurology (1.00)
Health & Medicine > Health Care Technology (1.00)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Communications > Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(2 more...)

Add feedback

t-viSNE: Interactive Assessment and Interpretation of t-SNE Projections

Chatzimparmpas, Angelos, Martins, Rafael Messias, Kerren, Andreas

arXiv.org Machine LearningFeb-17-2020

t-Distributed Stochastic Neighbor Embedding (t-SNE) for the visualization of multidimensional data has proven to be a popular approach, with successful applications in a wide range of domains. Despite their usefulness, t-SNE projections can be hard to interpret or even misleading, which hurts the trustworthiness of the results. Understanding the details of t-SNE itself and the reasons behind specific patterns in its output may be a daunting task, especially for non-experts in dimensionality reduction. In this work, we present t-viSNE, an interactive tool for the visual exploration of t-SNE projections that enables analysts to inspect different aspects of their accuracy and meaning, such as the effects of hyper-parameters, distance and neighborhood preservation, densities and costs of specific neighborhoods, and the correlations between dimensions and visual patterns. We propose a coherent, accessible, and well-integrated collection of different views for the visualization of t-SNE projections. The applicability and usability of t-viSNE are demonstrated through hypothetical usage scenarios with real data sets. Finally, we present the results of a user study where the tool's effectiveness was evaluated. By bringing to light information that would normally be lost after running t-SNE, we hope to support analysts in using t-SNE and making its results better understandable.

dimension, neighborhood, projection, (15 more...)

arXiv.org Machine Learning

2002.0691

Country:

North America > United States > Wisconsin (0.04)
Europe > Portugal > Coimbra > Coimbra (0.04)
Europe > Sweden > Kronoberg County > Växjö (0.04)
(2 more...)

Genre:

Research Report > New Finding (0.92)
Questionnaire & Opinion Survey (0.86)

Industry: Health & Medicine > Therapeutic Area > Oncology (1.00)

Technology:

Information Technology > Visualization (1.00)
Information Technology > Human Computer Interaction (1.00)
Information Technology > Data Science > Data Mining (1.00)
(2 more...)

Add feedback

Dimensionality Reduction and Motion Clustering during Activities of Daily Living: 3, 4, and 7 Degree-of-Freedom Arm Movements

Gloumakov, Yuri, Spiers, Adam J., Dollar, Aaron M.

arXiv.org Machine LearningFeb-16-2020

The wide variety of motions performed by the human arm during daily tasks makes it desirable to find representative subsets to reduce the dimensionality of these movements for a variety of applications, including the design and control of robotic and prosthetic devices. This paper presents a novel method and the results of an extensive human subjects study to obtain representative arm joint angle trajectories that span naturalistic motions during Activities of Daily Living (ADLs). In particular, we seek to identify sets of useful motion trajectories of the upper limb that are functions of a single variable, allowing, for instance, an entire prosthetic or robotic arm to be controlled with a single input from a user, along with a means to select between motions for different tasks. Data driven approaches are used to obtain clusters as well as representative motion averages for the full-arm 7 degree of freedom (DOF), elbow-wrist 4 DOF, and wrist-only 3 DOF motions. The proposed method makes use of well-known techniques such as dynamic time warping (DTW) to obtain a divergence measure between motion segments, DTW barycenter averaging (DBA) to obtain averages, Ward's distance criterion to build hierarchical trees, batch-DTW to simultaneously align multiple motion data, and functional principal component analysis (fPCA) to evaluate cluster variability. The clusters that emerge associate various recorded motions into primarily hand start and end location for the full-arm system, motion direction for the wrist-only system, and an intermediate between the two qualities for the elbow-wrist system. The proposed clustering methodology is justified by comparing results against alternative approaches.

dof model, motion segment, trajectory, (13 more...)

arXiv.org Machine Learning

2003.02641

Country:

North America > United States > New York (0.04)
North America > United States > Connecticut > New Haven County > New Haven (0.04)
North America > Canada > Quebec (0.04)
(2 more...)

Genre: Research Report (1.00)

Industry: Health & Medicine > Health Care Technology (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Dimensionality Reduction (0.40)

Add feedback

(Individual) Fairness for $k$-Clustering

Mahabadi, Sepideh, Vakilian, Ali

arXiv.org Machine LearningFeb-16-2020

We give a local search based algorithm for $k$-median ($k$-means) clustering from the perspective of individual fairness. More precisely, for a point $x$ in a point set $P$ of size $n$, let $r(x)$ be the minimum radius such that the ball of radius $r(x)$ centered at $x$ has at least $n/k$ points from $P$. Intuitively, if a set of $k$ random points are chosen from $P$ as centers, every point $x\in P$ expects to have a center within radius $r(x)$. An individually fair clustering provides such a guarantee for every point $x\in P$. This notion of fairness was introduced in [Jung et al., 2019] where they showed how to get an approximately feasible $k$-clustering with respect to this fairness condition. In this work, we show how to get an approximately optimal such fair $k$-clustering. The $k$-median ($k$-means) cost of our solution is within a constant factor of the cost of an optimal fair $k$-clustering, and our solution approximately satisfies the fairness condition (also within a constant factor). Further, we complement our theoretical bounds with empirical evaluation.

algorithm, denote, local search algorithm, (15 more...)

arXiv.org Machine Learning

2002.06742

Country:

North America > United States > California > Yolo County > Davis (0.14)
Asia > Afghanistan > Parwan Province > Charikar (0.04)
North America > United States > Wisconsin > Dane County > Madison (0.04)
(2 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Add feedback