AITopics | Clustering

Collaborating Authors

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

Choosing the Right Clustering Algorithm for your Dataset - KDnuggets

#artificialintelligenceDec-4-2019, 01:47:46 GMT

Data clustering is an essential step in the arrangement of a correct and throughout data model. To fulfill an analysis, the volume of information should be sorted out according to the commonalities. The main question is, what commonality parameter provides the best results – and what is implicated under "the best" definition at all. This article should be useful for the beginning data scientists, or for experts who want to refresh their memories on the topic. It includes the most widespread clustering algorithms, as well as their insightful review.

algorithm, dataset, right clustering algorithm, (14 more...)

#artificialintelligence

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Identifying the number of clusters for K-Means: A hypersphere density based approach

Nanjundan, Sukavanan, Sankaran, Shreeviknesh, Arjun, C. R., Anand, G. Paavai

arXiv.org Machine LearningDec-4-2019

Application of K-Means algorithm is restricted by the fact that the number of clusters should be known beforehand. Previously suggested methods to solve this problem are either ad hoc or require parametric assumptions and complicated calculations. The proposed method aims to solve this conundrum by considering cluster hypersphere density as the factor to determine the number of clusters in the given dataset. The density is calculated by assuming a hypersphere around the cluster centroid for n-different number of clusters. The calculated values are plotted against their corresponding number of clusters and then the optimum number of clusters is obtained after assaying the elbow region of the graph. The method is simple, easy to comprehend, and provides robust and reliable results.

cluster hypersphere density, hypersphere density, optimal number, (14 more...)

arXiv.org Machine Learning

1912.00643

Country:

North America > United States > California > Alameda County > Berkeley (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Japan (0.04)
Asia > India > Tamil Nadu > Chennai (0.04)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Structure Learning with Similarity Preserving

Kang, Zhao, Lu, Xiao, Lu, Yiwei, Peng, Chong, Xu, Zenglin

arXiv.org Machine LearningDec-3-2019

Leveraging on the underlying low-dimensional structure of data, low-rank and sparse modeling approaches have achieved great success in a wide range of applications. However, in many applications the data can display structures beyond simply being low-rank or sparse. Fully extracting and exploiting hidden structure information in the data is always desirable and favorable. To reveal more underlying effective manifold structure, in this paper, we explicitly model the data relation. Specifically, we propose a structure learning framework that retains the pairwise similarities between the data points. Rather than just trying to reconstruct the original data based on self-expression, we also manage to reconstruct the kernel matrix, which functions as similarity preserving. Consequently, this technique is particularly suitable for the class of learning problems that are sensitive to sample similarity, e.g., clustering and semisupervised classification. To take advantage of representation power of deep neural network, a deep auto-encoder architecture is further designed to implement our model. Extensive experiments on benchmark data sets demonstrate that our proposed framework can consistently and significantly improve performance on both evaluation tasks. We conclude that the quality of structure learning can be enhanced if similarity information is incorporated. Introduction With the advancements in information technology, high-dimensional data become very common for representing the data. However, it is difficult to deal Preprint submitted to Elsevier December 4, 2019 arXiv:1912.01197v1 Fortunately, in practice data are not unstructured.

artificial intelligence, machine learning, similarity, (19 more...)

arXiv.org Machine Learning

1912.01197

Country:

Asia > China > Shandong Province > Qingdao (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (1.00)

Industry: Health & Medicine (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.87)

Add feedback

Multi-view Subspace Clustering via Partition Fusion

Lv, Juncheng, Kang, Zhao, Wang, Boyu, Ji, Luping, Xu, Zenglin

arXiv.org Machine LearningDec-3-2019

Multi-view clustering is an important approach to analyze multi-view data in an unsupervised way. Among various methods, the multi-view subspace clustering approach has gained increasing attention due to its encouraging performance. Basically, it integrates multi-view information into graphs, which are then fed into spectral clustering algorithm for final result. Orthogonal to current work, we propose to fuse multi-view information in a partition space, which enhances the robustness of Multi-view clustering. Specifically, we generate multiple partitions and integrate them to find the shared partition. The proposed model unifies graph learning, generation of basic partitions, and view weight learning. We have conducted comprehensive experiments on benchmark datasets and our empirical results verify the effectiveness and robustness of our approach. Introduction In many real-world problems, data are collected from different sources in diverse domains or described by various feature collectors [1, 2, 3, 4, 5]. To process these kinds of data, a number of multi-view learning algorithms have been developed [8, 9, 10, 11, 12].

graph, partition, subspace, (15 more...)

arXiv.org Machine Learning

1912.01201

Country:

North America > Canada (0.04)
Asia > Middle East > Jordan (0.04)
Asia > China > Sichuan Province > Chengdu (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.89)

Add feedback

A Dataset Schema for Cooperative Learning from Demonstration in Multi-robots Systems

Simões, Marco A. C., da Silva, Robson Marinho, Nogueira, Tatiane

arXiv.org Artificial IntelligenceDec-3-2019

To achieve these common goals, agents in a MAS should be capable of interacting with other agents, not simply by exchanging data, but by engaging as in social activities, such as those people participate in their daily lives: cooperation, coordination, negotiation, and the like. In MASs, agents are assumed to be autonomous - capable of making independent decisions about to do in order to satisfy their design objectives, and thus they need mechanisms that allow them to synchronize and to coordinate their activities at run time [31]. Although one of the main issues in MASs is the agents' coordination structure, this is not hard-wired at design time, as MASs are typically in standard concurrent/distributed systems. One well-known strategy for coordination in MAS is the design of multi-agent coordinated plans [7][35][36][33][14] that include, not only usual agents' actions defined by their effectors, but also communication actions to achieve the necessary synchronization and coordination. To represent communication actions, some specific languages were created, e.g.

agent, dataset, setplay, (12 more...)

arXiv.org Artificial Intelligence

doi: 10.1007/s10846-019-01123-w

1912.01741

Country:

South America > Brazil > Bahia > Salvador (0.04)
South America > Brazil > São Paulo (0.04)
North America > United States > Texas > Dallas County > Dallas (0.04)
(7 more...)

Genre:

Workflow (1.00)
Research Report (1.00)

Industry: Leisure & Entertainment > Sports > Soccer (1.00)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (0.46)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Fuzzy Logic (0.46)

Add feedback

Clustering via Ant Colonies: Parameter Analysis and Improvement of the Algorithm

Chavarria-Molina, Jeffry, Fallas-Monge, Juan Jose, Trejos-Zelaya, Javier

arXiv.org Machine LearningDec-2-2019

An ant colony optimization approach for partitioning a set of objects is proposed. In order to minimize the intra-variance, or within sum-of-squares, of the partitioned classes, we construct ant-like solutions by a constructive approach that selects objects to be put in a class with a probability that depends on the distance between the object and the centroid of the class (visibility) and the pheromone trail; the latter depends on the class memberships that have been defined along the iterations. The procedure is improved with the application of K-means algorithm in some iterations of the ant colony method. We performed a simulation study in order to evaluate the method with a Monte Carlo experiment that controls some sensitive parameters of the clustering problem. After some tuning of the parameters, the method has also been applied to some benchmark real-data sets. Encouraging results were obtained in nearly all cases.

artificial intelligence, evolutionary algorithm, machine learning, (19 more...)

arXiv.org Machine Learning

1912.01105

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
Europe > Germany > Berlin (0.05)
North America > Costa Rica > Cartago Province > Cartago (0.04)
(7 more...)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Evolutionary Systems (1.00)

Add feedback

On the optimality of kernels for high-dimensional clustering

Vankadara, Leena Chennuru, Ghoshdastidar, Debarghya

arXiv.org Machine LearningDec-1-2019

This paper studies the optimality of kernel methods in high-dimensional data clustering. Recent works have studied the large sample performance of kernel clustering in the high-dimensional regime, where Euclidean distance becomes less informative. However, it is unknown whether popular methods, such as kernel k-means, are optimal in this regime. We consider the problem of high-dimensional Gaussian clustering and show that, with the exponential kernel function, the sufficient conditions for partial recovery of clusters using the NP-hard kernel k-means objective matches the known information-theoretic limit up to a factor of $\sqrt{2}$ for large $k$. It also exactly matches the known upper bounds for the non-kernel setting. We also show that a semi-definite relaxation of the kernel k-means procedure matches up to constant factors, the spectral threshold, below which no polynomial-time algorithm is known to succeed. This is the first work that provides such optimality guarantees for the kernel k-means as well as its convex relaxation. Our proofs demonstrate the utility of the less known polynomial concentration results for random variables with exponentially decaying tails in a higher-order analysis of kernel methods.

inequality, kernel, random variable, (16 more...)

arXiv.org Machine Learning

1912.00458

Country:

Asia > Japan > Honshū > Kantō > Kanagawa Prefecture (0.04)
Asia > Afghanistan > Parwan Province > Charikar (0.04)
South America > Brazil > São Paulo (0.04)
(4 more...)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.88)

Add feedback

Data-Driven Optimization of Public Transit Schedule

Basak, Sanchita, Sun, Fangzhou, Sengupta, Saptarshi, Dubey, Abhishek

arXiv.org Machine LearningNov-29-2019

Bus transit systems are the backbone of public transportation in the United States. An important indicator of the quality of service in such infrastructures is on-time performance at stops, with published transit schedules playing an integral role governing the level of success of the service. However there are relatively few optimization architectures leveraging stochastic search that focus on optimizing bus timetables with the objective of maximizing probability of bus arrivals at timepoints with delays within desired on-time ranges. In addition to this, there is a lack of substantial research considering monthly and seasonal variations of delay patterns integrated with such optimization strategies. To address these, this paper makes the following contributions to the corpus of studies on transit on-time performance optimization: (a) an unsupervised clustering mechanism is presented which groups months with similar seasonal delay patterns, (b) the problem is formulated as a single-objective optimization task and a greedy algorithm, a genetic algorithm (GA) as well as a particle swarm optimization (PSO) algorithm are employed to solve it, (c) a detailed discussion on empirical results comparing the algorithms are provided and sensitivity analysis on hyper-parameters of the heuristics are presented along with execution times, which will help practitioners looking at similar problems. The analyses conducted are insightful in the local context of improving public transit scheduling in the Nashville metro region as well as informative from a global perspective as an elaborate case study which builds upon the growing corpus of empirical studies using nature-inspired approaches to transit schedule optimization. Keywords: timetable optimization · genetic algorithm · particle swarm optimization · sensitivity analysis · scheduling 1 Introduction Bus systems are the backbone of public transportation in the US, carrying over 47% of all public passenger trips and 19,380 million passenger miles in the US [18] . For the majority of cities in the US which do not have enough urban forms or budget to build expensive transit infrastructures like subways, the reliance is on buses as the most important transit system since bus systems have advantages arXiv:1912.02574v1

artificial intelligence, evolutionary algorithm, machine learning, (19 more...)

arXiv.org Machine Learning

1912.02574

Country:

North America > United States > Tennessee > Davidson County > Nashville (0.04)
North America > United States > Massachusetts > Suffolk County > Boston (0.04)
North America > United States > District of Columbia > Washington (0.04)
Asia > China > Hong Kong (0.04)

Genre: Research Report (1.00)

Industry:

Transportation > Infrastructure & Services (1.00)
Transportation > Ground > Road (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Evolutionary Systems (1.00)

Add feedback

Latent space conditioning for improved classification and anomaly detection

Norlander, Erik, Sopasakis, Alexandros

arXiv.org Machine LearningNov-28-2019

We propose a new type of variational autoencoder to perform improved pre-processing for clustering and anomaly detection on data with a given label. Anomalies however are not known or labeled. We call our method conditional latent space variational autonencoder since it separates the latent space by conditioning on information within the data. The method fits one prior distribution to each class in the dataset, effectively expanding the prior distribution to include a Gaussian mixture model. Our approach is compared against the capabilities of a typical variational autoencoder by measuring their V-score during cluster formation with respect to the k-means and EM algorithms. For anomaly detection, we use a new metric composed of the mass-volume and excess-mass curves which can work in an unsupervised setting. We compare the results between established methods such as as isolation forest, local outlier factor and one-class support vector machine.

artificial intelligence, data mining, machine learning, (16 more...)

arXiv.org Machine Learning

1911.10599

Country:

North America > United States > Texas > Dallas County > Dallas (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
Europe > Italy > Tuscany > Pisa Province > Pisa (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Data Science > Data Mining > Anomaly Detection (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.93)

Add feedback

Lifelong Spectral Clustering

Sun, Gan, Cong, Yang, Wang, Qianqian, Li, Jun, Fu, Yun

arXiv.org Machine LearningNov-28-2019

In the past decades, spectral clustering (SC) has become one of the most effective clustering algorithms. However, most previous studies focus on spectral clustering tasks with a fixed task set, which cannot incorporate with a new spectral clustering task without accessing to previously learned tasks. In this paper, we aim to explore the problem of spectral clustering in a lifelong machine learning framework, i.e., Lifelong Spectral Clustering (L2SC). Its goal is to efficiently learn a model for a new spectral clustering task by selectively transferring previously accumulated experience from knowledge library. Specifically, the knowledge library of L2SC contains two components: 1) orthogonal basis library: capturing latent cluster centers among the clusters in each pair of tasks; 2) feature embedding library: embedding the feature manifold information shared among multiple related tasks. As a new spectral clustering task arrives, L2SC firstly transfers knowledge from both basis library and feature library to obtain encoding matrix, and further redefines the library base over time to maximize performance across all the clustering tasks. Meanwhile, a general online update formulation is derived to alternatively update the basis library and feature library. Finally, the empirical experiments on several real-world benchmark datasets demonstrate that our L2SC model can effectively improve the clustering performance when comparing with other state-of-the-art spectral clustering algorithms.

dataset, learning, spectral, (17 more...)

arXiv.org Machine Learning

1911.11908

Country:

Asia > Middle East > Jordan (0.05)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > United States > Wisconsin (0.04)
(3 more...)

Genre: Research Report (0.64)

Industry: Education (0.31)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.87)

Add feedback