AITopics | Clustering

Collaborating Authors

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

Quantized Compressive K-Means

Schellekens, Vincent, Jacques, Laurent

arXiv.org Machine LearningApr-26-2018

The recent framework of compressive statistical learning aims at designing tractable learning algorithms that use only a heavily compressed representation-or sketch-of massive datasets. Compressive K-Means (CKM) is such a method: it estimates the centroids of data clusters from pooled, non-linear, random signatures of the learning examples. While this approach significantly reduces computational time on very large datasets, its digital implementation wastes acquisition resources because the learning examples are compressed only after the sensing stage. The present work generalizes the sketching procedure initially defined in Compressive K-Means to a large class of periodic nonlinearities including hardware-friendly implementations that compressively acquire entire datasets. This idea is exemplified in a Quantized Compressive K-Means procedure, a variant of CKM that leverages 1-bit universal quantization (i.e. retaining the least significant bit of a standard uniform quantizer) as the periodic sketch nonlinearity. Trading for this resource-efficient signature (standard in most acquisition schemes) has almost no impact on the clustering performances, as illustrated by numerical experiments.

artificial intelligence, machine learning, sketch, (16 more...)

arXiv.org Machine Learning

1804.10109

Country: Europe (0.28)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.69)

Add feedback

RULLS: Randomized Union of Locally Linear Subspaces for Feature Engineering

Lokare, Namita, Silva, Jorge, Kabul, Ilknur Kaynar

arXiv.org Machine LearningApr-25-2018

Feature engineering plays an important role in the success of a machine learning model. Most of the effort in training a model goes into data preparation and choosing the right representation. In this paper, we propose a robust feature engineering method, Randomized Union of Locally Linear Subspaces (RULLS). We generate sparse, non-negative, and rotation invariant features in an unsupervised fashion. RULLS aggregates features from a random union of subspaces by describing each point using globally chosen landmarks. These landmarks serve as anchor points for choosing subspaces. Our method provides a way to select features that are relevant in the neighborhood around these chosen landmarks. Distances from each data point to $k$ closest landmarks are encoded in the feature matrix. The final feature representation is a union of features from all chosen subspaces. The effectiveness of our algorithm is shown on various real-world datasets for tasks such as clustering and classification of raw data and in the presence of noise. We compare our method with existing feature generation methods. Results show a high performance of our method on both classification and clustering tasks.

artificial intelligence, dataset, machine learning, (15 more...)

arXiv.org Machine Learning

1804.0977

Country: North America > United States > North Carolina (0.14)

Genre: Research Report > New Finding (0.48)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.48)

Add feedback

Revealing patterns in HIV viral load data and classifying patients via a novel machine learning cluster summarization method

Farooq, Samir, Weisenthal, Samuel J., Trayhan, Melissa, White, Robert J., Bush, Kristen, Mariuz, Peter R., Zand, Martin S.

arXiv.org Machine LearningApr-25-2018

HIV RNA viral load (VL) is an important outcome variable in studies of HIV infected persons. There exists only a handful of methods which classify patients by viral load patterns. Most methods place limits on the use of viral load measurements, are often specific to a particular study design, and do not account for complex, temporal variation. To address this issue, we propose a set of four unambiguous computable characteristics (features) of time-varying HIV viral load patterns, along with a novel centroid-based classification algorithm, which we use to classify a population of 1,576 HIV positive clinic patients into one of five different viral load patterns (clusters) often found in the literature: durably suppressed viral load (DSVL), sustained low viral load (SLVL), sustained high viral load (SHVL), high viral load suppression (HVLS), and rebounding viral load (RVL). The centroid algorithm summarizes these clusters in terms of their centroids and radii. We show that this allows new viral load patterns to be assigned pattern membership based on the distance from the centroid relative to its radius, which we term radial normalization classification. This method has the benefit of providing an objective and quantitative method to assign viral load pattern membership with a concise and interpretable model that aids clinical decision making. This method also facilitates meta-analyses by providing computably distinct HIV categories. Finally we propose that this novel centroid algorithm could also be useful in the areas of cluster comparison for outcomes research and data reduction in machine learning.

artificial intelligence, machine learning, viral load, (16 more...)

arXiv.org Machine Learning

1804.11195

Country: North America > United States (1.00)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
Health & Medicine > Therapeutic Area > Immunology > HIV (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

HG-means: A scalable hybrid genetic algorithm for minimum sum-of-squares clustering

Gribel, Daniel, Vidal, Thibaut

arXiv.org Machine LearningApr-25-2018

Minimum sum-of-squares clustering (MSSC) is a widely used clustering model, of which the popular K-means algorithm constitutes a local minimizer. It is well known that the solutions of K-means can be arbitrarily distant from the true MSSC global optimum, and dozens of alternative heuristics have been proposed for this problem. However, no other algorithm has been commonly adopted in the literature. This may be related to differences of computational effort, or to the assumption that a better solution of the MSSC has only a minor impact on classification or generalization capabilities. In this article, we dispute this belief. We introduce an efficient population-based metaheuristic that uses K-means as a local search in combination with problem-tailored crossover, mutation, and diversification operators. This algorithm can be interpreted as a multi-start K-means, in which the initial center positions are carefully sampled based on the search history. The approach is scalable and accurate, outperforming all recent state-of-the-art algorithms for MSSC in terms of solution quality, measured by the depth of local minima. This enhanced accuracy leads to classification results which are significantly closer to the ground truth than those of other algorithms, for overlapping Gaussian-mixture datasets with a large number of features and clusters. Therefore, improved global optimization methods appear to be essential to better exploit the MSSC model in high dimension.

algorithm, artificial intelligence, machine learning, (19 more...)

arXiv.org Machine Learning

1804.09813

Country: North America > United States (1.00)

Genre: Research Report > Experimental Study (0.46)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.97)

Add feedback

Complex Network Analysis of Men Single ATP Tennis Matches

Michieli, Umberto

arXiv.org Artificial IntelligenceApr-22-2018

During the last decades Network Science field has been rediscovered and addressed as the "new science" [2], [3]. A lot of issues have been (re-)examined thanks to Network Science techniques, which are nowadays permeating the way we face the world as a unique interconnected component. The presence and the immediate availability of a huge amount of digital data describing every kind of network and the way in which its nodes interact, has made possible an interdisciplinary analysis of many large-scale systems. Similar techniques have been recently applied also to professional sports, in order to discover complex interactions phenomena and universal rules which are almost invisible and difficult to recognize restricting the attention to small networks or to microscopic level. For example, complexnetwork analysis were conducted on soccer (e.g. in [4] and [5]), football ([6] and [7]), basket ([8] and [9]), baseball ([10]) and cricket ([11] and [12]), just to name a few. In professional tennis as well, there are few studies examining how to map matches into complex networks and then developing new ranking methods alternative to the ATP (Association of Tennis Professionals) official one. The first work of this kind is represented by [13], where the authors explained the network generation and then they performed some simple analysis on single Grand Slams tournaments matches only (i.e.

artificial intelligence, data mining, machine learning, (21 more...)

arXiv.org Artificial Intelligence

1804.08138

Genre: Research Report (0.82)

Industry: Leisure & Entertainment > Sports > Tennis (1.00)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)
Information Technology > Communications > Networks (0.64)

Add feedback

Robust and scalable learning of data manifolds with complex topologies via ElPiGraph

Albergante, Luca, Mirkes, Evgeny M., Chen, Huidong, Martin, Alexis, Faure, Louis, Barillot, Emmanuel, Pinello, Luca, Gorban, Alexander N., Zinovyev, Andrei

arXiv.org Machine LearningApr-20-2018

We present ElPiGraph, a method for approximating data distributions having non-trivial topological features such as the existence of excluded regions or branching structures. Unlike many existing methods, ElPiGraph is not based on the construction of a k-nearest neighbour graph, a procedure that can perform poorly in the case of multidimensional and noisy data. Instead, ElPiGraph constructs elastic principal graphs in a more robust way by minimizing elastic energy, applying graph grammars and explicitly controlling topological complexity. Using trimmed approximation error function makes ElPiGraph extremely robust to the presence of background noise without decreasing computational performance and allows it to deal with complex cases of manifold learning (for example, ElPiGraph can learn disconnected intersecting manifolds). Thanks to the quasi-quadratic nature of the elastic function, ElPiGraph performs almost as fast as a simple k-means clustering and, therefore, is much more scalable than alternative methods, and can work on large datasets containing millions of high dimensional points on a personal computer. The excellent performance of the method opens the possibility to apply resampling and to approximate complex data structures via principal graph ensembles which can be used to construct consensus principal graphs. ElPiGraph is currently implemented in five programming languages and accompanied by a graphical user interface, which makes it a versatile tool to deal with complex data in various fields from molecular biology, where it can be used to infer pseudo-time trajectories from single-cell RNASeq, to astronomy, where it can be used to approximate complex structures in the distribution of galaxies.

artificial intelligence, elpigraph, machine learning, (18 more...)

arXiv.org Machine Learning

1804.0758

Country:

Europe (0.93)
North America > United States > Massachusetts (0.28)

Genre: Research Report (0.50)

Industry:

Health & Medicine > Therapeutic Area (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.34)

Add feedback

Clustering Analysis on Locally Asymptotically Self-similar Processes

Peng, Qidi, Rao, Nan, Zhao, Ran

arXiv.org Machine LearningApr-19-2018

We show a sufficient condition on the dissimilarity measure that leads to the consistency of the algorithms for clustering offline and online data settings, respectively. As an example of application, clustering synthetic data sampled from multifractional Brownian motions is provided.

algorithm, artificial intelligence, machine learning, (17 more...)

arXiv.org Machine Learning

1804.06234

Country:

Europe (1.00)
Asia (1.00)
North America > United States (0.93)

Genre: Research Report (0.82)

Industry:

Banking & Finance > Trading (1.00)
Government (0.93)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

A stigmergy-based analysis of city hotspots to discover trends and anomalies in urban transportation usage

Alfeo, Antonio L., Cimino, Mario G. C. A., Egidi, Sara, Lepri, Bruno, Vaglini, Gigliola

arXiv.org Artificial IntelligenceApr-16-2018

A key aspect of a sustainable urban transportation system is the effectiveness of transportation policies. To be effective, a policy has to consider a broad range of elements, such as pollution emission, traffic flow, and human mobility. Due to the complexity and variability of these elements in the urban area, to produce effective policies remains a very challenging task. With the introduction of the smart city paradigm, a widely available amount of data can be generated in the urban spaces. Such data can be a fundamental source of knowledge to improve policies because they can reflect the sustainability issues underlying the city. In this context, we propose an approach to exploit urban positioning data based on stigmergy, a bio-inspired mechanism providing scalar and temporal aggregation of samples. By employing stigmergy, samples in proximity with each other are aggregated into a functional structure called trail. The trail summarizes relevant dynamics in data and allows matching them, providing a measure of their similarity. Moreover, this mechanism can be specialized to unfold specific dynamics. Specifically, we identify high-density urban areas (i.e hotspots), analyze their activity over time, and unfold anomalies. Moreover, by matching activity patterns, a continuous measure of the dissimilarity with respect to the typical activity pattern is provided. This measure can be used by policy makers to evaluate the effect of policies and change them dynamically. As a case study, we analyze taxi trip data gathered in Manhattan from 2013 to 2015.

data mining, evolutionary algorithm, machine learning, (20 more...)

arXiv.org Artificial Intelligence

1804.05697

Country:

Asia (0.93)
Europe > Italy (0.68)
North America > United States > New York (0.15)

Genre: Research Report (0.64)

Industry:

Transportation > Passenger (1.00)
Transportation > Ground > Road (1.00)
Government (1.00)
Transportation > Infrastructure & Services (0.90)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Communications (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
(2 more...)

Add feedback

Ten Machine Learning Algorithms You Should Know to Become a Data Scientist

#artificialintelligenceApr-15-2018, 14:11:17 GMT

Machine Learning Practitioners have different personalities. While some of them are "I am an expert in X and X can train on any type of data", where X some algorithm, some others are "Right tool for the right job people". A lot of them also subscribe to "Jack of all trades. Master of one" strategy, where they have one area of deep expertise and know slightly about different fields of Machine Learning. That said, no one can deny the fact that as practicing Data Scientists, we will have to know basics of some common machine learning algorithms, which would help us engage with a new-domain problem we come across.

algorithm, artificial intelligence, machine learning, (11 more...)

#artificialintelligence

Country: North America > United States > California > Santa Clara County > Palo Alto (0.05)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.49)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.38)

Add feedback

Adversarial Clustering: A Grid Based Clustering Algorithm Against Active Adversaries

Wei, Wutao, Xi, Bowei, Kantarcioglu, Murat

arXiv.org Machine LearningApr-13-2018

Nowadays more and more data are gathered for detecting and preventing cyber attacks. In cyber security applications, data analytics techniques have to deal with active adversaries that try to deceive the data analytics models and avoid being detected. The existence of such adversarial behavior motivates the development of robust and resilient adversarial learning techniques for various tasks. Most of the previous work focused on adversarial classification techniques, which assumed the existence of a reasonably large amount of carefully labeled data instances. However, in practice, labeling the data instances often requires costly and time-consuming human expertise and becomes a significant bottleneck. Meanwhile, a large number of unlabeled instances can also be used to understand the adversaries' behavior. To address the above mentioned challenges, in this paper, we develop a novel grid based adversarial clustering algorithm. Our adversarial clustering algorithm is able to identify the core normal regions, and to draw defensive walls around the centers of the normal objects utilizing game theoretic ideas. Our algorithm also identifies sub-clusters of attack objects, the overlapping areas within clusters, and outliers which may be potential anomalies.

data mining, defensive wall, machine learning, (16 more...)

arXiv.org Machine Learning

1804.0478

Country: North America > United States (0.67)

Genre: Research Report (0.50)

Industry:

Information Technology > Security & Privacy (1.00)
Government > Military > Cyberwarfare (0.34)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback