AITopics | Clustering

Collaborating Authors

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

Making AI Forget You: Data Deletion in Machine Learning

Ginart, Antonio, Guan, Melody, Valiant, Gregory, Zou, James

arXiv.org Machine LearningJul-11-2019

Intense recent discussions have focused on how to provide individuals with control over when their data can and cannot be used -- the EU's Right To Be Forgotten regulation is an example of this effort. In this paper we initiate a framework studying what to do when it is no longer permissible to deploy models derivative from specific user data. In particular, we formulate the problem of how to efficiently delete individual data points from trained machine learning models. For many standard ML models, the only way to completely remove an individual's data is to retrain the whole model from scratch on the remaining data, which is often not computationally practical. We investigate algorithmic principles that enable efficient data deletion in ML. For the specific setting of k-means clustering, we propose two provably deletion efficient algorithms which achieve an average of over 100X improvement in deletion efficiency across 6 datasets, while producing clusters of comparable statistical quality to a canonical k-means++ baseline.

artificial intelligence, deletion, machine learning, (16 more...)

arXiv.org Machine Learning

1907.05012

Country: North America > United States (1.00)

Genre: Research Report (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine (0.92)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

The Price of Interpretability

Bertsimas, Dimitris, Delarue, Arthur, Jaillet, Patrick, Martin, Sebastien

arXiv.org Machine LearningJul-8-2019

When quantitative models are used to support decision-making on complex and important topics, understanding a model's ``reasoning'' can increase trust in its predictions, expose hidden biases, or reduce vulnerability to adversarial attacks. However, the concept of interpretability remains loosely defined and application-specific. In this paper, we introduce a mathematical framework in which machine learning models are constructed in a sequence of interpretable steps. We show that for a variety of models, a natural choice of interpretable steps recovers standard interpretability proxies (e.g., sparsity in linear models). We then generalize these proxies to yield a parametrized family of consistent measures of model interpretability. This formal definition allows us to quantify the ``price'' of interpretability, i.e., the tradeoff with predictive accuracy. We demonstrate practical algorithms to apply our framework on real and synthetic datasets.

artificial intelligence, interpretability, machine learning, (15 more...)

arXiv.org Machine Learning

1907.03419

Country:

North America > United States > Massachusetts (0.28)
North America > United States > New York (0.28)

Genre: Research Report (0.50)

Industry:

Health & Medicine (1.00)
Government (0.87)
Information Technology > Security & Privacy (0.87)
Education (0.68)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Add feedback

A Multi-Stage Clustering Framework for Automotive Radar Data

Scheiner, Nicolas, Appenrodt, Nils, Dickmann, Jürgen, Sick, Bernhard

arXiv.org Machine LearningJul-8-2019

Radar sensors provide a unique method for executing environmental perception tasks towards autonomous driving. Especially their capability to perform well in adverse weather conditions often makes them superior to other sensors such as cameras or lidar. Nevertheless, the high sparsity and low dimensionality of the commonly used detection data level is a major challenge for subsequent signal processing. Therefore, the data points are often merged in order to form larger entities from which more information can be gathered. The merging process is often implemented in form of a clustering algorithm. This article describes a novel approach for first filtering out static background data before applying a twostage clustering approach. The two-stage clustering follows the same paradigm as the idea for data association itself: First, clustering what is ought to belong together in a low dimensional parameter space, then, extracting additional features from the newly created clusters in order to perform a final clustering step. Parameters are optimized for filtering and both clustering steps. All techniques are assessed both individually and as a whole in order to demonstrate their effectiveness. Final results indicate clear benefits of the first two methods and also the cluster merging process under specific circumstances.

data mining, detection, machine learning, (18 more...)

arXiv.org Machine Learning

1907.03511

Country:

North America > United States (0.46)
Europe > Germany (0.28)

Genre: Research Report (0.84)

Industry:

Automobiles & Trucks > Manufacturer (0.46)
Transportation > Ground > Road (0.34)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.89)

Add feedback

Visualization of Emergency Department Clinical Data for Interpretable Patient Phenotyping

Hurley, Nathan C., Haimovich, Adrian D., Taylor, R. Andrew, Mortazavi, Bobak J.

arXiv.org Machine LearningJul-5-2019

Visualization of Emergency Department Clinical Data for Interpretable Patient Phenotyping null Nathan C. Hurley a,, Adrian D. Haimovich b, R. Andrew Taylor b, Bobak J. Mortazavi a a Department of Computer Science and Engineering, T exas A&M University, United States b Department of Emergency Medicine, Y ale School of Medicine, United StatesAbstract Visual summarization of clinical data collected on patients contained within the electronic health record (EHR) may enable precise and rapid triage at the time of patient presentation to an emergency department (ED). The triage process is critical in the appropriate allocation of resources and in anticipating eventual patient disposition, typically admission to the hospital or discharge home. EHR data are high-dimensional and complex, but offer the opportunity to discover and characterize underlying data-driven patient phenotypes. Data-driven phenotypes are intended to relieve reliance on weak labels like diagnosis codes and to aid in identifying populations of existing patients that are most similar to a specific patient. These phenotypes will enable improved, personalized therapeutic decision making and prognostication. In this work, we focus on the challenge of two-dimensional patient projections. A low dimensional embedding offers visual interpretability lost in higher dimensions. While linear dimensionality reduction techniques such as principal component analysis are often used towards this aim, they are insufficient to describe the variance of patient data. This linear reduction does not account for higher order, nonlinear interactions of variables. In this work, we employ the newly-described nonlinear embedding technique called uniform manifold approximation and projection (UMAP). UMAP seeks to capture both local and global structures in high-dimensional data. We then use Gaussian mixture models to identify clusters in the embedded data and use the adjusted Rand index (ARI) to establish stability in the discovery of these clusters. This technique is applied to five common clinical chief complaints from a real-world ED EHR dataset, describing the emergent properties of discovered clusters. We observe clinically-relevant cluster attributes, suggesting that visual embeddings of EHR data using nonlinear dimensionality reduction is a promising approach to reveal data-driven patient phenotypes. In the five chief complaints, we find between 2 and 6 clusters, with the peak mean pairwise ARI between subsequent training iterations to range from 0.35 to 0.74. Introduction Electronic health records (EHRs) include heterogeneous data that represent past and ongoing patient care episodes.

artificial intelligence, bioinformatics, machine learning, (20 more...)

arXiv.org Machine Learning

1907.11039

Country: North America > United States > Florida > Hillsborough County > University (0.24)

Genre: Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)
Health & Medicine > Health Care Providers & Services (1.00)
Health & Medicine > Diagnostic Medicine (1.00)
Health & Medicine > Health Care Technology > Medical Record (0.95)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Biomedical Informatics (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Hybridized Threshold Clustering for Massive Data

Luo, Jianmei, Annakula, ChandraVyas, Kannamareddy, Aruna Sai, Sekhon, Jasjeet S., Hsu, William Henry, Higgins, Michael

arXiv.org Machine LearningJul-5-2019

As the size $n$ of datasets become massive, many commonly-used clustering algorithms (for example, $k$-means or hierarchical agglomerative clustering (HAC) require prohibitive computational cost and memory. In this paper, we propose a solution to these clustering problems by extending threshold clustering (TC) to problems of instance selection. TC is a recently developed clustering algorithm designed to partition data into many small clusters in linearithmic time (on average). Our proposed clustering method is as follows. First, TC is performed and clusters are reduced into single "prototype" points. Then, TC is applied repeatedly on these prototype points until sufficient data reduction has been obtained. Finally, a more sophisticated clustering algorithm is applied to the reduced prototype points, thereby obtaining a clustering on all $n$ data points. This entire procedure for clustering is called iterative hybridized threshold clustering (IHTC). Through simulation results and by applying our methodology on several real datasets, we show that IHTC combined with $k$-means or HAC substantially reduces the run time and memory usage of the original clustering algorithms while still preserving their performance. Additionally, IHTC helps prevent singular data points from being overfit by clustering algorithms.

artificial intelligence, data mining, machine learning, (17 more...)

arXiv.org Machine Learning

1907.02907

Country: North America > United States > California (0.28)

Genre: Research Report (0.40)

Industry:

Energy (0.68)
Government > Regional Government > North America Government > United States Government (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Feature-Based Image Clustering and Segmentation Using Wavelets

Chen, Junyu, Frey, Eric C.

arXiv.org Machine LearningJul-4-2019

Pixel intensity is a widely used feature for clustering and segmentation algorithms, the resulting segmentation using only intensity values might suffer from noises and lack of spatial context information. Wavelet transform is often used for image denoising and classification. We proposed a novel method to incorporate Wavelet features in segmentation and clustering algorithms. The conventional K-means, Fuzzy c-means (FCM), and Active contour without edges (ACWE) algorithms were modified to adapt Wavelet features, leading to robust clustering/segmentation algorithms. A weighting parameter to control the weight of low-frequency sub-band information was also introduced. The new algorithms showed the capability to converge to different segmentation results based on the frequency information derived from the Wavelet sub-bands.

algorithm, artificial intelligence, machine learning, (15 more...)

arXiv.org Machine Learning

1907.03591

Country:

North America > United States > Florida > Orange County > Orlando (0.04)
Europe > Netherlands > North Holland > Amsterdam (0.04)

Genre: Research Report (0.69)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.70)

Add feedback

k is the Magic Number -- Inferring the Number of Clusters Through Nonparametric Concentration Inequalities

Hess, Sibylle, Duivesteijn, Wouter

arXiv.org Machine LearningJul-4-2019

Most convex and nonconvex clustering algorithms come with one crucial parameter: the $k$ in $k$-means. To this day, there is not one generally accepted way to accurately determine this parameter. Popular methods are simple yet theoretically unfounded, such as searching for an elbow in the curve of a given cost measure. In contrast, statistically founded methods often make strict assumptions over the data distribution or come with their own optimization scheme for the clustering objective. This limits either the set of applicable datasets or clustering algorithms. In this paper, we strive to determine the number of clusters by answering a simple question: given two clusters, is it likely that they jointly stem from a single distribution? To this end, we propose a bound on the probability that two clusters originate from the distribution of the unified cluster, specified only by the sample mean and variance. Our method is applicable as a simple wrapper to the result of any clustering method minimizing the objective of $k$-means, which includes Gaussian mixtures and Spectral Clustering. We focus in our experimental evaluation on an application for nonconvex clustering and demonstrate the suitability of our theoretical results. Our \textsc{SpecialK} clustering algorithm automatically determines the appropriate value for $k$, without requiring any data transformation or projection, and without assumptions on the data distribution. Additionally, it is capable to decide that the data consists of only a single cluster, which many existing algorithms cannot.

artificial intelligence, data mining, machine learning, (18 more...)

arXiv.org Machine Learning

1907.02343

Country:

Europe > Netherlands > North Brabant > Eindhoven (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.65)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Locally Private k-Means Clustering

Stemmer, Uri

arXiv.org Machine LearningJul-4-2019

We design a new algorithm for the Euclidean $k$-means problem that operates in the local model of differential privacy. Unlike in the non-private literature, differentially private algorithms for the $k$-means incur both additive and multiplicative errors. Our algorithm significantly reduces the additive error while keeping the multiplicative error the same as in previous state-of-the-art results. Specifically, on a database of size $n$, our algorithm guarantees $O(1)$ multiplicative error and $\approx n^{1/2+a}$ additive error for an arbitrarily small constant $a$, whereas all previous algorithms in the local model on had additive error $\approx n^{2/3+a}$. We give a simple lower bound showing that additive error of $\approx\sqrt{n}$ is necessary for $k$-means algorithms in the local model (at least for algorithms with a constant number of interaction rounds, which is the setting we consider in this paper).

artificial intelligence, machine learning, probability, (17 more...)

arXiv.org Machine Learning

1907.02513

Country:

North America > United States > California (0.46)
North America > United States > New York (0.28)

Genre: Research Report (0.40)

Industry: Information Technology > Security & Privacy (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.50)

Add feedback

Clustering of Medical Free-Text Records Based on Word Embeddings

Dobrakowski, Adam Gabriel, Mykowiecka, Agnieszka, Marciniak, Małgorzata, Jaworski, Wojciech, Biecek, Przemysław

arXiv.org Machine LearningJul-3-2019

Is it true that patients with similar conditions get similar diagnoses? In this paper we show NLP methods and a unique corpus of documents to validate this claim. We (1) introduce a method for representation of medical visits based on free-text descriptions recorded by doctors, (2) introduce a new method for clustering of patients' visits and (3) present an~application of the proposed method on a corpus of 100,000 visits. With the proposed method we obtained stable and separated segments of visits which were positively validated against final medical diagnoses. We show how the presented algorithm may be used to aid doctors during their practice.

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Machine Learning

1907.04152

Country: Europe > Poland (0.14)

Genre:

Research Report (0.40)
Personal > Interview (0.34)

Industry:

Health & Medicine > Therapeutic Area (1.00)
Health & Medicine > Health Care Technology > Medical Record (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.47)

Add feedback

Learning graph-structured data using Poincar\'e embeddings and Riemannian K-means algorithms

Hajri, Hatem, Zaatiti, Hadi, Hebrail, Georges

arXiv.org Machine LearningJul-2-2019

Recent literature has shown several benefits of hyperbolic embedding of graph-structured data (GSD) in representing their structures and latent relations. While several studies have explored the ability of hyperbolic embedding to represent data (for example, by quantifying their mean average precision) and their ability to produce better visualisations of clusters, only few works exploited the effectiveness of hyperbolic embedding to perform learning on the initial GSD. Motivated by innovative ideas from the fields of Brain computer interfaces and Radar processing, this paper presents a new scheme for learning GSD based on hyperbolic embedding, Riemannian barycentre (i.e. Fr\'echet or geometric mean) and $K$-means algorithms as a significant tool that derives from it. The main idea is as follows. Relying on the Riemannian barycentre, we define a notion of minimal variance which allows us to choose an embedding between different ones. This embedding is used thereafter together with $K$-means algorithms to perform unsupervised clustering and in combination with the nearest neighbour rule to perform supervised learning. We demonstrate the performance of the proposed framework through several experiments on real-world social networks and hierarchical GSD. The obtained results outperform their counterparts in high-dimensional Euclidean spaces and recent proposed geometric approaches.

algorithm, artificial intelligence, machine learning, (17 more...)

arXiv.org Machine Learning

1907.01662

Country:

North America > United States > New York > New York County > New York City (0.14)
North America > United States > District of Columbia > Washington (0.04)
Europe > Sweden > Stockholm > Stockholm (0.04)
Europe > France (0.04)

Genre: Research Report (1.00)

Industry: Information Technology (0.35)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback