Goto

Collaborating Authors

 Clustering


Making AI Forget You: Data Deletion in Machine Learning

arXiv.org Machine Learning

Intense recent discussions have focused on how to provide individuals with control over when their data can and cannot be used -- the EU's Right To Be Forgotten regulation is an example of this effort. In this paper we initiate a framework studying what to do when it is no longer permissible to deploy models derivative from specific user data. In particular, we formulate the problem of how to efficiently delete individual data points from trained machine learning models. For many standard ML models, the only way to completely remove an individual's data is to retrain the whole model from scratch on the remaining data, which is often not computationally practical. We investigate algorithmic principles that enable efficient data deletion in ML. For the specific setting of k-means clustering, we propose two provably deletion efficient algorithms which achieve an average of over 100X improvement in deletion efficiency across 6 datasets, while producing clusters of comparable statistical quality to a canonical k-means++ baseline.


The Price of Interpretability

arXiv.org Machine Learning

When quantitative models are used to support decision-making on complex and important topics, understanding a model's ``reasoning'' can increase trust in its predictions, expose hidden biases, or reduce vulnerability to adversarial attacks. However, the concept of interpretability remains loosely defined and application-specific. In this paper, we introduce a mathematical framework in which machine learning models are constructed in a sequence of interpretable steps. We show that for a variety of models, a natural choice of interpretable steps recovers standard interpretability proxies (e.g., sparsity in linear models). We then generalize these proxies to yield a parametrized family of consistent measures of model interpretability. This formal definition allows us to quantify the ``price'' of interpretability, i.e., the tradeoff with predictive accuracy. We demonstrate practical algorithms to apply our framework on real and synthetic datasets.


A Multi-Stage Clustering Framework for Automotive Radar Data

arXiv.org Machine Learning

Radar sensors provide a unique method for executing environmental perception tasks towards autonomous driving. Especially their capability to perform well in adverse weather conditions often makes them superior to other sensors such as cameras or lidar. Nevertheless, the high sparsity and low dimensionality of the commonly used detection data level is a major challenge for subsequent signal processing. Therefore, the data points are often merged in order to form larger entities from which more information can be gathered. The merging process is often implemented in form of a clustering algorithm. This article describes a novel approach for first filtering out static background data before applying a twostage clustering approach. The two-stage clustering follows the same paradigm as the idea for data association itself: First, clustering what is ought to belong together in a low dimensional parameter space, then, extracting additional features from the newly created clusters in order to perform a final clustering step. Parameters are optimized for filtering and both clustering steps. All techniques are assessed both individually and as a whole in order to demonstrate their effectiveness. Final results indicate clear benefits of the first two methods and also the cluster merging process under specific circumstances.


Visualization of Emergency Department Clinical Data for Interpretable Patient Phenotyping

arXiv.org Machine Learning

Visualization of Emergency Department Clinical Data for Interpretable Patient Phenotyping null Nathan C. Hurley a,, Adrian D. Haimovich b, R. Andrew Taylor b, Bobak J. Mortazavi a a Department of Computer Science and Engineering, T exas A&M University, United States b Department of Emergency Medicine, Y ale School of Medicine, United StatesAbstract Visual summarization of clinical data collected on patients contained within the electronic health record (EHR) may enable precise and rapid triage at the time of patient presentation to an emergency department (ED). The triage process is critical in the appropriate allocation of resources and in anticipating eventual patient disposition, typically admission to the hospital or discharge home. EHR data are high-dimensional and complex, but offer the opportunity to discover and characterize underlying data-driven patient phenotypes. Data-driven phenotypes are intended to relieve reliance on weak labels like diagnosis codes and to aid in identifying populations of existing patients that are most similar to a specific patient. These phenotypes will enable improved, personalized therapeutic decision making and prognostication. In this work, we focus on the challenge of two-dimensional patient projections. A low dimensional embedding offers visual interpretability lost in higher dimensions. While linear dimensionality reduction techniques such as principal component analysis are often used towards this aim, they are insufficient to describe the variance of patient data. This linear reduction does not account for higher order, nonlinear interactions of variables. In this work, we employ the newly-described nonlinear embedding technique called uniform manifold approximation and projection (UMAP). UMAP seeks to capture both local and global structures in high-dimensional data. We then use Gaussian mixture models to identify clusters in the embedded data and use the adjusted Rand index (ARI) to establish stability in the discovery of these clusters. This technique is applied to five common clinical chief complaints from a real-world ED EHR dataset, describing the emergent properties of discovered clusters. We observe clinically-relevant cluster attributes, suggesting that visual embeddings of EHR data using nonlinear dimensionality reduction is a promising approach to reveal data-driven patient phenotypes. In the five chief complaints, we find between 2 and 6 clusters, with the peak mean pairwise ARI between subsequent training iterations to range from 0.35 to 0.74. Introduction Electronic health records (EHRs) include heterogeneous data that represent past and ongoing patient care episodes.


Hybridized Threshold Clustering for Massive Data

arXiv.org Machine Learning

As the size $n$ of datasets become massive, many commonly-used clustering algorithms (for example, $k$-means or hierarchical agglomerative clustering (HAC) require prohibitive computational cost and memory. In this paper, we propose a solution to these clustering problems by extending threshold clustering (TC) to problems of instance selection. TC is a recently developed clustering algorithm designed to partition data into many small clusters in linearithmic time (on average). Our proposed clustering method is as follows. First, TC is performed and clusters are reduced into single "prototype" points. Then, TC is applied repeatedly on these prototype points until sufficient data reduction has been obtained. Finally, a more sophisticated clustering algorithm is applied to the reduced prototype points, thereby obtaining a clustering on all $n$ data points. This entire procedure for clustering is called iterative hybridized threshold clustering (IHTC). Through simulation results and by applying our methodology on several real datasets, we show that IHTC combined with $k$-means or HAC substantially reduces the run time and memory usage of the original clustering algorithms while still preserving their performance. Additionally, IHTC helps prevent singular data points from being overfit by clustering algorithms.


Feature-Based Image Clustering and Segmentation Using Wavelets

arXiv.org Machine Learning

Pixel intensity is a widely used feature for clustering and segmentation algorithms, the resulting segmentation using only intensity values might suffer from noises and lack of spatial context information. Wavelet transform is often used for image denoising and classification. We proposed a novel method to incorporate Wavelet features in segmentation and clustering algorithms. The conventional K-means, Fuzzy c-means (FCM), and Active contour without edges (ACWE) algorithms were modified to adapt Wavelet features, leading to robust clustering/segmentation algorithms. A weighting parameter to control the weight of low-frequency sub-band information was also introduced. The new algorithms showed the capability to converge to different segmentation results based on the frequency information derived from the Wavelet sub-bands.


k is the Magic Number -- Inferring the Number of Clusters Through Nonparametric Concentration Inequalities

arXiv.org Machine Learning

Most convex and nonconvex clustering algorithms come with one crucial parameter: the $k$ in $k$-means. To this day, there is not one generally accepted way to accurately determine this parameter. Popular methods are simple yet theoretically unfounded, such as searching for an elbow in the curve of a given cost measure. In contrast, statistically founded methods often make strict assumptions over the data distribution or come with their own optimization scheme for the clustering objective. This limits either the set of applicable datasets or clustering algorithms. In this paper, we strive to determine the number of clusters by answering a simple question: given two clusters, is it likely that they jointly stem from a single distribution? To this end, we propose a bound on the probability that two clusters originate from the distribution of the unified cluster, specified only by the sample mean and variance. Our method is applicable as a simple wrapper to the result of any clustering method minimizing the objective of $k$-means, which includes Gaussian mixtures and Spectral Clustering. We focus in our experimental evaluation on an application for nonconvex clustering and demonstrate the suitability of our theoretical results. Our \textsc{SpecialK} clustering algorithm automatically determines the appropriate value for $k$, without requiring any data transformation or projection, and without assumptions on the data distribution. Additionally, it is capable to decide that the data consists of only a single cluster, which many existing algorithms cannot.


Locally Private k-Means Clustering

arXiv.org Machine Learning

We design a new algorithm for the Euclidean $k$-means problem that operates in the local model of differential privacy. Unlike in the non-private literature, differentially private algorithms for the $k$-means incur both additive and multiplicative errors. Our algorithm significantly reduces the additive error while keeping the multiplicative error the same as in previous state-of-the-art results. Specifically, on a database of size $n$, our algorithm guarantees $O(1)$ multiplicative error and $\approx n^{1/2+a}$ additive error for an arbitrarily small constant $a$, whereas all previous algorithms in the local model on had additive error $\approx n^{2/3+a}$. We give a simple lower bound showing that additive error of $\approx\sqrt{n}$ is necessary for $k$-means algorithms in the local model (at least for algorithms with a constant number of interaction rounds, which is the setting we consider in this paper).


Clustering of Medical Free-Text Records Based on Word Embeddings

arXiv.org Machine Learning

Is it true that patients with similar conditions get similar diagnoses? In this paper we show NLP methods and a unique corpus of documents to validate this claim. We (1) introduce a method for representation of medical visits based on free-text descriptions recorded by doctors, (2) introduce a new method for clustering of patients' visits and (3) present an~application of the proposed method on a corpus of 100,000 visits. With the proposed method we obtained stable and separated segments of visits which were positively validated against final medical diagnoses. We show how the presented algorithm may be used to aid doctors during their practice.


Learning graph-structured data using Poincar\'e embeddings and Riemannian K-means algorithms

arXiv.org Machine Learning

Recent literature has shown several benefits of hyperbolic embedding of graph-structured data (GSD) in representing their structures and latent relations. While several studies have explored the ability of hyperbolic embedding to represent data (for example, by quantifying their mean average precision) and their ability to produce better visualisations of clusters, only few works exploited the effectiveness of hyperbolic embedding to perform learning on the initial GSD. Motivated by innovative ideas from the fields of Brain computer interfaces and Radar processing, this paper presents a new scheme for learning GSD based on hyperbolic embedding, Riemannian barycentre (i.e. Fr\'echet or geometric mean) and $K$-means algorithms as a significant tool that derives from it. The main idea is as follows. Relying on the Riemannian barycentre, we define a notion of minimal variance which allows us to choose an embedding between different ones. This embedding is used thereafter together with $K$-means algorithms to perform unsupervised clustering and in combination with the nearest neighbour rule to perform supervised learning. We demonstrate the performance of the proposed framework through several experiments on real-world social networks and hierarchical GSD. The obtained results outperform their counterparts in high-dimensional Euclidean spaces and recent proposed geometric approaches.