Goto

Collaborating Authors

 Clustering


Attraction-Repulsion clustering with applications to fairness

arXiv.org Machine Learning

Cluster analysis or clustering is the task of dividing a set of objects in such a way that elements in the same group or cluster are more similar, according to some dissimilarity measure, than elements in different groups. To achieve this task there are two main types of algorithms: partitioning algorithms, which try to split the data into k groups that usually minimize some optimality criteria, or agglomerative algorithms, which start with single observations and merge them into clusters according to some dissimilarity measure. Such methods have been investigated in a large amount of literature, hence we refer to [12] and references therein for an overview. Clustering techniques used as unsupervised classification procedures are increasingly more influential in people's life since they are used in credit scoring, article recommendation, risk assessment, spam filtering or sentencing recommendations in courts of law, among others. Hence controlling the outcome of such procedures, in particular ensuring that some variables which should not be taken into account due to moral or legal issues are not playing a role in the classification of the observations, has become an important field of research known as fair learning. We refer to [15], [3], [1] or [9] for an overview of such legal issues and mathematical solutions to address them. For instance avoiding discrimination against sensitive characteristics such as sex, race or age can not only be achieved using the naive solution of simply ignoring such protected attribute. Indeed, if the the data at hand reflects a real world bias, machine learning algorithms can pick on this behaviour and emulate it. More precisely, suppose we have data that includes information about attributes that we know or suspect that are biased with respect to the protected class.


A Grounded Unsupervised Universal Part-of-Speech Tagger for Low-Resource Languages

arXiv.org Artificial Intelligence

Unsupervised part of speech (POS) tagging is often framed as a clustering problem, but practical taggers need to ground their clusters as well. Grounding generally requires reference labeled data, a luxury a low-resource language might not have. In this work, we describe an approach for low-resource unsupervised POS tagging that yields fully grounded output and requires no labeled training data. We find the classic method of Brown et al. (1992) clusters well in our use case and employ a decipherment-based approach to grounding. This approach presumes a sequence of cluster IDs is a'ciphertext' and seeks a POS tag-tocluster ID mapping that will reveal the POS sequence. We show intrinsically that, despite the difficulty of the task, we obtain reasonable performance across a variety of languages. We also show extrinsically that incorporating our POS tagger into a name tagger leads to stateof-the-art tagging performance in Sinhalese and Kinyarwanda, two languages with nearly no labeled POS data available. We further demonstrate our tagger's utility by incorporating Figure 1: Overview of our approach to grounded POS it into a true'zero-resource' variant of the tagging. We use an unsupervised clustering method MALOPA(Ammar et al., 2016) dependency (Section 3.2) then reduce and ground the clusters using parser model that removes the current reliance a decipherment approach informed by POS tag sequence on multilingual resources and gold POS tags data from many languages (Section 3.3).


Local Aggregation for Unsupervised Learning of Visual Embeddings

arXiv.org Artificial Intelligence

Unsupervised approaches to learning in neural networks are of substantial interest for furthering artificial intelligence, both because they would enable the training of networks without the need for large numbers of expensive annotations, and because they would be better models of the kind of general-purpose learning deployed by humans. However, unsupervised networks have long lagged behind the performance of their supervised counterparts, especially in the domain of large-scale visual recognition. Recent developments in training deep convolutional embeddings to maximize non-parametric instance separation and clustering objectives have shown promise in closing this gap. Here, we describe a method that trains an embedding function to maximize a metric of local aggregation, causing similar data instances to move together in the embedding space, while allowing dissimilar instances to separate. This aggregation metric is dynamic, allowing soft clusters of different scales to emerge. We evaluate our procedure on several large-scale visual recognition datasets, achieving state-of-the-art unsupervised transfer learning performance on object recognition in ImageNet, scene recognition in Places 205, and object detection in PASCAL VOC.


Discovering patterns of online popularity from time series

arXiv.org Machine Learning

How is popularity gained online? Is being successful strictly related to rapidly becoming viral in an online platform or is it possible to acquire popularity in a steady and disciplined fashion? What are other temporal characteristics that can unveil the popularity of online content? To answer these questions, we leverage a multi-faceted temporal analysis of the evolution of popular online contents. Here, we present dipm-SC: a multi-dimensional shape-based time-series clustering algorithm with a heuristic to find the optimal number of clusters. First, we validate the accuracy of our algorithm on synthetic datasets generated from benchmark time series models. Second, we show that dipm-SC can uncover meaningful clusters of popularity behaviors in a real-world Twitter dataset. By clustering the multidimensional time-series of the popularity of contents coupled with other domain-specific dimensions, we uncover two main patterns of popularity: bursty and steady temporal behaviors. Moreover, we find that the way popularity is gained over time has no significant impact on the final cumulative popularity.


Privacy-Preserving Hierarchical Clustering: Formal Security and Efficient Approximation

arXiv.org Artificial Intelligence

Machine Learning (ML) is widely used for predictive tasks in a number of critical applications. Recently, collaborative or federated learning is a new paradigm that enables multiple parties to jointly learn ML models on their combined datasets. Yet, in most application domains, such as healthcare and security analytics, privacy risks limit entities to individually learning local models over the sensitive datasets they own. In this work, we present the first formal study for privacy-preserving collaborative hierarchical clustering, overall featuring scalable cryptographic protocols that allow two parties to privately compute joint clusters on their combined sensitive datasets. First, we provide a formal definition that balances accuracy and privacy, and we present a provably secure protocol along with an optimized version for single linkage clustering. Second, we explore the integration of our protocol with existing approximation algorithms for hierarchical clustering, resulting in a protocol that can efficiently scale to very large datasets. Finally, we provide a prototype implementation and experimentally evaluate the feasibility and efficiency of our approach on synthetic and real datasets, with encouraging results. For example, for a dataset of one million records and 10 dimensions, our optimized privacy-preserving approximation protocol requires 35 seconds for end-to-end execution, just 896KB of communication, and achieves 97.09% accuracy.


A Quantum Annealing-Based Approach to Extreme Clustering

arXiv.org Machine Learning

There is a growing need for algorithms and techniques capable of organizing big data in an accurate and efficient manner. Clustering, or grouping dataset elements based on similarity, can be computationally expensive, especially when employed on massive datasets to divide them into a relatively large number of groups. The task of clustering what can amount to millions (billions) of data points into thousands (millions) of clusters is referred to as $\textit{extreme clustering}$. We have devised a distributed method that can be employed to efficiently solve extreme clustering problems using a quantum annealer.


CRAD: Clustering with Robust Autocuts and Depth

arXiv.org Machine Learning

Abstract--We develop a new density-based clustering algorithmclusters? The performance of CRAD is evaluated through extensive experimental studies. The number of observations in keywords-clustering, space-time processes, data depth cluster 3 is larger than that in clusters 1 and 2. The result of each algorithm is selected by searching the best clustering I. INTRODUCTION Clustering results are shown in Figure 1. Data depth methodology is a widely employed nonparametric Currently available methods such as DBCA, DBSCAN, and tool in multivariate and functional data analysis, with OPTICS, all fail to separate the cluster 1 and 2; in contrast, applications ranging from outlier detection to clustering and our new CRAD algorithm is able to detect both. Depth measures the "centrality" (or for this phenomenon is that both DBSCAN and DBCA use "outlyingness") of a given object with respect to an observed globally-defined parameters (i.e., ษ› and ฮธ, respectively) to data cloud [4], [5].


Cluster Developing 1-Bit Matrix Completion

arXiv.org Machine Learning

Matrix completion has a long-time history of usage as the core technique of recommender systems. In particular, 1-bit matrix completion, which considers the prediction as a ``Recommended'' or ``Not Recommended'' question, has proved its significance and validity in the field. However, while customers and products aggregate into interacted clusters, state-of-the-art model-based 1-bit recommender systems do not take the consideration of grouping bias. To tackle the gap, this paper introduced Group-Specific 1-bit Matrix Completion (GS1MC) by first-time consolidating group-specific effects into 1-bit recommender systems under the low-rank latent variable framework. Additionally, to empower GS1MC even when grouping information is unobtainable, Cluster Developing Matrix Completion (CDMC) was proposed by integrating the sparse subspace clustering technique into GS1MC. Namely, CDMC allows clustering users/items and to leverage their group effects into matrix completion at the same time. Experiments on synthetic and real-world data show that GS1MC outperforms the current 1-bit matrix completion methods. Meanwhile, it is compelling that CDMC can successfully capture items' genre features only based on sparse binary user-item interactive data. Notably, GS1MC provides a new insight to incorporate and evaluate the efficacy of clustering methods while CDMC can be served as a new tool to explore unrevealed social behavior or market phenomenon.


Learning Erd\H{o}s-R\'enyi Graphs under Partial Observations: Concentration or Sparsity?

arXiv.org Machine Learning

This work examines the problem of graph learning over a diffusion network when data can be collected from a limited portion of the network (partial observability). While most works in the literature rely on a degree of sparsity to provide guarantees of consistent graph recovery, our analysis moves away from this condition and includes the demanding setting of dense connectivity. We ascertain that suitable estimators of the combination matrix (i.e., the matrix that quantifies the pairwise interaction between nodes) possess an identifiability gap that enables the discrimination between connected and disconnected nodes. Fundamental conditions are established under which the subgraph of monitored nodes can be recovered, with high probability as the network size increases, through universal clustering algorithms. This claim is proved for three matrix estimators: i) the Granger estimator that adapts to the partial observability setting the solution that is optimal under full observability ; ii) the one-lag correlation matrix; and iii) the residual estimator based on the difference between two consecutive time samples. Comparison among the estimators is performed through illustrative examples that reveal how estimators that are not optimal in the full observability regime can outperform the Granger estimator in the partial observability regime. The analysis reveals that the fundamental property enabling consistent graph learning is the statistical concentration of node degrees, rather than the sparsity of connections.


Simultaneous Dimensionality and Complexity Model Selection for Spectral Graph Clustering

arXiv.org Machine Learning

Our problem of interest is to cluster vertices of a graph by identifying its underlying community structure. Among various vertex clustering approaches, spectral clustering is one of the most popular methods, because it is easy to implement while often outperforming traditional clustering algorithms. However, there are two inherent model selection problems in spectral clustering, namely estimating the embedding dimension and number of clusters. This paper attempts to address the issue by establishing a novel model selection framework specifically for vertex clustering on graphs under a stochastic block model. The first contribution is a probabilistic model which approximates the distribution of the extended spectral embedding of a graph. The model is constructed based on a theoretical result of asymptotic normality of the informative part of the embedding, and on a simulation result of limiting behavior of the redundant part of the embedding. The second contribution is a simultaneous model selection framework. In contrast with the traditional approaches, our model selection procedure estimates embedding dimension and number of clusters simultaneously. Based on our proposed distributional model, a theorem on the consistency of the estimates of model parameters is stated and proven. The theorem provides a statistical support for the validity of our method. Heuristic algorithms via the simultaneous model selection framework for vertex clustering are proposed, with good performance shown in the experiment on synthetic data and on the real application of connectome analysis.