true cluster
CollaborativeCausalDiscovery withAtomicInterventions
Asinterventions areexpensive(require carefully controlled experiments) andperforming multiple interventions is time-consuming, an important goal in causal discovery is to design algorithms that utilize simple (preferably, single variable) and fewer interventions [Shanmugam et al.,2015]. However, when there are latents or unobserved variables in the system, in the worst-case, it is not possible to learn the exact causal DAG without intervening on every variable at least once.
Optimal Cluster Recovery in the Labeled Stochastic Block Model
We consider the problem of community detection or clustering in the labeled Stochastic Block Model (LSBM) with a finite number $K$ of clusters of sizes linearly growing with the global population of items $n$. Every pair of items is labeled independently at random, and label $\ell$ appears with probability $p(i,j,\ell)$ between two items in clusters indexed by $i$ and $j$, respectively. The objective is to reconstruct the clusters from the observation of these random labels. Clustering under the SBM and their extensions has attracted much attention recently. Most existing work aimed at characterizing the set of parameters such that it is possible to infer clusters either positively correlated with the true clusters, or with a vanishing proportion of misclassified items, or exactly matching the true clusters. We find the set of parameters such that there exists a clustering algorithm with at most $s$ misclassified items in average under the general LSBM and for any $s=o(n)$, which solves one open problem raised in \cite{abbe2015community}. We further develop an algorithm, based on simple spectral methods, that achieves this fundamental performance limit within $O(n \mbox{polylog}(n))$ computations and without the a-priori knowledge of the model parameters.
A Geometric Approach to $k$-means
Hong, Jiazhen, Qian, Wei, Chen, Yudong, Zhang, Yuqian
\kmeans clustering is a fundamental problem in many scientific and engineering domains. The optimization problem associated with \kmeans clustering is nonconvex, for which standard algorithms are only guaranteed to find a local optimum. Leveraging the hidden structure of local solutions, we propose a general algorithmic framework for escaping undesirable local solutions and recovering the global solution or the ground truth clustering. This framework consists of iteratively alternating between two steps: (i) detect mis-specified clusters in a local solution, and (ii) improve the local solution by non-local operations. We discuss specific implementation of these steps, and elucidate how the proposed framework unifies many existing variants of \kmeans algorithms through a geometric perspective. We also present two natural variants of the proposed framework, where the initial number of clusters may be over- or under-specified. We provide theoretical justifications and extensive experiments to demonstrate the efficacy of the proposed approach.
Distributed clustering in partially overlapping feature spaces
Maritan, Alessio, Schenato, Luca
We introduce and address a novel distributed clustering problem where each participant has a private dataset containing only a subset of all available features, and some features are included in multiple datasets. This scenario occurs in many real-world applications, such as in healthcare, where different institutions have complementary data on similar patients. We propose two different algorithms suitable for solving distributed clustering problems that exhibit this type of feature space heterogeneity. The first is a federated algorithm in which participants collaboratively update a set of global centroids. The second is a one-shot algorithm in which participants share a statistical parametrization of their local clusters with the central server, who generates and merges synthetic proxy datasets. In both cases, participants perform local clustering using algorithms of their choice, which provides flexibility and personalized computational costs. Pretending that local datasets result from splitting and masking an initial centralized dataset, we identify some conditions under which the proposed algorithms are expected to converge to the optimal centralized solution. Finally, we test the practical performance of the algorithms on three public datasets.
Exponentially Consistent Nonparametric Clustering of Data Streams
Singh, Bhupender, Rajagopalan, Ananth Ram, Bhashyam, Srikrishna
In this paper, we consider nonparametric clustering of $M$ independent and identically distributed (i.i.d.) data streams generated from unknown distributions. The distributions of the $M$ data streams belong to $K$ underlying distribution clusters. Existing results on exponentially consistent nonparametric clustering algorithms, like single linkage-based (SLINK) clustering and $k$-medoids distribution clustering, assume that the maximum intra-cluster distance ($d_L$) is smaller than the minimum inter-cluster distance ($d_H$). First, in the fixed sample size (FSS) setting, we show that exponential consistency can be achieved for SLINK clustering under a less strict assumption, $d_I < d_H$, where $d_I$ is the maximum distance between any two sub-clusters of a cluster that partition the cluster. Note that $d_I < d_L$ in general. Our results show that SLINK is exponentially consistent for a larger class of problems than $k$-medoids distribution clustering. We also identify examples where $k$-medoids clustering is unable to find the true clusters, but SLINK is exponentially consistent. Then, we propose a sequential clustering algorithm, named SLINK-SEQ, based on SLINK and prove that it is also exponentially consistent. Simulation results show that the SLINK-SEQ algorithm requires fewer expected number of samples than the FSS SLINK algorithm for the same probability of error.