AITopics | Clustering

Collaborating Authors

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

Clustering with Same-Cluster Queries

Neural Information Processing SystemsMar-12-2024, 14:43:55 GMT

We propose a framework for Semi-Supervised Active Clustering framework (SSAC), where the learner is allowed to interact with a domain expert, asking whether two given instances belong to the same cluster or not. We study the query and computational complexity of clustering in this framework. We consider a setting where the expert conforms to a center-based clustering with a notion of margin. We show that there is a trade off between computational complexity and query complexity; We prove that for the case of k-means clustering (i.e., when the expert conforms to a solution of k-means), having access to relatively few such queries allows efficient solutions to otherwise NP hard problems.

algorithm, query, same-cluster query, (16 more...)

Neural Information Processing Systems

Country:

North America > United States > New York (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
North America > Canada > Ontario > Waterloo Region > Waterloo (0.04)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Community Detection on Evolving Graphs

Neural Information Processing SystemsMar-12-2024, 14:00:38 GMT

Clustering is a fundamental step in many information-retrieval and data-mining applications. Detecting clusters in graphs is also a key tool for finding the community structure in social and behavioral networks. In many of these applications, the input graph evolves over time in a continual and decentralized manner, and, to maintain a good clustering, the clustering algorithm needs to repeatedly probe the graph. Furthermore, there are often limitations on the frequency of such probes, either imposed explicitly by the online platform (e.g., in the case of crawling proprietary social networks like twitter) or implicitly because of resource limitations (e.g., in the case of crawling the web). In this paper, we study a model of clustering on evolving graphs that captures this aspect of the problem. Our model is based on the classical stochastic block model, which has been used to assess rigorously the quality of various static clustering methods. In our model, the algorithm is supposed to reconstruct the planted clustering, given the ability to query for small pieces of local information about the graph, at a limited rate. We design and analyze clustering algorithms that work in this model, and show asymptotically tight upper and lower bounds on their accuracy. Finally, we perform simulations, which demonstrate that our main asymptotic results hold true also in practice.

algorithm, node, query, (13 more...)

Neural Information Processing Systems

Country:

North America > United States > New York > New York County > New York City (0.05)
North America > United States > Oregon > Multnomah County > Portland (0.04)
North America > United States > District of Columbia > Washington (0.04)
(6 more...)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.89)

Add feedback

Crowdsourced Clustering: Querying Edges vs Triangles Babak Hassibi Department of Electrical Engineering Department of Electrical Engineering Caltech, Pasadena

Neural Information Processing SystemsMar-12-2024, 13:45:28 GMT

We consider the task of clustering items using answers from non-expert crowd workers. In such cases, the workers are often not able to label the items directly, however, it is reasonable to assume that they can compare items and judge whether they are similar or not. An important question is what queries to make, and we compare two types: random edge queries, where a pair of items is revealed, and random triangles, where a triple is. Since it is far too expensive to query all possible edges and/or triangles, we need to work with partial observations subject to a fixed query budget constraint. When a generative model for the data is available (and we consider a few of these) we determine the cost of a query by its entropy; when such models do not exist we use the average response time per query of the workers as a surrogate for the cost. In addition to theoretical justification, through several simulations and experiments on two real data sets on Amazon Mechanical Turk, we empirically demonstrate that, for a fixed budget, triangle queries uniformly outperform edge queries. Even though, in contrast to edge queries, triangle queries reveal dependent edges, they provide more reliable edges and, for a fixed budget, many more of them. We also provide a sufficient condition on the number of observations, edge densities inside and outside the clusters and the minimum cluster size required for the exact recovery of the true adjacency matrix via triangle queries using a convex optimization-based clustering algorithm.

edge query, query, triangle query, (14 more...)

Neural Information Processing Systems

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > Colorado > El Paso County > Colorado Springs (0.04)
North America > United States > California (0.04)
(2 more...)

Technology:

Information Technology > Communications > Social Media > Crowdsourcing (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.49)

Add feedback

Mixed Linear Regression with Multiple Components Prateek Jain 2 University of Texas at Austin

Neural Information Processing SystemsMar-12-2024, 13:44:55 GMT

In this paper, we study the mixed linear regression (MLR) problem, where the goal is to recover multiple underlying linear models from their unlabeled linear measurements. We propose a non-convex objective function which we show is locally strongly convex in the neighborhood of the ground truth. We use a tensor method for initialization so that the initial models are in the local strong convexity region. We then employ general convex optimization algorithms to minimize the objective function. To the best of our knowledge, our approach provides first exact recovery guarantees for the MLR problem with K 2 components. Moreover, our method has near-optimal computational complexity O(Nd) e as well as near-optimal sample complexity O(d) e for constant K. Furthermore, we show that our nonconvex formulation can be extended to solving the subspace clustering problem as well. In particular, when initialized within a small constant distance to the true subspaces, our method converges to the global optima (and recovers true subspaces) in time linear in the number of points. Furthermore, our empirical results indicate that even with random initialization, our approach converges to the global optima in linear time, providing speed-up of up to two orders of magnitude.

algorithm, complexity, subspace, (13 more...)

Neural Information Processing Systems

Country:

North America > United States > Texas > Travis County > Austin (0.40)
Europe > Switzerland > Zürich > Zürich (0.04)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
Asia > India (0.04)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.73)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.66)

Add feedback

Clustering Signed Networks with the Geometric Mean of Laplacians Pedro Mercado and Matthias Hein University of Padua, Padua, Italy

Neural Information Processing SystemsMar-12-2024, 13:14:04 GMT

Signed networks allow to model positive and negative relationships. We analyze existing extensions of spectral clustering to signed networks. It turns out that existing approaches do not recover the ground truth clustering in several situations where either the positive or the negative network structures contain no noise. Our analysis shows that these problems arise as existing approaches take some form of arithmetic mean of the Laplacians of the positive and negative part. As a solution we propose to use the geometric mean of the Laplacians of positive and negative part and show that it outperforms the existing approaches. While the geometric mean of matrices is computationally expensive, we show that eigenvectors of the geometric mean can be computed efficiently, leading to a numerical scheme for sparse matrices which is of independent interest.

eigenvalue, graph, laplacian, (16 more...)

Neural Information Processing Systems

Country:

Europe > Italy (0.40)
North America > United States > Michigan (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
(2 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.47)

Add feedback

Communication Optimal Distributed Clustering

Neural Information Processing SystemsMar-12-2024, 12:46:20 GMT

Clustering large datasets is a fundamental problem with a number of applications in machine learning. Data is often collected on different sites and clustering needs to be performed in a distributed manner with low communication. We would like the quality of the clustering in the distributed setting to match that in the centralized setting for which all the data resides on a single site. In this work, we study both graph and geometric clustering problems in two distributed models: (1) a point-to-point model, and (2) a model with a broadcast channel. We give protocols in both models which we show are nearly optimal by proving almost matching communication lower bounds. Our work highlights the surprising power of a broadcast channel for clustering problems; roughly speaking, to spectrally cluster n points or n vertices in a graph distributed across s servers, for a worst-case partitioning the communication complexity in a point-to-point model is n s, while in the broadcast model it is n + s. A similar phenomenon holds for the geometric setting as well. We implement our algorithms and demonstrate this phenomenon on real life datasets, showing that our algorithms are also very efficient in practice.

algorithm, blackboard model, communication, (13 more...)

Neural Information Processing Systems

Country:

North America > United States > Indiana > Monroe County > Bloomington (0.04)
North America > United States > California > Santa Clara County > San Jose (0.04)
Europe > United Kingdom (0.04)
(3 more...)

Genre: Research Report > New Finding (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.89)

Add feedback

Learning Sparse Gaussian Graphical Models with Overlapping Blocks Su-In Lee Department of Computer Science & Engineering, University of Washington, Seattle

Neural Information Processing SystemsMar-12-2024, 12:29:53 GMT

We present a novel framework, called GRAB (GRaphical models with overlApping Blocks), to capture densely connected components in a network estimate. GRAB takes as input a data matrix of p variables and n samples and jointly learns both a network of the p variables and densely connected groups of variables (called'blocks'). GRAB has four major novelties as compared to existing network estimation methods: 1) It does not require blocks to be given a priori.

constraint, matrix, regularization parameter, (14 more...)

Neural Information Processing Systems

Country:

North America > United States > Washington > King County > Seattle (0.40)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)

Genre:

Research Report > Experimental Study (0.46)
Research Report > New Finding (0.46)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Therapeutic Area > Hematology (0.69)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Add feedback

Fast learning rates with heavy-tailed losses Lam Si Tung Ho2 Duy Nguyen

Neural Information Processing SystemsMar-12-2024, 12:00:46 GMT

We study fast learning rates when the losses are not necessarily bounded and may have a distribution with heavy tails.

assumption, bernstein, multi-scale bernstein, (16 more...)

Neural Information Processing Systems

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.14)
North America > United States > Wisconsin > Dane County > Madison (0.04)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
Asia > Vietnam (0.04)

Industry: Health & Medicine (0.68)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.48)

Add feedback

Hierarchical Clustering via Spreading Metrics Aurko Roy

Neural Information Processing SystemsMar-12-2024, 11:00:53 GMT

We study the cost function for hierarchical clusterings introduced by [16] where hierarchies are treated as first-class objects rather than deriving their cost from projections into flat clusters.

algorithm, algorithm 1, cost function, (13 more...)

Neural Information Processing Systems

Country:

Asia > Afghanistan > Parwan Province > Charikar (0.05)
North America > United States > Georgia > Fulton County > Atlanta (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

A Constant-Factor Bi-Criteria Approximation Guarantee for k-means + +

Neural Information Processing SystemsMar-12-2024, 09:47:14 GMT

This result extends the previously known O(log k) guarantee for the case β = 1 to the constant-factor bi-criteria regime. It also improves upon an existing constant-factor bi-criteria result that holds only with constant probability.

algorithm, cluster center, lemma 3, (15 more...)

Neural Information Processing Systems

Country:

Asia > Afghanistan > Parwan Province > Charikar (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.47)

Add feedback