AITopics | Clustering

Collaborating Authors

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

Community Detection via Measure Space Embedding Mark Kozdoba

Neural Information Processing SystemsMar-13-2024, 01:44:46 GMT

We present a new algorithm for community detection. The algorithm uses random walks to embed the graph in a space of measures, after which a modification of k-means in that space is applied. The algorithm is therefore fast and easily parallelizable. We evaluate the algorithm on standard random graph benchmarks, including some overlapping community benchmarks, and find its performance to be better or at least as good as previously known algorithms.

algorithm, graph, partition, (16 more...)

Neural Information Processing Systems

Country:

North America > United States (0.28)
Asia > Middle East > Israel > Haifa District > Haifa (0.04)
Asia > Middle East > Jordan (0.04)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.47)

Add feedback

Fast Distributed k-Center Clustering with Outliers on Massive Data

Neural Information Processing SystemsMar-13-2024, 01:29:55 GMT

Clustering large data is a fundamental problem with a vast number of applications. Due to the increasing size of data, practitioners interested in clustering have turned to distributed computation methods. In this work, we consider the widely used k-center clustering problem and its variant used to handle noisy data, k-center with outliers. In the noise-free setting we demonstrate how a previously-proposed distributed method is actually an O(1)-approximation algorithm, which accurately explains its strong empirical performance. Additionally, in the noisy setting, we develop a novel distributed algorithm that is also an O(1)-approximation. These algorithms are highly parallel and lend themselves to virtually any distributed computing framework. We compare each empirically against the best known sequential clustering methods and show that both distributed algorithms are consistently close to their sequential versions. The algorithms are all one can hope for in distributed settings: they are fast, memory efficient and they match their sequential counterparts.

algorithm, k-center problem, outlier, (16 more...)

Neural Information Processing Systems

Country:

Asia > Afghanistan > Parwan Province > Charikar (0.04)
North America > United States > Missouri > St. Louis County > St. Louis (0.04)
North America > United States > New York > Tompkins County > Ithaca (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Streaming Min-Max Hypergraph Partitioning

Neural Information Processing SystemsMar-13-2024, 01:00:12 GMT

In many applications, the data is of rich structure that can be represented by a hypergraph, where the data items are represented by vertices and the associations among items are represented by hyperedges. Equivalently, we are given an input bipartite graph with two types of vertices: items, and associations (which we refer to as topics). We consider the problem of partitioning the set of items into a given number of components such that the maximum number of topics covered by a component is minimized. This is a clustering problem with various applications, e.g.

algorithm, graph, partition, (16 more...)

Neural Information Processing Systems

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.34)

Add feedback

Mind the Gap: A Generative Approach to Interpretable Feature Selection and Extraction

Neural Information Processing SystemsMar-13-2024, 00:58:49 GMT

We present the Mind the Gap Model (MGM), an approach for interpretable feature extraction and selection. By placing interpretability criteria directly into the model, we allow for the model to both optimize parameters related to interpretability and to directly report a global set of distinguishable dimensions to assist with further data exploration and hypothesis generation.

dimension, mgm, selection, (13 more...)

Neural Information Processing Systems

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
Asia > Middle East > Jordan (0.05)
North America > United States > Pennsylvania (0.04)

Industry: Health & Medicine > Therapeutic Area > Neurology (0.68)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.69)

Add feedback

Log Summarisation for Defect Evolution Analysis

Dolga, Rares, Zmigrod, Ran, Silva, Rui, Alamir, Salwa, Shah, Sameena

arXiv.org Artificial IntelligenceMar-13-2024

Log analysis and monitoring are essential aspects in software maintenance and identifying defects. In particular, the temporal nature and vast size of log data leads to an interesting and important research question: How can logs be summarised and monitored over time? While this has been a fundamental topic of research in the software engineering community, work has typically focused on heuristic-, syntax-, or static-based methods. In this work, we suggest an online semantic-based clustering approach to error logs that dynamically updates the log clusters to enable monitoring code error life-cycles. We also introduce a novel metric to evaluate the performance of temporal log clusters. We test our system and evaluation metric with an industrial dataset and find that our solution outperforms similar systems. We hope that our work encourages further temporal exploration in defect datasets.

algorithm, dataset, representation, (14 more...)

arXiv.org Artificial Intelligence

doi: 10.1145/3617572.3617881

2403.08358

Country:

North America > United States > California > San Francisco County > San Francisco (0.16)
North America > United States > New York > New York County > New York City (0.05)
Europe > United Kingdom > Scotland > City of Glasgow > Glasgow (0.04)
(5 more...)

Genre: Research Report (0.89)

Industry: Information Technology > Security & Privacy (0.94)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.95)
Information Technology > Security & Privacy (0.94)

Add feedback

Taming Cross-Domain Representation Variance in Federated Prototype Learning with Heterogeneous Data Domains

Wang, Lei, Bian, Jieming, Zhang, Letian, Chen, Chen, Xu, Jie

arXiv.org Artificial IntelligenceMar-13-2024

Federated learning (FL) allows collaborative machine learning training without sharing private data. While most FL methods assume identical data domains across clients, real-world scenarios often involve heterogeneous data domains. Federated Prototype Learning (FedPL) addresses this issue, using mean feature vectors as prototypes to enhance model generalization. However, existing FedPL methods create the same number of prototypes for each client, leading to cross-domain performance gaps and disparities for clients with varied data distributions. To mitigate cross-domain feature representation variance, we introduce FedPLVM, which establishes variance-aware dual-level prototypes clustering and employs a novel $\alpha$-sparsity prototype loss. The dual-level prototypes clustering strategy creates local clustered prototypes based on private data features, then performs global prototypes clustering to reduce communication complexity and preserve local data privacy. The $\alpha$-sparsity prototype loss aligns samples from underrepresented domains, enhancing intra-class similarity and reducing inter-class similarity. Evaluations on Digit-5, Office-10, and DomainNet datasets demonstrate our method's superiority over existing approaches.

dataset, learning, prototype, (15 more...)

arXiv.org Artificial Intelligence

2403.09048

Country:

North America > United States > Tennessee (0.04)
Asia > Middle East > Israel (0.04)

Genre: Research Report > Promising Solution (0.68)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)

Add feedback

Kernel Alignment for Unsupervised Feature Selection via Matrix Factorization

Lin, Ziyuan, Needell, Deanna

arXiv.org Artificial IntelligenceMar-13-2024

By removing irrelevant and redundant features, feature selection aims to find a good representation of the original features. With the prevalence of unlabeled data, unsupervised feature selection has been proven effective in alleviating the so-called curse of dimensionality. Most existing matrix factorization-based unsupervised feature selection methods are built upon subspace learning, but they have limitations in capturing nonlinear structural information among features. It is well-known that kernel techniques can capture nonlinear structural information. In this paper, we construct a model by integrating kernel functions and kernel alignment, which can be equivalently characterized as a matrix factorization problem. However, such an extension raises another issue: the algorithm performance heavily depends on the choice of kernel, which is often unknown a priori. Therefore, we further propose a multiple kernel-based learning method. By doing so, our model can learn both linear and nonlinear similarity information and automatically generate the most appropriate kernel. Experimental analysis on real-world data demonstrates that the two proposed methods outperform other classic and state-of-the-art unsupervised feature selection methods in terms of clustering results and redundancy reduction in almost all datasets tested.

algorithm, feature selection, selection, (12 more...)

arXiv.org Artificial Intelligence

2403.14688

Country: North America > United States > California > Los Angeles County > Los Angeles (0.14)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Add feedback

Compressive spectral embedding: sidestepping the SVD

Neural Information Processing SystemsMar-12-2024, 23:29:37 GMT

Spectral embedding based on the Singular Value Decomposition (SVD) is a widely used "preprocessing" step in many learning tasks, typically leading to dimensionality reduction by projecting onto a number of dominant singular vectors and rescaling the coordinate axes (by a predefined function of the singular value). However, the number of such vectors required to capture problem structure grows with problem size, and even partial SVD computation becomes a bottleneck. In this paper, we propose a low-complexity compressive spectral embedding algorithm, which employs random projections and finite order polynomial expansions to compute approximations to SVD-based embedding. For an m n matrix with T non-zeros, its time complexity is O((T +m+n)log(m+n)), and the embedding dimension is O(log(m + n)), both of which are independent of the number of singular vectors whose effect we wish to capture. To the best of our knowledge, this is the first work to circumvent this dependence on the number of singular vectors for general SVD-based embeddings.

algorithm, approximation, matrix, (14 more...)

Neural Information Processing Systems

Country:

North America > United States > New York > New York County > New York City (0.04)
Asia > Middle East > Jordan (0.04)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Add feedback

Weighted Theta Functions and Embeddings with Applications to Max-Cut, Clustering and Summarization

Neural Information Processing SystemsMar-12-2024, 23:15:04 GMT

We introduce a unifying generalization of the Lovász theta function, and the associated geometric embedding, for graphs with weights on both nodes and edges. We show how it can be computed exactly by semidefinite programming, and how to approximate it using SVM computations. We show how the theta function can be interpreted as a measure of diversity in graphs and use this idea, and the graph embedding in algorithms for Max-Cut, correlation clustering and document summarization, all of which are well represented as problems on weighted graphs.

characterization, graph, theta function, (17 more...)

Neural Information Processing Systems

Country:

Europe > Sweden > Vaestra Goetaland > Gothenburg (0.04)
North America > United States > New York > Monroe County > Rochester (0.04)
Asia > India > Karnataka > Bengaluru (0.04)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.70)

Add feedback

Tree-Guided MCMC Inference for Normalized Random Measure Mixture Models

Neural Information Processing SystemsMar-12-2024, 22:30:08 GMT

Normalized random measures (NRMs) provide a broad class of discrete random measures that are often used as priors for Bayesian nonparametric models. Dirichlet process is a well-known example of NRMs. Most of posterior inference methods for NRM mixture models rely on MCMC methods since they are easy to implement and their convergence is well studied. However, MCMC often suffers from slow convergence when the acceptance rate is low. Tree-based inference is an alternative deterministic posterior inference method, where Bayesian hierarchical clustering (BHC) or incremental Bayesian hierarchical clustering (IBHC) have been developed for DP or NRM mixture (NRMM) models, respectively.

algorithm, sampler, tgmcmc, (17 more...)

Neural Information Processing Systems

Country:

Asia > South Korea > Gyeongsangbuk-do > Pohang (0.04)
North America > United States > Georgia > Fulton County > Atlanta (0.04)
Europe > Iceland > Capital Region > Reykjavik (0.04)
Europe > Germany > North Rhine-Westphalia > Cologne Region > Bonn (0.04)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.87)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.35)

Add feedback