AITopics | Clustering

Collaborating Authors

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

$k$-Median Clustering via Metric Embedding: Towards Better Initialization with Differential Privacy

Fan, Chenglin, Li, Ping, Li, Xiaoyun

arXiv.org Machine LearningJul-8-2022

When designing clustering algorithms, the choice of initial centers is crucial for the quality of the learned clusters. In this paper, we develop a new initialization scheme, called HST initialization, for the $k$-median problem in the general metric space (e.g., discrete space induced by graphs), based on the construction of metric embedding tree structure of the data. From the tree, we propose a novel and efficient search algorithm, for good initial centers that can be used subsequently for the local search algorithm. Our proposed HST initialization can produce initial centers achieving lower errors than those from another popular initialization method, $k$-median++, with comparable efficiency. The HST initialization can also be extended to the setting of differential privacy (DP) to generate private initial centers. We show that the error from applying DP local search followed by our private HST initialization improves previous results on the approximation error, and approaches the lower bound within a small factor. Experiments justify the theory and demonstrate the effectiveness of our proposed method. Our approach can also be extended to the $k$-means problem.

artificial intelligence, initialization, machine learning, (18 more...)

arXiv.org Machine Learning

2206.12895

Country:

North America > United States > Arizona > Maricopa County > Phoenix (0.04)
Oceania > Australia > New South Wales > Sydney (0.04)
North America > United States > Washington > King County > Bellevue (0.04)
(18 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.48)

Add feedback

Mitigating shortage of labeled data using clustering-based active learning with diversity exploration

Yan, Xuyang, Nazmi, Shabnam, Gebru, Biniam, Anwar, Mohd, Homaifar, Abdollah, Sarkar, Mrinmoy, Gupta, Kishor Datta

arXiv.org Artificial IntelligenceJul-6-2022

In this paper, we proposed a new clustering-based active learning framework, namely Active Learning using a Clustering-based Sampling (ALCS), to address the shortage of labeled data. ALCS employs a density-based clustering approach to explore the cluster structure from the data without requiring exhaustive parameter tuning. A bi-cluster boundary-based sample query procedure is introduced to improve the learning performance for classifying highly overlapped classes. Additionally, we developed an effective diversity exploration strategy to address the redundancy among queried samples.

active learning, clustering-based al method, learning, (12 more...)

arXiv.org Artificial Intelligence

2207.02964

Country: North America > United States > North Carolina (0.06)

Genre: Research Report (0.64)

Industry: Government (0.47)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.88)

Add feedback

Local Sample-weighted Multiple Kernel Clustering with Consensus Discriminative Graph

Li, Liang, Wang, Siwei, Liu, Xinwang, Zhu, En, Shen, Li, Li, Kenli, Li, Keqin

arXiv.org Artificial IntelligenceJul-5-2022

Multiple kernel clustering (MKC) is committed to achieving optimal information fusion from a set of base kernels. Constructing precise and local kernel matrices is proved to be of vital significance in applications since the unreliable distant-distance similarity estimation would degrade clustering per-formance. Although existing localized MKC algorithms exhibit improved performance compared to globally-designed competi-tors, most of them widely adopt KNN mechanism to localize kernel matrix by accounting for {\tau} -nearest neighbors. However, such a coarse manner follows an unreasonable strategy that the ranking importance of different neighbors is equal, which is impractical in applications. To alleviate such problems, this paper proposes a novel local sample-weighted multiple kernel clustering (LSWMKC) model. We first construct a consensus discriminative affinity graph in kernel space, revealing the latent local structures. Further, an optimal neighborhood kernel for the learned affinity graph is output with naturally sparse property and clear block diagonal structure. Moreover, LSWMKC im-plicitly optimizes adaptive weights on different neighbors with corresponding samples. Experimental results demonstrate that our LSWMKC possesses better local manifold representation and outperforms existing kernel or graph-based clustering algo-rithms. The source code of LSWMKC can be publicly accessed from https://github.com/liliangnudt/LSWMKC.

artificial intelligence, data mining, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2207.02846

Country:

North America > United States > Missouri > Jackson County > Kansas City (0.14)
North America > United States > California > San Francisco County > San Francisco (0.14)
North America > United States > Arizona > Maricopa County > Phoenix (0.04)
(15 more...)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

8 Ways You Can 'Level Up' Your Machine Learning Projects

#artificialintelligenceJul-2-2022, 14:35:55 GMT

Need to classify data or predict outcomes? Are you struggling with your machine learning (Machine Learning) project? There are various techniques that can improve the situation. Some of the eight methods discussed below will dramatically accelerate the Machine Learning process, and others will not only accelerate the process, but will also help you build better models. Not all of these techniques will be suitable for a particular project.

data analysis, dataset, learning, (13 more...)

#artificialintelligence

Industry: Education (0.33)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.31)

Add feedback

Local manifold learning and its link to domain-based physics knowledge

Zdybał, Kamila, D'Alessio, Giuseppe, Attili, Antonio, Coussement, Axel, Sutherland, James C., Parente, Alessandro

arXiv.org Machine LearningJul-1-2022

In many reacting flow systems, the thermo-chemical state-space is known or assumed to evolve close to a low-dimensional manifold (LDM). Various approaches are available to obtain those manifolds and subsequently express the original high-dimensional space with fewer parameterizing variables. Principal component analysis (PCA) is one of the dimensionality reduction methods that can be used to obtain LDMs. PCA does not make prior assumptions about the parameterizing variables and retrieves them empirically from the training data. In this paper, we show that PCA applied in local clusters of data (local PCA) is capable of detecting the intrinsic parameterization of the thermo-chemical state-space. We first demonstrate that utilizing three common combustion models of varying complexity: the Burke-Schumann model, the chemical equilibrium model and the homogeneous reactor. Parameterization of these models is known a priori which allows for benchmarking with the local PCA approach. We further extend the application of local PCA to a more challenging case of a turbulent non-premixed $n$-heptane/air jet flame for which the parameterization is no longer obvious. Our results suggest that meaningful parameterization can be obtained also for more complex datasets. We show that local PCA finds variables that can be linked to local stoichiometry, reaction progress and soot formation processes.

artificial intelligence, machine learning, null, (15 more...)

arXiv.org Machine Learning

2207.00275

Country:

North America > United States (0.46)
Europe > United Kingdom (0.28)

Genre: Research Report > New Finding (0.68)

Industry:

Energy > Oil & Gas (0.68)
Materials > Chemicals (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.47)

Add feedback

K-ARMA Models for Clustering Time Series Data

Hoare, Derek O., Matteson, David S., Wells, Martin T.

arXiv.org Machine LearningJun-30-2022

We present an approach to clustering time series data using a model-based generalization of the K-Means algorithm which we call K-Models. We prove the convergence of this general algorithm and relate it to the hard-EM algorithm for mixture modeling. We then apply our method first with an AR($p$) clustering example and show how the clustering algorithm can be made robust to outliers using a least-absolute deviations criteria. We then build our clustering algorithm up for ARMA($p,q$) models and extend this to ARIMA($p,d,q$) models. We develop a goodness of fit statistic for the models fitted to clusters based on the Ljung-Box statistic. We perform experiments with simulated data to show how the algorithm can be used for outlier detection, detecting distributional drift, and discuss the impact of initialization method on empty clusters. We also perform experiments on real data which show that our method is competitive with other existing methods for similar time series clustering tasks.

artificial intelligence, data mining, machine learning, (18 more...)

arXiv.org Machine Learning

2207.00039

Country:

North America > Trinidad and Tobago > Trinidad > Arima > Arima (0.25)
North America > Cuba (0.04)
North America > United States > Tennessee (0.04)
(4 more...)

Genre: Research Report (1.00)

Industry: Government > Regional Government > North America Government > United States Government (1.00)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Machine Learning Algorithms Cheat Sheet

#artificialintelligenceJun-24-2022, 04:31:13 GMT

Machine learning is a subfield of artificial intelligence (AI) and computer science that focuses on using data and algorithms to mimic the way people learn, progressively improving its accuracy. This way, Machine Learning is one of the most interesting methods in Computer Science these days, and it's being applied behind the scenes in products and services we consume in everyday life. In case you want to know what Machine Learning algorithms are used in different applications, or if you are a developer and you're looking for a method to use for a problem you are trying to solve, keep reading below and use these steps as a guide. Machine Learning can be divided into three different types of learning: Unsupervised Learning, Supervised Learning, and Semi-supervised Learning. Unsupervised learning uses information data that is not labeled, that way the machine should work with no guidance according to patterns, similarities, and differences. On the other hand, supervised learning has a presence of a "teacher", who is in charge of training the machine by labeling the data to work with. Next, the machine receives some examples that allow it to produce a correct outcome.

accuracy, algorithm, neural network, (15 more...)

#artificialintelligence

Industry: Health & Medicine (0.69)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Unsupervised or Indirectly Supervised Learning (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.71)

Add feedback

Wasserstein t-SNE

Bachmann, Fynn, Hennig, Philipp, Kobak, Dmitry

arXiv.org Machine LearningJun-23-2022

Scientific datasets often have hierarchical structure: for example, in surveys, individual participants (samples) might be grouped at a higher level (units) such as their geographical region. In these settings, the interest is often in exploring the structure on the unit level rather than on the sample level. Units can be compared based on the distance between their means, however this ignores the within-unit distribution of samples. Here we develop an approach for exploratory analysis of hierarchical datasets using the Wasserstein distance metric that takes into account the shapes of within-unit distributions. We use t-SNE to construct 2D embeddings of the units, based on the matrix of pairwise Wasserstein distances between them. The distance matrix can be efficiently computed by approximating each unit with a Gaussian distribution, but we also provide a scalable method to compute exact Wasserstein distances. We use synthetic data to demonstrate the effectiveness of our Wasserstein t-SNE, and apply it to data from the 2017 German parliamentary election, considering polling stations as samples and voting districts as units.

artificial intelligence, machine learning, wasserstein distance, (17 more...)

arXiv.org Machine Learning

2205.07531

Country:

Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.14)
Europe > Netherlands > South Holland > Leiden (0.05)
North America > United States > Wisconsin (0.04)
Europe > Germany > Hamburg (0.04)

Genre: Research Report > New Finding (0.46)

Industry: Government > Voting & Elections (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Add feedback

Learning to Purification for Unsupervised Person Re-identification

Lan, Long, Teng, Xiao, Zhang, Jing, Zhang, Xiang, Tao, Dacheng

arXiv.org Artificial IntelligenceJun-22-2022

Unsupervised person re-identification is a challenging and promising task in computer vision. Nowadays unsupervised person re-identification methods have achieved great progress by training with pseudo labels. However, how to purify feature and label noise is less explicitly studied in the unsupervised manner. To purify the feature, we take into account two types of additional features from different local views to enrich the feature representation. The proposed multi-view features are carefully integrated into our cluster contrast learning to leverage more discriminative cues that the global feature easily ignored and biased. To purify the label noise, we propose to take advantage of the knowledge of teacher model in an offline scheme. Specifically, we first train a teacher model from noisy pseudo labels, and then use the teacher model to guide the learning of our student model. In our setting, the student model could converge fast with the supervision of the teacher model thus reduce the interference of noisy labels as the teacher model greatly suffered. After carefully handling the noise and bias in the feature learning, our purification modules are proven to be very effective for unsupervised person re-identification. Extensive experiments on three popular person re-identification datasets demonstrate the superiority of our method. Especially, our approach achieves a state-of-the-art accuracy 85.8\% @mAP and 94.5\% @Rank-1 on the challenging Market-1501 benchmark with ResNet-50 under the fully unsupervised setting. The code will be released.

artificial intelligence, dataset, machine learning, (16 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/TIP.2023.3278860

2204.09931

Country: North America > United States (0.04)

Genre: Research Report (1.00)

Industry: Education (0.88)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)

Add feedback

Bregman Power k-Means for Clustering Exponential Family Data

Vellal, Adithya, Chakraborty, Saptarshi, Xu, Jason

arXiv.org Machine LearningJun-22-2022

Recent progress in center-based clustering algorithms combats poor local minima by implicit annealing, using a family of generalized means. These methods are variations of Lloyd's celebrated $k$-means algorithm, and are most appropriate for spherical clusters such as those arising from Gaussian data. In this paper, we bridge these algorithmic advances to classical work on hard clustering under Bregman divergences, which enjoy a bijection to exponential family distributions and are thus well-suited for clustering objects arising from a breadth of data generating mechanisms. The elegant properties of Bregman divergences allow us to maintain closed form updates in a simple and transparent algorithm, and moreover lead to new theoretical arguments for establishing finite sample bounds that relax the bounded support assumption made in the existing state of the art. Additionally, we consider thorough empirical analyses on simulated experiments and a case study on rainfall data, finding that the proposed method outperforms existing peer methods in a variety of non-Gaussian data settings.

artificial intelligence, bregman power, machine learning, (15 more...)

arXiv.org Machine Learning

2206.1086

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
North America > United States > New York (0.04)
North America > United States > Maryland > Baltimore (0.04)
(3 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback