Goto

Collaborating Authors

 Clustering


GPU-accelerated Faster Mean Shift with euclidean distance metrics

arXiv.org Artificial Intelligence

Handling clustering problems are important in data statistics, pattern recognition and image processing. The mean-shift algorithm, a common unsupervised algorithms, is widely used to solve clustering problems. However, the mean-shift algorithm is restricted by its huge computational resource cost. In previous research[10], we proposed a novel GPU-accelerated Faster Mean-shift algorithm, which greatly speed up the cosine-embedding clustering problem. In this study, we extend and improve the previous algorithm to handle Euclidean distance metrics. Different from conventional GPU-based mean-shift algorithms, our algorithm adopts novel Seed Selection & Early Stopping approaches, which greatly increase computing speed and reduce GPU memory consumption. In the simulation testing, when processing a 200K points clustering problem, our algorithm achieved around 3 times speedup compared to the state-of-the-art GPU-based mean-shift algorithms with optimized GPU memory consumption. Moreover, in this study, we implemented a plug-and-play model for faster mean-shift algorithm, which can be easily deployed. (Plug-and-play model is available: https://github.com/masqm/Faster-Mean-Shift-Euc)


Unsupervised Clustering Active Learning for Person Re-identification

arXiv.org Artificial Intelligence

Supervised person re-identification (re-id) approaches require a large amount of pairwise manual labeled data, which is not applicable in most real-world scenarios for re-id deployment. On the other hand, unsupervised re-id methods rely on unlabeled data to train models but performs poorly compared with supervised re-id methods. In this work, we aim to combine unsupervised re-id learning with a small number of human annotations to achieve a competitive performance. Towards this goal, we present a Unsupervised Clustering Active Learning (UCAL) re-id deep learning approach. It is capable of incrementally discovering the representative centroid-pairs and requiring human annotate them. These few labeled representative pairwise data can improve the unsupervised representation learning model with other large amounts of unlabeled data. More importantly, because the representative centroid-pairs are selected for annotation, UCAL can work with very low-cost human effort. Extensive experiments demonstrate the superiority of the proposed model over state-of-the-art active learning methods on three re-id benchmark datasets.


Application of Markov Structure of Genomes to Outlier Identification and Read Classification

arXiv.org Machine Learning

That the sequential structure of genomes is important has been known since the discovery of DNA. In this paper we employ a statistics and stochastic process perspective on triplets of successive bases to address two important applications: identifying outliers in genome databases, and classifying reads in the metagenomic context of reference-guided assembly. From this stochastic process perspective, triplets are a second-order Markov chain specified by the distribution of each base conditional on its two immediate predecessors. To be sure, studying genomes via base sequence distributions is not novel. Previous papers have addressed genome signatures (Karlin et al., 1997; Campbell et al., 1999; Takashi et al., 2003), as well as frequentist (Rosen et al., 2008) and Bayesian (Wang et al., 2007) approaches to classification problems.


On the Unreasonable Efficiency of State Space Clustering in Personalization Tasks

arXiv.org Artificial Intelligence

In this effort we consider a reinforcement learning (RL) technique for solving personalization tasks with complex reward signals. In particular, our approach is based on state space clustering with the use of a simplistic $k$-means algorithm as well as conventional choices of the network architectures and optimization algorithms. Numerical examples demonstrate the efficiency of different RL procedures and are used to illustrate that this technique accelerates the agent's ability to learn and does not restrict the agent's performance.


DBScan Clustering Algorithm

#artificialintelligence

Clustering is an important topic in busyness, because it helps us to reduce the number of features to some typology, to some clusters which, in a case that data allows us, can give us more informations about our topic of interest. In a data science literature it is usually presented as dimension reduction technique, but in science, or even in data science it could reveal some additional pattern in data that is not obvious at the first glance. Imagine you have some features about some students: their marks, their personality traits, their ability scores, their motivation. Clustering could reveal you the completely new types of (un)successful students (it could be someone with high ability and low motivation -- underachiever, but at the same time it could be someone with high motivation and really good marks, but low abilities -- overachiever). This could simply done by clustering, while our cluster names (overachiever, underachiever) are basically interpretations of the clusters.


Optimal Variable Clustering for High-Dimensional Matrix Valued Data

arXiv.org Machine Learning

Matrix valued data has become increasingly prevalent in many applications. Most of the existing clustering methods for this type of data are tailored to the mean model and do not account for the dependence structure of the features, which can be very informative, especially in high-dimensional settings. To extract the information from the dependence structure for clustering, we propose a new latent variable model for the features arranged in matrix form, with some unknown membership matrices representing the clusters for the rows and columns. Under this model, we further propose a class of hierarchical clustering algorithms using the difference of a weighted covariance matrix as the dissimilarity measure. Theoretically, we show that under mild conditions, our algorithm attains clustering consistency in the high-dimensional setting. While this consistency result holds for our algorithm with a broad class of weighted covariance matrices, the conditions for this result depend on the choice of the weight. To investigate how the weight affects the theoretical performance of our algorithm, we establish the minimax lower bound for clustering under our latent variable model. Given these results, we identify the optimal weight in the sense that using this weight guarantees our algorithm to be minimax rate-optimal in terms of the magnitude of some cluster separation metric. The practical implementation of our algorithm with the optimal weight is also discussed. Finally, we conduct simulation studies to evaluate the finite sample performance of our algorithm and apply the method to a genomic dataset.


HuBERT Explained

#artificialintelligence

The HuBERT model architecture follows the wav2vec 2.0 architecture consisting of: The number of each of these components varies between the base, large and x-large variations. Each component and its task will be better explained while explaining the training loop. The first training step consists of discovering the hidden units, and the process begins with extracting MFCCs(Mel frequency cepstrum) from the audio waveform. These are raw acoustic features useful for representing speech. Each segment of audio is then passed to the K-means clustering algorithm, and assigned to one of K clusters.


Graph-based Ensemble Machine Learning for Student Performance Prediction

arXiv.org Artificial Intelligence

Student performance prediction is a critical research problem to understand the students' needs, present proper learning opportunities/resources, and develop the teaching quality. However, traditional machine learning methods fail to produce stable and accurate prediction results. In this paper, we propose a graph-based ensemble machine learning method that aims to improve the stability of single machine learning methods via the consensus of multiple methods. To be specific, we leverage both supervised prediction methods and unsupervised clustering methods, build an iterative approach that propagates in a bipartite graph as well as converges to more stable and accurate prediction results. Extensive experiments demonstrate the effectiveness of our proposed method in predicting more accurate student performance. Specifically, our model outperforms the best traditional machine learning algorithms by up to 14.8% in prediction accuracy.


K-means clustering

#artificialintelligence

The topic I will try to explain today is K-means clustering. First, you might be wondering what the "K" means. K is a parameter that corresponds to the number of clusters you are trying to detect. For example, in order to detect 3 clusters like on the image on top, you would need to use K 3. But what does it mean?


Learning Spatio-Temporal Specifications for Dynamical Systems

arXiv.org Artificial Intelligence

Learning dynamical systems properties from data provides important insights that help us understand such systems and mitigate undesired outcomes. In this work, we propose a framework for learning spatio-temporal (ST) properties as formal logic specifications from data. We introduce SVM-STL, an extension of Signal Signal Temporal Logic (STL), capable of specifying spatial and temporal properties of a wide range of dynamical systems that exhibit time-varying spatial patterns. Our framework utilizes machine learning techniques to learn SVM-STL specifications from system executions given by sequences of spatial patterns. We present methods to deal with both labeled and unlabeled data. In addition, given system requirements in the form of SVM-STL specifications, we provide an approach for parameter synthesis to find parameters that maximize the satisfaction of such specifications. Our learning framework and parameter synthesis approach are showcased in an example of a reaction-diffusion system.