Goto

Collaborating Authors

 Clustering


Incorporating Multiple Cluster Centers for Multi-Label Learning

arXiv.org Machine Learning

Multi-label learning deals with the problem that each instance is associated with multiple labels simultaneously. Due to its ability to cope with the real-world objects with multiple semantic meanings, multi-label learning has been successfully applied in various application domains [1], such as tag recommendation [2, 3], bioinformatics [4, 5, 6], information retrieval [7, 8], rule mining [9, 10], web mining [11, 12], and so on. Formally speaking, suppose the given multi-label data set is denoted by D {x i, y i } n i 1 where x i R d is a feature vector with d dimensions (features) and y i { 1, 1} q is the corresponding label vector with the size of label space being q. Here, y ij 1 indicates that the i-th instance x i has the j-th label (or equivalently, the j-th label is a relevant label of x i), otherwise the j-th label is an irrelevant label of x i . Let X R d be the d-dimensional feature space, and Y { 1, 1} q be the q-dimensional label space, multi-label learning aims to induce a mapping function f: X Y, which is able to correctly predict the label vector of unseen instances. To solve the multi-label learning problem, the most straightforward solution is Binary Relevance (BR) [13, 14], which aims to decompose the original learning problem into a set of independent binary classification problems. However, this solution generally achieves mediocre performance, as label correlations are regrettably ignored. To ease this problem, a large number of multi-label learning approaches take into account label correlations explicitly or implicitly to improve the learning performance.


YuruGAN: Yuru-Chara Mascot Generator Using Generative Adversarial Networks With Clustering Small Dataset

arXiv.org Machine Learning

A yuru-chara is a mascot character created by local governments and companies for publicizing information on areas and products. Because it takes various costs to create a yuruchara, the utilization of machine learning techniques such as generative adversarial networks (GANs) can be expected. In recent years, it has been reported that the use of class conditions in a dataset for GANs training stabilizes learning and improves the quality of the generated images. However, it is difficult to apply class conditional GANs when the amount of original data is small and when a clear class is not given, such as a yuruchara image. In this paper, we propose a class conditional GAN based on clustering and data augmentation. Specifically, first, we performed clustering based on K-means++ on the yuru-chara image dataset and converted it into a class conditional dataset. Next, data augmentation was performed on the class conditional dataset so that the amount of data was increased five times. In addition, we built a model that incorporates ResBlock and self-attention into a network based on class conditional GAN and trained the class conditional yuru-chara dataset. As a result of evaluating the generated images, the effect on the generated images by the difference of the clustering method was confirmed.


Oversampling for Imbalanced Time Series Data

arXiv.org Machine Learning

Many important real-world applications involve time-series data with skewed distribution. Compared to conventional imbalance learning problems, the classification of imbalanced time-series data is more challenging due to high dimensionality and high inter-variable correlation. This paper proposes a structure preserving Oversampling method to combat the High-dimensional Imbalanced Time-series classification (OHIT). OHIT first leverages a density-ratio based shared nearest neighbor clustering algorithm to capture the modes of minority class in high-dimensional space. It then for each mode applies the shrinkage technique of large-dimensional covariance matrix to obtain accurate and reliable covariance structure. Finally, OHIT generates the structure-preserving synthetic samples based on multivariate Gaussian distribution by using the estimated covariance matrices. Experimental results on several publicly available time-series datasets (including unimodal and multi-modal) demonstrate the superiority of OHIT against the state-of-the-art oversampling algorithms in terms of F-value, G-mean, and AUC.


Local Model Feature Transformations

arXiv.org Machine Learning

Local learning methods are a popular class of machine learning algorithms. The basic idea for the entire cadre is to choose some non-local model family, to train many of them on small sections of neighboring data, and then to `stitch' the resulting models together in some way. Due to the limits of constraining a training dataset to a small neighborhood, research on locally-learned models has largely been restricted to simple model families. Also, since simple model families have no complex structure by design, this has limited use of the individual local models to predictive tasks. We hypothesize that, using a sufficiently complex local model family, various properties of the individual local models, such as their learned parameters, can be used as features for further learning. This dissertation improves upon the current state of research and works toward establishing this hypothesis by investigating algorithms for localization of more complex model families and by studying their applications beyond predictions as a feature extraction mechanism. We summarize this generic technique of using local models as a feature extraction step with the term ``local model feature transformations.'' In this document, we extend the local modeling paradigm to Gaussian processes, orthogonal quadric models and word embedding models, and extend the existing theory for localized linear classifiers. We then demonstrate applications of local model feature transformations to epileptic event classification from EEG readings, activity monitoring via chest accelerometry, 3D surface reconstruction, 3D point cloud segmentation, handwritten digit classification and event detection from Twitter feeds.


Augmentation of the Reconstruction Performance of Fuzzy C-Means with an Optimized Fuzzification Factor Vector

arXiv.org Artificial Intelligence

Information granules have been considered to be the fundamental constructs of Granular Computing (GrC). As a useful unsupervised learning technique, Fuzzy C-Means (FCM) is one of the most frequently used methods to construct information granules. The FCM-based granulation-degranulation mechanism plays a pivotal role in GrC. In this paper, to enhance the quality of the degranulation (reconstruction) process, we augment the FCM-based degranulation mechanism by introducing a vector of fuzzification factors (fuzzification factor vector) and setting up an adjustment mechanism to modify the prototypes and the partition matrix. The design is regarded as an optimization problem, which is guided by a reconstruction criterion. In the proposed scheme, the initial partition matrix and prototypes are generated by the FCM. Then a fuzzification factor vector is introduced to form an appropriate fuzzification factor for each cluster to build up an adjustment scheme of modifying the prototypes and the partition matrix. With the supervised learning mode of the granulation-degranulation process, we construct a composite objective function of the fuzzification factor vector, the prototypes and the partition matrix. Subsequently, the particle swarm optimization (PSO) is employed to optimize the fuzzification factor vector to refine the prototypes and develop the optimal partition matrix. Finally, the reconstruction performance of the FCM algorithm is enhanced. We offer a thorough analysis of the developed scheme. In particular, we show that the classical FCM algorithm forms a special case of the proposed scheme. Experiments completed for both synthetic and publicly available datasets show that the proposed approach outperforms the generic data reconstruction approach.


K-Means Clustering – One rule to group them all

#artificialintelligence

In machine learning, one of the frequently encountered problems is grouping similar data together. You know the income level of people and now you want to group people with similar income levels together. You want to know who are the people with low income power, the people with high or very high income power, which you think can be helpful in devising a perfect marketing strategy. You have the shopping data of customers with you and now you want to group customers with similar shopping preferences together or you being the bio student want to know which cells share similar properties based upon the data about the cells you have at your hand. All the above stated problems come under the domain of an unsupervised machine learning method called clustering. Although there are a number of clustering algorithms there but when it comes to the simplest one, the award will go to K-Means clustering.


Machine Learning Based Solutions for Security of Internet of Things (IoT): A Survey

arXiv.org Machine Learning

Over the last decade, IoT platforms have been developed into a global giant that grabs every aspect of our daily lives by advancing human life with its unaccountable smart services. Because of easy accessibility and fast-growing demand for smart devices and network, IoT is now facing more security challenges than ever before. There are existing security measures that can be applied to protect IoT. However, traditional techniques are not as efficient with the advancement booms as well as different attack types and their severeness. Thus, a strong-dynamically enhanced and up to date security system is required for next-generation IoT system. A huge technological advancement has been noticed in Machine Learning (ML) which has opened many possible research windows to address ongoing and future challenges in IoT. In order to detect attacks and identify abnormal behaviors of smart devices and networks, ML is being utilized as a powerful technology to fulfill this purpose. In this survey paper, the architecture of IoT is discussed, following a comprehensive literature review on ML approaches the importance of security of IoT in terms of different types of possible attacks. Moreover, ML-based potential solutions for IoT security has been presented and future challenges are discussed.


The Permuted Striped Block Model and its Factorization -- Algorithms with Recovery Guarantees

arXiv.org Machine Learning

We introduce a novel class of matrices which are defined by the factorization $\textbf{Y} :=\textbf{A}\textbf{X}$, where $\textbf{A}$ is an $m \times n$ wide sparse binary matrix with a fixed number $d$ nonzeros per column and $\textbf{X}$ is an $n \times N$ sparse real matrix whose columns have at most $k$ nonzeros and are $\textit{dissociated}$. Matrices defined by this factorization can be expressed as a sum of $n$ rank one sparse matrices, whose nonzero entries, under the appropriate permutations, form striped blocks - we therefore refer to them as Permuted Striped Block (PSB) matrices. We define the $\textit{PSB data model}$ as a particular distribution over this class of matrices, motivated by its implications for community detection, provable binary dictionary learning with real valued sparse coding, and blind combinatorial compressed sensing. For data matrices drawn from the PSB data model, we provide computationally efficient factorization algorithms which recover the generating factors with high probability from as few as $N =O\left(\frac{n}{k}\log^2(n)\right)$ data vectors, where $k$, $m$ and $n$ scale proportionally. Notably, these algorithms achieve optimal sample complexity up to logarithmic factors.


Learnable Subspace Clustering

arXiv.org Machine Learning

This paper studies the large-scale subspace clustering (LSSC) problem with million data points. Many popular subspace clustering methods cannot directly handle the LSSC problem although they have been considered as state-of-the-art methods for small-scale data points. A basic reason is that these methods often choose all data points as a big dictionary to build huge coding models, which results in a high time and space complexity. In this paper, we develop a learnable subspace clustering paradigm to efficiently solve the LSSC problem. The key idea is to learn a parametric function to partition the high-dimensional subspaces into their underlying low-dimensional subspaces instead of the expensive costs of the classical coding models. Moreover, we propose a unified robust predictive coding machine (RPCM) to learn the parametric function, which can be solved by an alternating minimization algorithm. In addition, we provide a bounded contraction analysis of the parametric function. To the best of our knowledge, this paper is the first work to efficiently cluster millions of data points among the subspace clustering methods. Experiments on million-scale datasets verify that our paradigm outperforms the related state-of-the-art methods in both efficiency and effectiveness.


Probabilistic embeddings for speaker diarization

arXiv.org Machine Learning

Speaker embeddings (x-vectors) extracted from very short segments of speech have recently been shown to give competitive performance in speaker diarization. We generalize this recipe by extracting from each speech segment, in parallel with the x-vector, also a diagonal precision matrix, thus providing a path for the propagation of information about the quality of the speech segment into a PLDA scoring backend. These precisions quantify the uncertainty about what the values of the embeddings might have been if they had been extracted from high quality speech segments. The proposed probabilistic embeddings (x-vectors with precisions) are interfaced with the PLDA model by treating the x-vectors as hidden variables and marginalizing them out. We apply the proposed probabilistic embeddings as input to an agglomerative hierarchical clustering (AHC) algorithm to do diarization in the DIHARD'19 evaluation set. We compute the full PLDA likelihood 'by the book' for each clustering hypothesis that is considered by AHC. We do joint discriminative training of the PLDA parameters and of the probabilistic x-vector extractor. We demonstrate accuracy gains relative to a baseline AHC algorithm, applied to traditional xvectors (without uncertainty), and which uses averaging of binary log-likelihood-ratios, rather than by-the-book scoring.