Clustering
Moving Towards Open Set Incremental Learning: Readily Discovering New Authors
The classification of textual data often yields important information. Most classifiers work in a closed world setting where the classifier is trained on a known corpus, and then it is tested on unseen examples that belong to one of the classes seen during training. Despite the usefulness of this design, often there is a need to classify unseen examples that do not belong to any of the classes on which the classifier was trained. This paper describes the open set scenario where unseen examples from previously unseen classes are handled while testing. This further examines a process of enhanced open set classification with a deep neural network that discovers new classes by clustering the examples identified as belonging to unknown classes, followed by a process of retraining the classifier with newly recognized classes. Through this process the model moves to an incremental learning model where it continuously finds and learns from novel classes of data that have been identified automatically. This paper also develops a new metric that measures multiple attributes of clustering open set data. Multiple experiments across two author attribution data sets demonstrate the creation an incremental model that produces excellent results.
SoulMate: Short-text author linking through Multi-aspect temporal-textual embedding
Najafipour, Saeed, Hosseini, Saeid, Hua, Wen, Kangavari, Mohammad Reza, Zhou, Xiaofang
Linking authors of short-text contents has important usages in many applications, including Named Entity Recognition (NER) and human community detection. However, certain challenges lie ahead. Firstly, the input short-text contents are noisy, ambiguous, and do not follow the grammatical rules. Secondly, traditional text mining methods fail to effectively extract concepts through words and phrases. Thirdly, the textual contents are temporally skewed, which can affect the semantic understanding by multiple time facets. Finally, using the complementary knowledge-bases makes the results biased to the content of the external database and deviates the understanding and interpretation away from the real nature of the given short text corpus. To overcome these challenges, we devise a neural network-based temporal-textual framework that generates the tightly connected author subgraphs from microblog short-text contents. Our approach, on the one hand, computes the relevance score (edge weight) between the authors through considering a portmanteau of contents and concepts, and on the other hand, employs a stack-wise graph cutting algorithm to extract the communities of the related authors. Experimental results show that compared to other knowledge-centered competitors, our multi-aspect vector space model can achieve a higher performance in linking short-text authors. Additionally, given the author linking task, the more comprehensive the dataset is, the higher the significance of the extracted concepts will be.
Explaining Machine Learning to my Grandmother - WebSystemer.no
I chose that title in order to show to all of the readers that I'm trying to explain a complex concept as easy as possible. I also want to point out that I'm no expert in Machine Learning but I have been familiar with it for over 2 years now and currently I'm enrolled at Holberton School studying software engineering. Having said that, let me introduce some of the topics we're going to review. According to Expert System, "Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it learn for themselves."
Comparison of Different Spike Sorting Subtechniques Based on Rat Brain Basolateral Amygdala Neuronal Activity
Hojjatinia, Sahar, Lagoa, Constantino M.
Developing electrophysiological recordings of brain neuronal activity and their analysis provide a basis for exploring the structure of brain function and nervous system investigation. The recorded signals are typically a combination of spikes and noise. High amounts of background noise and possibility of electric signaling recording from several neurons adjacent to the recording site have led scientists to develop neuronal signal processing tools such as spike sorting to facilitate brain data analysis. Spike sorting plays a pivotal role in understanding the electrophysiological activity of neuronal networks. This process prepares recorded data for interpretations of neurons interactions and understanding the overall structure of brain functions. Spike sorting consists of three steps: spike detection, feature extraction, and spike clustering. There are several methods to implement each of spike sorting steps. This paper provides a systematic comparison of various spike sorting sub-techniques applied to real extracellularly recorded data from a rat brain basolateral amygdala. An efficient sorted data resulted from careful choice of spike sorting sub-methods leads to better interpretation of the brain structures connectivity under different conditions, which is a very sensitive concept in diagnosis and treatment of neurological disorders. Here, spike detection is performed by appropriate choice of threshold level via three different approaches. Feature extraction is done through PCA and Kernel PCA methods, which Kernel PCA outperforms. We have applied four different algorithms for spike clustering including K-means, Fuzzy C-means, Bayesian and Fuzzy maximum likelihood estimation. As one requirement of most clustering algorithms, optimal number of clusters is achieved through validity indices for each method. Finally, the sorting results are evaluated using inter-spike interval histograms.
Unsupervised Space-Time Clustering using Persistent Homology
This paper presents a new clustering algorithm for space-time data based on the concepts of topological data analysis and in particular, persistent homology. Employing persistent homology - a flexible mathematical tool from algebraic topology used to extract topological information from data - in unsupervised learning is an uncommon and a novel approach. A notable aspect of this methodology consists in analyzing data at multiple resolutions which allows to distinguish true features from noise based on the extent of their persistence. We evaluate the performance of our algorithm on synthetic data and compare it to other well-known clustering algorithms such as K-means, hierarchical clustering and DBSCAN. We illustrate its application in the context of a case study of water quality in the Chesapeake Bay.
Characterization and Development of Average Silhouette Width Clustering
Batool, Fatima, Hennig, Christian
The purpose of this paper is to introduced a new clustering methodology. This paper is divided into three parts. In the first part we have developed the axiomatic theory for the average silhouette width (ASW) index. There are different ways to investigate the quality and characteristics of clustering methods such as validation indices using simulations and real data experiments, model-based theory, and non-model-based theory known as the axiomatic theory. In this work we have not only taken the empirical approach of validation of clustering results through simulations, but also focus on the development of the axiomatic theory. In the second part we have presented a novel clustering methodology based on the optimization of the ASW index. We have considered the problem of estimation of number of clusters and finding clustering against this number simultaneously. Two algorithms are proposed. The proposed algorithms are evaluated against several partitioning and hierarchical clustering methods. An intensive empirical comparison of the different distance metrics on the various clustering methods is conducted. In the third part we have considered two application domains\textemdash novel single cell RNA sequencing datasets and rainfall data to cluster weather stations.
Multiple Sample Clustering
The clustering algorithms that view each object data as a single sample drawn from a certain distribution, Gaussian distribution, for example, has been a hot topic for decades. Many clustering algorithms: such as k-means and spectral clustering are proposed based on the single sample assumption. However, in real life, each input object can usually be the multiple samples drawn from a certain hidden distribution. The traditional clustering algorithms cannot handle such a situation. This calls for the multiple sample clustering algorithm. But the traditional multiple sample clustering algorithms can only handle scalar samples or samples from Gaussian distribution. This constrains the application field of multiple sample clustering algorithms. In this paper, we purpose a general framework for multiple sample clustering. Various algorithms can be generated by this framework. We apply two specific cases of this framework: Wasserstein distance version and Bhattacharyya distance version on both synthetic data and stock price data. The simulation results show that the sufficient statistic can greatly improve the clustering accuracy and stability.
Regression-clustering for Improved Accuracy and Training Cost with Molecular-Orbital-Based Machine Learning
Cheng, Lixue, Kovachki, Nikola B., Welborn, Matthew, Miller, Thomas F. III
Machine learning (ML) in the representation of molecular-orbital-based (MOB) features has been shown to be an accurate and transferable approach to the prediction of post-Hartree-Fock correlation energies. Previous applications of MOB-ML employed Gaussian Process Regression (GPR), which provides good prediction accuracy with small training sets; however, the cost of GPR training scales cubically with the amount of data and becomes a computational bottleneck for large training sets. In the current work, we address this problem by introducing a clustering/regression/classification implementation of MOB-ML. In a first step, regression clustering (RC) is used to partition the training data to best fit an ensemble of linear regression (LR) models; in a second step, each cluster is regressed independently, using either LR or GPR; and in a third step, a random forest classifier (RFC) is trained for the prediction of cluster assignments based on MOB feature values. Upon inspection, RC is found to recapitulate chemically intuitive groupings of the frontier molecular orbitals, and the combined RC/LR/RFC and RC/GPR/RFC implementations of MOB-ML are found to provide good prediction accuracy with greatly reduced wall-clock training times. For a dataset of thermalized geometries of 7211 organic molecules of up to seven heavy atoms, both implementations reach chemical accuracy (1 kcal/mol error) with only 300 training molecules, while providing 35000-fold and 4500-fold reductions in the wall-clock training time, respectively, compared to MOB-ML without clustering. The resulting models are also demonstrated to retain transferability for the prediction of large-molecule energies with only small-molecule training data. Finally, it is shown that capping the number of training datapoints per cluster leads to further improvements in prediction accuracy with negligible increases in wall-clock training time.
Bootstrapping deep music separation from primitive auditory grouping principles
Seetharaman, Prem, Wichern, Gordon, Roux, Jonathan Le, Pardo, Bryan
Separating an audio scene such as a cocktail party into constituent, meaningful components is a core task in computer audition. Deep networks are the state-of-the-art approach. They are trained on synthetic mixtures of audio made from isolated sound source recordings so that ground truth for the separation is known. However, the vast majority of available audio is not isolated. The brain uses primitive cues that are independent of the characteristics of any particular sound source to perform an initial segmentation of the audio scene. We present a method for bootstrapping a deep model for music source separation without ground truth by using multiple primitive cues. We apply our method to train a network on a large set of unlabeled music recordings from YouTube to separate vocals from accompaniment without the need for ground truth isolated sources or artificial training mixtures.
Deep Clustering of Compressed Variational Embeddings
Wu, Suya, Diao, Enmao, Ding, Jie, Tarokh, Vahid
ABSTRACT Motivated by the ever-increasing demands for limited communication bandwidth and low-power consumption, we propose a new methodology, named joint V ariational Autoen-coders with Bernoulli mixture models (V AB), for performing clustering in the compressed data domain. The idea is to reduce the data dimension by V ariational Autoencoders (V AEs) and group data representations by Bernoulli mixture models (BMMs). Once jointly trained for compression and clustering, the model can be decomposed into two parts: a data vendor that encodes the raw data into compressed data, and a data consumer that classifies the received (compressed) data. To enable training using the gradient descent algorithm, we propose to use the Gumbel-Softmax distribution to resolve the infeasibility of the back-propagation algorithm when assessing categorical samples. Index T erms -- Clustering, V ariational Autoencoder (V AE), Bernoulli Mixture Model (BMM) 1. INTRODUCTION Clustering is a fundamental task with applications in medical imaging, social network analysis, bioinformatics, computer graphics, etc. Applying classical clustering methods directly to high dimensional data may be computational inefficient and suffer from instability.